Organizations that successfully generate business value from their data will outperform their peers. These leaders are able to do new types of analytics like machine learning over new sources like log files, data from click-streams, social media, and internet connected devices stored in data lakes. This helps them to identify, and act upon opportunities for business growth faster by attracting and retaining customers, boosting productivity, proactively maintaining devices, and making informed decisions.
Given the breadth of services that AWS offers for analytics it can make the process of starting this data transformation difficult. To help this I have created a demonstration data lake for a fictional music iOS app called “Xerris Music”. In this app, users are able to stream music from all their favourite artists similar to Spotify or Apple Music. The app has started to take off now and the company wants to be able to better understand the listening and usage habits of their customers.
To accomplish this we will create a data pipeline leveraging AWS Amplify, Pinpoint and Kinesis Firehose. The pipeline will send the data to a landing S3 data lake where we will store all our raw data. We will use AWS Glue crawlers to catalog the raw data and then use a schedualed glue job that runs a spark ETL script on the data to flatten and reformat the data to parquet. This transformed data will be stored in a different bucket from which we can use other AWS tools to run queries, build dashboards and create machine learning models. In this demo we will use AWS Athena which is an ad-hoc querying engine to run SQL queries from data stored in S3. All of this will be provisioned using code with AWS CDK.
Here is a quick visual overview of the architecture we are going to build:
To start off we will get the iOS app setup with Amplify and Pinpoint. AWS Amplify is a set of tools and services that can be used together or on their own, to help front-end web and mobile developers build scalable full stack applications. AWS Pinpoint is a service that we will use for tracking user metrics. First, make sure you have the AWS Amplify CLI installed on your machine:
npm install -g @aws-amplify/cli
Then configure the CLI to your AWS account:
Then open up the iOS project and initialize Amplify and Pinpoint. Note the name you use for the Pinpoint project when you are asked, it will be used later to hook up the rest of the pipeline:
amplify add analytics
Now we need to initialize Amplify in our App Delegate file:
Now we can push up all the changes we have made and create the Amplify infrastructure in AWS:
The next step is getting to the actual app itself which looks like:
For demonstration purposes this app doesn’t actually play any music. Instead, it takes your last.fm username and pulls the listening history from their api so we can have a large set of data to work with. From their API we can get the date the track was played, the artist, as well as tags like the genre to add more dimensions to the data.
We will take these properties and creates a “SongListen” event to be stored. Pinpoint also stores events such as “SessionStart” and “SessionEnd” by default to add more data to your analytics.
Now that we have the data being sent to pinpoint it’s time to set up the rest of the pipeline to push the data to our S3 lake. We will use Kinesis Firehose for this part which is a way to reliably load near real-time streaming data into data lakes, data stores, and analytics services. It can capture, transform, and deliver streaming data to Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, generic HTTP endpoints, and service providers like Datadog, New Relic, MongoDB, and Splunk.
We will use CDK to set this part up so now we will move over to an infrastructure folder in our project and start up a new CDK stack:
Inside our newly created stack let’s start off by creating our landing bucket to store our data. Then we will create a Kinesis Firehose stream to ingest the pinpoint data to go into our landing bucket. Then finally we will create an event stream in pinpoint to stream the data into Kinesis Firehose. Included is also the various IAM roles needed for all the infrastructure to access all the other services. Note that you need to enter your Pinpoint application ID you generated with Amplify in the previous step:
Now if you deploy this stack as-is you can go into the app and test it out. Enter your Last.FM username, press play and you should see the data populate in the new landing bucket of our data lake.
Data ETL and Cataloging Infrastructure
Now that we have our data stored in the landing bucket can we start generating insights from it?
The data from pinpoint is stored in nested JSON files which are difficult to query and read. Not only that but we don’t even know what schema is available to read from. To help us with this step we are going to use AWS Glue. AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.
We will start by creating a Glue crawler to look at the landing bucket to generate an initial schema from the data. This will be used later to transform the data.
After the crawler is run we can take a look at the schema generated and see that most of it is nested inside structs:
Now we can take a look at the Spark script for the ETL of the data. It takes the data in the landing bucket, flattens it using the built-in “Relationalize” transform that Glue offers then writes the data in parquet format into our new bucket we will use for transformed data. Parquet is designed for efficient as well as performant flat columnar storage format of data compared to row based files like CSV or TSV files.
Now we can create all the infrastructure we will need to finish the transformations of our data. This first involves uploading our spark script to S3, then setting up buckets for the glue crawler temp data and the transformed lake. Then we can create the job for glue and give it all the parameters to write and read from. Finally we are creating a crawler to catalog all the new transformed data so we can query it later in Athena.
We can run the final crawler and take a look at our new schema which includes all our data:
Analyzing our Data
Finally we can do what we have been waiting for from the start. Now that we have our data streaming and transformed in real time, we can use AWS Athena to gather insights from the data. Athena works with the Glue schema that was created to help you query and see all the data that is availiable. You can try any normal SQL query on the dataset:
Let’s see what the top artists are:
Let’s see some of the top tags:
We have built out a full data lake and pipeline and now we can unlock the power of our data in ways we were not able to before. Also, because it is entirely created using code with CDK we can reproduce this in multiple environments quickly and easily.
Thank you for reading and you can find the full code here. If you want to find out more about data lakes and other AWS services feel free to get in touch with us at Xerris and we can help you craft innovative cloud focused solutions for your business.