Implementing your data warehouse in AWS
In the previous article we introduced cloud-based data storage solutions and the concept of lakehouses. Choosing a modern lakehouse approach gives you the flexibility to implement both a data lake and data warehouse using the same architecture. The data lake section of the lakehouse is often easier to plan as it is simply a dump of unstructured data waiting to be analyzed and cleaned before being used for more refined use cases. Therefore, we will focus on the data warehouse section as it requires more planning. Implementing the data lake portion should be similar, just with more flexible schema and organization requirements.
All data will be stored in cheap and scalable Amazon S3 object storage. However, if you are going to query all your S3 data for a particular value, you need a more efficient approach than just searching every file. Distributed object query engines such as the Presto-based Amazon Athena are able to avoid this problem by filtering on what are known as “partitions”. Here is an example of S3 keys with partitions:
weather_data/month=January/data1.parquet weather_data/month=January/data2.parquet weather_data/month=February/data.parquet
Query engines can use this key format to filter out keys that are not relevant to a query. For example with the following query:
SELECT SUM(rainfall) FROM weather_data WHERE month='February'
the engine can skip any key that does not match the pattern
.*month=February.* which can greatly reduce the amount of files that need to be read.
A large bottleneck in any computing system is IO, and this is especially true for distributed processing systems. To mitigate these issues in a lakehouse, special file formats are required. Parquet is a structured columnar-based binary file format that performs especially well for OLAP style queries where large numbers of specific columns are analyzed and often achieves ~90% compression rate over regular text files. Storing objects in parquet format not only reduces your overall storage footprint due to the compression ratio, but also increases your data retrieval performance due to the reduced IO.
Querying your data
Once you have written your data into parquet files in S3 with partitioned keys, you now need to query the data. This is where AWS Glue and Amazon Athena come in. Glue covers a variety of services, one of which is called Glue Data Catalog. This service provides managed metadata databases and tables that keep track of what partitions you have in S3 and what the file structure is like under each partition, similar to a Hive Metastore. For example, if the keys described above contained the columns
rainfall then this is an example of what a Glue Data Catalog table might include:
Table name: weather_data Schema: Name: date Type: timestamp Name: location Type: string Name: rainfall Type: double Partitions: Name: month=January Name: month=February
To create a metadata table you first have to create a Glue database that is used to organize groups of tables. Once you have created the database, you can either manually create a table that would contain the information above, or you can automatically create one using a Glue crawler. A crawler is a managed job that searches through S3 keys under a certain prefix and creates or updates a table with information on file schema and partitions. This can be especially useful if new partitions are being added frequently.
Once you have created a Glue database with at least one table, you can now query your data using Athena. Athena is a managed distributed query engine based on Presto that uses Glue Data Catalog metadata to query files in S3. In the Athena query editor in the AWS console you can write your query using the following general format:
SELECT [column name] FROM "[database name]"."[table name]" WHERE [partition]=[value]
Behind the scenes, this starts a Presto job in a serverless cluster with each query. Query syntax is compliant with ANSI SQL and the Presto documentation is a great resource for syntax specifics (https://prestodb.io/docs/current/).
Updating the schema
Adding or removing a field in a relational database can require a lot of work. However, we can use a straightforward combination of Glue data catalogs and S3 to achieve this in our lakehouse. By simply adding new files with the new field and rerunning the crawler you can update your metadata table with the new schema and this field will now become available to you when running your Athena query. Any files without this new field will simply not return any data when queried.
As of 2020, you can create Glue databases and tables for free if you store under a million objects, but in many cases it makes sense to manage your tables using crawlers. These are charged based on how long the crawler takes to complete. This can add up for large datasets, but with small datasets the cost can be very low. Glue Data Catalog has an API for creating and updating tables so another option is to use a Lambda to manage your tables programmatically.
When using Athena you are only charged based on how much data your query scans. Currently this is charged per TB so along with the low cost of S3 it can be really affordable to get started.
We have seen how you can create a powerful data storage solution in AWS with only a few managed services. Due to the serverless nature of S3, Glue, and Athena you can get started and experiment with very low upfront cost. In the last article we will see how we can manage this solution using infrastructure as code tools such as CloudFormation and Terraform.