Managing your resources in AWS
In the previous article we gave an overview of some of the resources you can use to set up a simple cloud-based data storage solution in AWS. In this last article we will see how we can create and manage those resources using infrastructure as code (IaC). IaC is now an industry standard and is crucial for managing your resources in a reliable and reproducible way. The resources we set up can be created and managed using a variety of IaC tools but in this article we will focus on CloudFormation and Terraform. At the end of the article we will provide an example using Terraform due to its strong open source community and multi-cloud compatibility.
CloudFormation is the core AWS IaC tool. You combine multiple resource definitions into a “stack” and the AWS console offers an intuitive dashboard that can be used to manage the status of your resources.
One downside of CloudFormation is its verbose resource definitions. As a result, tools such as AWS CDK and Serverless Framework offer simpler APIs with sane defaults that “compile” down to complex CloudFormation files.
Additionally, it is a tool specific to AWS and so does not apply to other cloud providers.
Terraform is an open source tool created and maintained by HashiCorp. Rather than define resources using JSON or YAML, it uses its own configuration language called HCL which allows for more flexibility than standard data file formats.
It supports multiple cloud providers by offering different “provider” plugins, such as AWS. The open source nature of these providers means that new features offered by cloud providers are often supported very quickly.
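As a rough illustration of how a provider plugin is declared in HCL, a minimal sketch might look like the following. The version constraint and region are assumptions chosen for the example, not values taken from the original project.

```hcl
# Declare which provider plugins this configuration depends on.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed version constraint
    }
  }
}

# Configure the AWS provider. By default, credentials are read from the
# shared files in ~/.aws/ or from environment variables.
provider "aws" {
  region = "us-east-1" # assumed region, change to your own
}
```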
Unlike CloudFormation, which has its own backend managed by AWS, Terraform uses file stores such as S3 to manage resource states remotely. This allows Terraform to be completely independent of the CloudFormation service and differentiates it from other IaC tools that use CloudFormation as a backend.
Terraform configurations are provider-specific, so if you change cloud providers you will have to rewrite your resource definitions. However, you will not need to change IaC tools, so the transition will be considerably easier.
Most of the resources required were mentioned in the previous article, but we will list each here with a brief explanation and links to the relevant CloudFormation and Terraform documentation.
S3 buckets
These are often created manually so that you are able to delete the stack without affecting your data.
Partitioned S3 files
These are created programmatically by whatever method you use to ingest data, so no infrastructure is required here.
Glue database
This resource is used to organize your Glue tables.
Glue table
This contains information about your S3 files and their partitions, and is used when querying your data through Athena. While tables can be managed through IaC, it can be easier to rely on Glue crawlers to manage this resource.
Glue crawlers
These are managed jobs that search through your data in S3, discover partitions and file schemas, and create or update your Glue tables. You can set a crawler to run on a schedule to ensure your metadata table is updated regularly.
IAM roles and policies
This is one of the most useful parts to manage with an IaC tool, especially if your stack includes additional resources such as Glue jobs that also require permissions. In our example, the Glue crawler will require a role with permissions to access S3 and create Glue tables.
This example allows you to create the Glue and S3 resources described in the previous article. The files can also be found here: https://github.com/xerris/aws-lakehouse-example
First we need to specify where we are going to store our remote state. As we are using AWS, it makes sense to use S3. To allow Terraform to connect to AWS you will need:
- an AWS user with programmatic access enabled
- credentials and config files in ~/.aws/ with the user details
To do this you first need to create a user through AWS IAM with sufficient privileges to create the required resources. Once you have downloaded the user credentials you can either create the files manually or download and install the AWS CLI tool (https://aws.amazon.com/cli/) and run aws configure.
In the end you should have files locally that look similar to the following:
We want our Terraform backend decoupled from any particular example, so we will use an existing bucket. You will need to update the following file with your own bucket name:
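A backend file of this kind could be sketched as follows. The bucket name, state key, and region are hypothetical placeholders; replace them with your own values.

```hcl
terraform {
  # Store the Terraform state remotely in an existing S3 bucket.
  backend "s3" {
    bucket = "my-terraform-state-bucket"               # hypothetical: your existing bucket
    key    = "aws-lakehouse-example/terraform.tfstate" # hypothetical state file key
    region = "us-east-1"                               # assumed region of the bucket
  }
}
```

Backend blocks cannot reference Terraform variables, which is why the bucket name has to be edited directly in the file rather than passed in like the other settings.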
We have also defined the bucket names and file prefix as variables. Note that you will have to create datalake_s3_bucket_name manually as we want to fully decouple our data from our compute resources:
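A minimal sketch of such a variables file is shown below. Only datalake_s3_bucket_name is named in the text; the other two variable names and the default prefix are assumptions made for illustration.

```hcl
# Existing bucket that holds the data. It is created manually, not by Terraform.
variable "datalake_s3_bucket_name" {
  type        = string
  description = "Name of the manually created data bucket"
}

# Hypothetical variable name: bucket that Terraform creates for Athena query results.
variable "athena_results_s3_bucket_name" {
  type        = string
  description = "Name of the bucket for Athena query output"
}

# Hypothetical variable name and default: key prefix under which the data files live.
variable "datalake_s3_prefix" {
  type        = string
  description = "Prefix in the data bucket that the crawler should scan"
  default     = "data"
}
```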
We need an S3 bucket to store our query results in, so we will create a new bucket. Files stored in this bucket can be viewed as temporary and if any results need to be kept, they can be moved to a more permanent location:
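One way to define this bucket is sketched below. The resource label and the variable var.athena_results_s3_bucket_name are hypothetical names, and the variable is assumed to be declared alongside the other bucket name variables.

```hcl
# Bucket for Athena query output. Its contents are treated as temporary.
resource "aws_s3_bucket" "athena_results" {
  bucket = var.athena_results_s3_bucket_name # hypothetical variable name

  # Assumption: allow Terraform to delete the bucket even if it still
  # contains query results, since nothing in it is meant to be permanent.
  force_destroy = true
}
```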
We will now create our Glue database, which will hold our tables, and our Glue crawler, which will create and manage our Glue tables:
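The database and crawler could be sketched roughly as follows. The database name matches the one used in the queries later in this article; the crawler name, the var.datalake_s3_prefix variable, and the aws_iam_role.glue_crawler reference are assumptions, with the role expected to be defined elsewhere in the configuration.

```hcl
# Glue database that groups the tables created by the crawler.
resource "aws_glue_catalog_database" "datalake" {
  name = "examplegluedb"
}

# Crawler that scans the data bucket and creates or updates Glue tables.
resource "aws_glue_crawler" "datalake" {
  name          = "example-datalake-crawler" # hypothetical crawler name
  database_name = aws_glue_catalog_database.datalake.name
  role          = aws_iam_role.glue_crawler.arn # assumes this role is defined elsewhere

  s3_target {
    # Assumes the prefix variable is declared with the bucket name variables.
    path = "s3://${var.datalake_s3_bucket_name}/${var.datalake_s3_prefix}"
  }

  # Optional: uncomment to run the crawler daily at 01:00 UTC instead of on demand.
  # schedule = "cron(0 1 * * ? *)"
}
```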
To be able to function, our Glue crawler needs to have enough permissions to create tables and view files in S3:
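A possible shape for these permissions is sketched below, assuming the AWS managed AWSGlueServiceRole policy is acceptable for the Glue side and that read-only access to the data bucket is enough for the crawler. The role and policy names are hypothetical.

```hcl
# Trust policy: lets the Glue service assume the role.
data "aws_iam_policy_document" "glue_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "glue_crawler" {
  name               = "example-glue-crawler-role" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.glue_assume_role.json
}

# AWS managed policy covering the Glue permissions a crawler needs,
# such as creating and updating tables.
resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue_crawler.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}

# Read-only access to the data bucket so the crawler can list and read files.
data "aws_iam_policy_document" "datalake_read" {
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::${var.datalake_s3_bucket_name}"]
  }

  statement {
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::${var.datalake_s3_bucket_name}/*"]
  }
}

resource "aws_iam_role_policy" "datalake_read" {
  name   = "example-datalake-read" # hypothetical policy name
  role   = aws_iam_role.glue_crawler.id
  policy = data.aws_iam_policy_document.datalake_read.json
}
```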
Lastly, we will create an Athena workgroup that records our query history and defines where our query output will be stored:
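A workgroup along these lines might be defined as in the sketch below. The workgroup name and output prefix are hypothetical, and aws_s3_bucket.athena_results is assumed to be the query results bucket defined elsewhere in the configuration.

```hcl
resource "aws_athena_workgroup" "example" {
  name = "example-workgroup" # hypothetical workgroup name

  # Assumption: allow the workgroup to be deleted even if it still
  # contains saved queries, so that teardown does not fail.
  force_destroy = true

  configuration {
    # Apply the settings below to every query run in this workgroup.
    enforce_workgroup_configuration = true

    result_configuration {
      # Assumes the results bucket resource is defined elsewhere.
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/output/"
    }
  }
}
```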
You can now upload objects such as CSV files to your data bucket. As an example, you could upload each of the following files as the specified keys:
Creating your resources
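If you have not yet initialized the working directory, run terraform init first so that the S3 backend and AWS provider are set up. To see what resources Terraform will create for you, you can run terraform plan. To actually create the resources you can run terraform apply.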
Querying your data
Once you have data in S3 and have run terraform apply you should be able to run your crawler. This will create a table in your Glue database that will be accessible in Athena. Switch to the Athena workgroup you created, choose your database and table from the side bar, and select the "Preview table" option to generate and run a basic select statement.
If you used the example files above, you can experiment with other queries, like:
SELECT avg(rainfall) FROM examplegluedb.test_table;
Note how we are able to do numeric aggregations as the Glue crawler inferred the column types. We can see this in more detail with this query, which shows the inferred types for each column:
Deleting your resources
One of the benefits of this architecture is that the only cost you incur while you are not using it is S3 storage. All other services are charged only when you actively use them, such as when running the Glue crawler or Athena queries.
However, once you are done with the demo you might still want to delete the resources you created. This can be done by running terraform destroy. You will need to delete your manually created S3 buckets and data separately.
This series has just scratched the surface of what is possible when using an object store like S3 as the foundation for your cloud-based data warehouse or data lake. Next steps could be to get data into your structured S3 buckets using Lambdas, Kinesis, or Glue ETL operations on new or existing data. From there, you could consider more advanced architectures such as delta lakes. Additionally, Amazon EMR provides a variety of options for customizing your solution if you outgrow managed services.