Managing your resources in AWS

In the previous article we gave an overview of some of the resources you can use to set up a simple cloud-based data storage solution in AWS. In this final article we will see how to create and manage those resources using infrastructure as code (IaC). IaC is now an industry standard and is crucial for managing your resources in a reliable and reproducible way. The resources we set up can be created and managed using a variety of IaC tools, but in this article we will focus on CloudFormation and Terraform. At the end of the article we will walk through a full example using Terraform, chosen for its strong open source community and multi-cloud compatibility.

CloudFormation

CloudFormation is the core AWS IaC tool. You define resources in JSON or YAML templates and combine them into a “stack”, and the AWS console offers an intuitive dashboard that can be used to manage the status of your resources.
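
To give a feel for the format, here is a minimal sketch of a template defining a single S3 bucket (the resource and bucket names are purely illustrative):

AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal example stack containing a single S3 bucket
Resources:
  ExampleDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: example-data-bucket
template.yaml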

One of the downsides of CloudFormation is how verbose its resource definitions become, and as a result there are multiple tools, such as AWS CDK and the Serverless Framework, that offer simpler APIs with sane defaults that “compile” down to complex CloudFormation templates.

Additionally, it is a tool specific to AWS and so does not apply to other cloud providers.

Terraform

Terraform is an open source tool created and maintained by HashiCorp. Rather than defining resources in JSON or YAML, it uses its own configuration language, HCL (HashiCorp Configuration Language), which allows for more flexibility than standard data file formats.
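
As a small illustration of that flexibility (the names here are purely illustrative), HCL supports local values, expressions, and string interpolation directly in the configuration:

locals {
  environment = "dev"

  # Expressions and string interpolation are part of the language itself,
  # something plain JSON or YAML cannot offer.
  bucket_name = "datalake-${local.environment}"
}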

It supports multiple cloud providers by offering different “provider” plugins, such as AWS. The open source nature of these providers means that new features offered by cloud providers are often supported very quickly.

Unlike CloudFormation, which has its own backend managed by AWS, Terraform uses file stores such as S3 to manage resource states remotely. This allows Terraform to be completely independent of the CloudFormation service and differentiates it from other IaC tools that use CloudFormation as a backend.

Terraform configurations are provider specific so if you change cloud providers you will have to rewrite your resource definitions. However, you will not need to change IaC tools so the transition will be considerably easier.

Resources

Most of the resources required were mentioned in the previous article, but we will list each one here with a brief explanation and links to the relevant CloudFormation and Terraform documentation.

S3 bucket

These are often created manually so that you are able to delete the stack without affecting your data.

CloudFormation: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-s3-bucket.html

Terraform: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket

Partitioned S3 files

These are created programmatically by whatever process you use to ingest data, so no infrastructure is required here.

Glue database

This resource is used to organize your Glue tables.

CloudFormation: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-glue-database.html

Terraform: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_catalog_database

Glue table

This contains information about your S3 files and their partitions, and is used when querying your data through Athena. While these can be managed through IaC, it can be easier to rely on Glue crawlers to manage this resource.

CloudFormation: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-glue-table.html

Terraform: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_catalog_table

Glue crawler

These are managed jobs that search through your data in S3, discover partitions and file schemas, and create or update your Glue tables. You can set a crawler to run on a schedule to ensure your metadata table is updated regularly; a sketch of this is shown after the links below.

CloudFormation: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-glue-crawler.html

Terraform: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_crawler
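
The crawler in the example later in this article is run manually, but adding a schedule only takes one extra argument. As a sketch (the cron expression, which runs the crawler daily at 06:00 UTC, is illustrative):

resource "aws_glue_crawler" "scheduled_crawler" {
  database_name = aws_glue_catalog_database.mydb.name
  name          = "scheduled_crawler"
  role          = aws_iam_role.test_role.arn

  # AWS cron expressions use six fields: minute, hour, day of month,
  # month, day of week, and year.
  schedule = "cron(0 6 * * ? *)"

  s3_target {
    path = "s3://${var.datalake_s3_bucket_name}/${var.datalake_data_prefix}/"
  }
}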

IAM roles and policies

This is one of the most useful parts to manage with an IaC tool, especially if your stack includes additional resources such as Glue jobs that also require permissions. In our example, the Glue crawler will require a role with permissions to access S3 and create Glue tables.

CloudFormation: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_IAM.html

Terraform: https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role

Terraform example

This example allows you to create the Glue and S3 resources described in the previous article. The files can also be found here: https://github.com/xerris/aws-lakehouse-example

First we need to specify where we are going to store our remote state. As we are using AWS, it makes sense to use S3. To allow Terraform to connect to AWS you will need:

  • an AWS user with programmatic access enabled
  • a credentials and config file under ~/.aws/ with the user details

To do this you first need to create a user through AWS IAM with sufficient privileges to create the required resources. Once you have downloaded the user credentials you can either manually create the files or you can download and install the AWS CLI tool (https://aws.amazon.com/cli/) and run aws configure.

In the end you should have files locally that look similar to the following:

[default]
region = us-east-1
output = json
~/.aws/config

[default]
aws_access_key_id = [your id]
aws_secret_access_key = [your secret key]
~/.aws/credentials

We want our Terraform backend decoupled from any particular example, so we will use an existing bucket. You will need to update the following file with your own bucket name:

terraform {
  backend "s3" {
    bucket = "[your bucket name here]"
    key    = "glue_athena_example/terraform.tf_state"
    region = "us-east-1"
  }
}

provider "aws" {
  profile = "default"
  region  = "us-east-1"
}
main.tf

We have also defined the bucket names and file prefix as variables. Note that you will have to create the bucket referenced by datalake_s3_bucket_name manually, as we want to fully decouple our data from our compute resources:

variable "datalake_s3_bucket_name" {
  type = string
}

variable "query_results_s3_bucket_name" {
  type = string
}

variable "datalake_data_prefix" {
  type = string
}
variables.tf
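
Terraform will prompt for these values on every plan or apply. To avoid retyping them, you can put your own values in a terraform.tfvars file, which Terraform loads automatically (the values below are illustrative):

datalake_s3_bucket_name      = "your-bucket"
query_results_s3_bucket_name = "your-query-results-bucket"
datalake_data_prefix         = "weather_data"
terraform.tfvars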

We need an S3 bucket to store our query results in, so we will create a new one. Files stored in this bucket can be treated as temporary; if any results need to be kept, they can be moved to a more permanent location:

resource "aws_s3_bucket" "athena-results" {
  bucket        = var.query_results_s3_bucket_name
  acl           = "private"
  force_destroy = true
}
s3.tf

We will now create our Glue database, which will hold our tables, and our Glue crawler, which will create and manage our Glue tables:

resource "aws_glue_catalog_database" "mydb" {
  name = "examplegluedb"
}

resource "aws_glue_crawler" "test_crawler" {
  database_name = aws_glue_catalog_database.mydb.name
  name          = "test_crawler"
  role          = aws_iam_role.test_role.arn

  s3_target {
    path = "s3://${var.datalake_s3_bucket_name}/${var.datalake_data_prefix}/"
  }
}
glue.tf

To function, our Glue crawler needs enough permissions to create tables and view files in S3:

resource "aws_iam_role" "test_role" {
  name               = "test_role"
  assume_role_policy = data.aws_iam_policy_document.glue-assume-role-policy.json
}

data "aws_iam_policy_document" "glue-assume-role-policy" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }
  }
}

resource "aws_iam_policy" "extra-policy" {
  name        = "extra-policy"
  description = "A test policy"
  policy      = data.aws_iam_policy_document.extra-policy-document.json

}

data "aws_iam_policy_document" "extra-policy-document" {
  statement {
    actions = [
    "s3:GetBucketLocation", "s3:ListBucket", "s3:ListAllMyBuckets", "s3:GetBucketAcl", "s3:GetObject"]
    resources = [
      "arn:aws:s3:::${var.datalake_s3_bucket_name}",
      "arn:aws:s3:::${var.datalake_s3_bucket_name}/*"
    ]
  }
}

resource "aws_iam_role_policy_attachment" "extra-policy-attachment" {
  role       = aws_iam_role.test_role.name
  policy_arn = aws_iam_policy.extra-policy.arn
}

resource "aws_iam_role_policy_attachment" "glue-service-role-attachment" {
  role       = aws_iam_role.test_role.name
  policy_arn = data.aws_iam_policy.AWSGlueServiceRole.arn
}

data "aws_iam_policy" "AWSGlueServiceRole" {
  arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}
iam.tf

Lastly, we will create an Athena workgroup that records our query history and defines where our query output will be stored:

resource "aws_athena_workgroup" "example-workgroup" {
  name          = "query_workgroup"
  force_destroy = true

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena-results.bucket}/query-results/"
    }
  }
}
athena.tf
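
Although not part of the example above, it can be handy to add a Terraform output so that useful values such as the workgroup name are printed after each apply:

output "athena_workgroup_name" {
  value = aws_athena_workgroup.example-workgroup.name
}
outputs.tf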

Example files

You can now upload objects such as CSV files to your data bucket. As an example, you could upload each of the following files to the specified keys:

date,rainfall
2020-01-01,0.8
2020-01-02,1.7
s3://your-bucket/weather_data/month=January/data.csv

date,rainfall
2020-02-01,3.1
2020-02-02,0.5
s3://your-bucket/weather_data/month=February/data.csv
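
Assuming the AWS CLI configured earlier, you could upload them with aws s3 cp (the local file names are illustrative):

aws s3 cp january.csv "s3://your-bucket/weather_data/month=January/data.csv"
aws s3 cp february.csv "s3://your-bucket/weather_data/month=February/data.csv"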

Creating your resources

Before the first run you will need to run terraform init, which initializes the S3 backend and downloads the AWS provider plugin. After that, to see what resources Terraform will create for you, you can run terraform plan. To actually create the resources you can run terraform apply.
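
Putting it together, the full first run from the directory containing the .tf files looks like this:

terraform init
terraform plan
terraform apply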

Querying your data

Once you have data in S3 and have run terraform apply, you should be able to run your crawler. This will create a table in your Glue database that will be accessible in Athena. Switch to the Athena workgroup you created, choose your database and table from the sidebar, and select the "Preview table" option to generate and run a basic select statement.
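
If you prefer the command line to the console, the crawler can also be started with the AWS CLI, using the crawler name from glue.tf:

aws glue start-crawler --name test_crawler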

If you used the example files above, you can experiment with other queries, like:

SELECT avg(rainfall) FROM examplegluedb.test_table limit 10;

Note how we are able to do numeric aggregations as the Glue crawler inferred the column types. We can see this in more detail with this query, which shows the inferred types for each column:

DESCRIBE examplegluedb.test_table;

Deleting your resources

One of the benefits of this architecture is that the only ongoing cost while you are not using it is S3 storage. All other services are charged only when you actively use them, such as when running the Glue crawler or running Athena queries.

However, once you are done with the demo you might still want to delete the resources you created. This can be done by running terraform destroy. You will need to delete your manually created S3 buckets and data separately.

Conclusion

This series has just scratched the surface of what is possible when using an object store like S3 as the foundation for your cloud-based data warehouse or data lake. Next steps could be to get data into your structured S3 buckets using Lambda functions, Kinesis, or Glue ETL jobs on new or existing data. From there, you could consider more advanced architectures such as delta lakes. Additionally, Amazon EMR provides a variety of options for customizing your solution if you outgrow managed services.

View Part 1 here: https://www.xerris.com/insights/building-modern-data-warehouses-with-s3-glue-and-athena-part-1/

View Part 2 here: https://www.xerris.com/insights/building-modern-data-warehouses-with-s3-glue-and-athena-part-2/