Amazon S3 Tables using Glue Jobs and Terraform
Enable managed Iceberg tables in AWS and ingest data with Glue Jobs
In this “unofficial” series showcasing Amazon’s Data Stack, we will introduce Amazon S3 Tables. Amazon S3 Tables is a new AWS feature that makes Apache Iceberg-based tables natively available in Amazon S3—without needing to manage Iceberg yourself. So far, we have covered Amazon Athena, how to query data such as CUR reports, and how to use Glue Jobs. We will build on that knowledge to take a short dive into S3 Tables.
What we’ll see:
How to provision S3 Tables with Terraform, including table buckets, namespaces, and Iceberg tables.
How to set up a workflow with Step Functions and Glue Jobs to process CSV files and ingest the data into our Iceberg tables.
How to manage access with Lake Formation (basics).
How to query our Iceberg tables from Athena, which makes provisioning Lake Formation mandatory🥲
We will not cover S3 Tables table maintenance in this post. You can read more about table maintenance here.
This is a Level 300 post; following along and deploying the infrastructure to your AWS account will cost approximately $3.50, assuming you run the Glue Job around 10 times and keep around 1 GB of files in the S3 buckets. You can follow along with the code in my repo.
What are S3 Tables
Amazon S3 Tables is a purpose-built storage solution within Amazon S3, explicitly optimized for analytics workloads that use tabular data. Unlike general-purpose S3 buckets, S3 Tables introduces a new bucket type—table buckets—designed to efficiently store and manage data in a table-like structure (rows and columns), such as transaction logs, sensor data, or event streams.
S3 Tables natively supports the Apache Iceberg format, enabling advanced features like schema evolution, partition evolution, and time travel queries. This means you can query your data using standard SQL with analytics engines that support Iceberg, such as Amazon Athena, Amazon Redshift, and Apache Spark.
Key features of S3 Tables include:
Optimized performance for high-throughput analytics queries.
Automated table maintenance (compaction, snapshot management, and cleanup of unused files).
Fine-grained access control using IAM and Lake Formation.
Seamless integration with AWS analytics services, allowing easy discovery and querying of your tabular data.
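To make “time travel” a bit more concrete, here is a hedged sketch of what inspecting snapshots and querying an older version looks like in Spark SQL. The catalog, namespace, and table names are placeholders, and it assumes a Spark session with an Iceberg catalog called s3tables, like the one we configure later in this post.
# Minimal sketch, assuming an existing Spark session with an Iceberg catalog
# named "s3tables" and a table my_namespace.my_table with a few snapshots.

# List the snapshots Iceberg keeps for the table.
spark.sql(
    "SELECT snapshot_id, committed_at, operation "
    "FROM s3tables.my_namespace.my_table.snapshots"
).show()

# Time travel: read the table as of a specific snapshot id.
spark.sql(
    "SELECT * FROM s3tables.my_namespace.my_table VERSION AS OF 1234567890123456789"
).show(5)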
What You’ll Learn
In this post, you’ll build an automated pipeline that processes CSV files into Iceberg tables using Step Functions, Lambda, Glue Jobs, and, most importantly, S3 Tables.
Key takeaways:
Lake Formation is mandatory when you want to expose your data to Athena and other analytics services within AWS.
Unfortunately, there will be some manual work in the Console regarding Lake Formation.
There is no need to enable Lake Formation if the Iceberg tables are only used by your Spark jobs.
Setting the Stage - Terraform
As you can see in the diagram above, we will set up a landing bucket for our data. When we upload a CSV file, a Lambda is triggered; this Lambda starts a Step Function (this is there in case you want to expand the functionality later) that runs the Glue Job to process the CSV file. The data lands in the S3 table, and with the proper access we can query it from Athena as well.
Terraform
In the Terraform code below, I will omit the IAM statements and the pieces we have covered multiple times in previous posts, and focus on the essentials of our pipeline. You can find the full code here.
As always, we will start with the S3 buckets defined in s3.tf. We create two buckets: one for landing our CSV files, named raw_data_bucket, and one for the Glue Job artifacts.
module "artifacts_bucket" {
source = "terraform-aws-modules/s3-bucket/aws"
version = "~> 4.8"
bucket = "${local.project_name}-glue-artifacts-${local.environment}"
force_destroy = true
acl = "private"
# Add ownership controls
control_object_ownership = true
object_ownership = "ObjectWriter"
tags = local.tags
}
module "raw_data_bucket" {
source = "terraform-aws-modules/s3-bucket/aws"
version = "~> 4.10"
bucket = local.raw_data_bucket_name
force_destroy = true
acl = "private"
# Add ownership controls
control_object_ownership = true
object_ownership = "ObjectWriter"
tags = local.tags
}
Now let’s create the Glue Job, which will take the file from the raw bucket and convert it to Iceberg format. File: glue.tf
# Upload the S3 Tables Iceberg connector JAR
resource "aws_s3_object" "s3_tables_connector" {
  bucket = module.artifacts_bucket.s3_bucket_id
  key    = "jars/s3-tables-catalog-for-iceberg-runtime-0.1.5.jar"
  source = "${path.module}/jars/s3-tables-catalog-for-iceberg-runtime-0.1.5.jar"
  etag   = filemd5("${path.module}/jars/s3-tables-catalog-for-iceberg-runtime-0.1.5.jar")
}

# Upload the Glue job script to S3
resource "aws_s3_object" "glue_job_script" {
  depends_on = [aws_s3_object.s3_tables_connector]

  bucket = module.artifacts_bucket.s3_bucket_id
  key    = "scripts/csv_to_iceberg.py"
  source = "${path.module}/scripts/csv_to_iceberg.py"
  etag   = filemd5("${path.module}/scripts/csv_to_iceberg.py")
}

# Glue job definition
resource "aws_glue_job" "csv_to_iceberg" {
  depends_on = [aws_s3_object.glue_job_script]

  name     = "${local.project_name}-csv-to-iceberg"
  role_arn = aws_iam_role.glue_job_role.arn

  command {
    name            = "glueetl"
    script_location = "s3://${module.artifacts_bucket.s3_bucket_id}/${aws_s3_object.glue_job_script.key}"
    python_version  = "3"
  }

  default_arguments = {
    "--job-language"                     = "python"
    "--job-bookmark-option"              = "job-bookmark-enable"
    "--enable-metrics"                   = "true"
    "--enable-continuous-cloudwatch-log" = "true"
    "--TempDir"                          = "s3://${module.raw_data_bucket.s3_bucket_id}/temp/"
    "--extra-jars"                       = "s3://${module.artifacts_bucket.s3_bucket_id}/jars/s3-tables-catalog-for-iceberg-runtime-0.1.5.jar"
  }

  execution_property {
    max_concurrent_runs = 2
  }

  glue_version      = "5.0"
  worker_type       = "G.1X"
  number_of_workers = 2
  timeout           = 15
}

resource "aws_iam_role" "glue_job_role" {
  name = "${local.project_name}-glue-job-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "glue.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue_job_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}
This is pretty much the same as what we used in our previous post (see below); the only difference is that we also upload the JAR file for managing the Iceberg tables.
Using AWS Glue Jobs with Terraform
Regarding the Step Function, it is as simple as it gets: a single state that runs the Glue Job. Feel free to extend it with more capabilities, and let me know in the comments below what you built. File: state_machine.tf
# Step Functions state machine definition
module "etl_state_machine" {
  source  = "terraform-aws-modules/step-functions/aws"
  version = "~> 4.2.1"

  name = "${local.project_name}-etl-workflow"

  attach_policy_json = true
  policy_json        = data.aws_iam_policy_document.step_functions_glue_policy.json

  definition = jsonencode({
    Comment = "ETL workflow to process CSV to Iceberg and crawl the data",
    StartAt = "StartGlueJob",
    States = {
      "StartGlueJob" = {
        Type     = "Task",
        Resource = "arn:aws:states:::glue:startJobRun.sync",
        Parameters = {
          JobName = aws_glue_job.csv_to_iceberg.name,
          Arguments = {
            "--source_s3_path.$"   = "$.source_s3_path",
            "--table_namespace.$"  = "$.table_namespace",
            "--table_name.$"       = "$.table_name",
            "--table_bucket_arn.$" = "$.table_bucket_arn"
          }
        },
        ResultPath = "$.glueJobResult",
        End        = true
      }
    }
  })
}
Now let’s put everything together: we add a notification on the raw bucket so that every CSV upload invokes a Lambda, which in turn starts the Step Function. File: lambdas.tf
# ECR Docker image for Lambda
module "docker_image" {
  source = "terraform-aws-modules/lambda/aws//modules/docker-build"

  ecr_repo      = module.ecr.repository_name
  source_path   = "${path.module}/lambdas"
  use_image_tag = true
}

module "ecr" {
  source = "terraform-aws-modules/ecr/aws"

  repository_name         = "${local.project_name}-ecr"
  repository_force_delete = true
  create_lifecycle_policy = false

  repository_lambda_read_access_arns = [module.trigger_step_function.lambda_function_arn]
}

module "trigger_step_function" {
  source  = "terraform-aws-modules/lambda/aws"
  version = "~> 7.20"

  function_name = "${local.project_name}-trigger-step-function"
  description   = "Lambda function to trigger Step Function when a file is uploaded to S3"

  create_package = false
  image_uri      = module.docker_image.image_uri
  package_type   = "Image"
  timeout        = 300
  memory_size    = 512

  environment_variables = {
    GLUE_JOB_NAME     = aws_glue_job.csv_to_iceberg.name
    STATE_MACHINE_ARN = module.etl_state_machine.state_machine_arn
    TABLE_NAMESPACE   = aws_s3tables_namespace.iceberg_namespace.namespace
    TABLE_NAME        = local.table_name
    TABLE_BUCKET_ARN  = module.s3_tables_bucket.s3_table_bucket_arn
  }

  image_config_command = ["trigger_step_function.handler"]

  attach_policies = true
  policies = [
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    aws_iam_policy.lambda_glue_access.arn,
    aws_iam_policy.lambda_step_functions_policy.arn
  ]
  number_of_policies = 3

  tags = local.tags
}

resource "aws_s3_bucket_notification" "bucket_notification" {
  bucket = module.raw_data_bucket.s3_bucket_id

  lambda_function {
    lambda_function_arn = module.trigger_step_function.lambda_function_arn
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "input/"
    filter_suffix       = ".csv"
  }

  depends_on = [aws_lambda_permission.allow_bucket]
}

resource "aws_lambda_permission" "allow_bucket" {
  statement_id  = "AllowExecutionFromS3Bucket"
  action        = "lambda:InvokeFunction"
  function_name = module.trigger_step_function.lambda_function_arn
  principal     = "s3.amazonaws.com"
  source_arn    = "arn:aws:s3:::${module.raw_data_bucket.s3_bucket_id}"
}
Again, make sure you check the files in the repo, because I am omitting the IAM resources here.
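The Lambda handler itself (trigger_step_function.py) is not shown in the post either; conceptually it just maps the S3 event to the Step Function input. Here is a minimal sketch of what such a handler could look like, assuming the environment variables defined above; the actual implementation lives in the repo.
import json
import os
import urllib.parse

import boto3

sfn = boto3.client("stepfunctions")


def handler(event, context):
    """Triggered by the S3 notification; starts the ETL state machine."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Build the input expected by the StartGlueJob state.
        sfn_input = {
            "source_s3_path": f"s3://{bucket}/{key}",
            "table_namespace": os.environ["TABLE_NAMESPACE"],
            "table_name": os.environ["TABLE_NAME"],
            "table_bucket_arn": os.environ["TABLE_BUCKET_ARN"],
        }

        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],
            input=json.dumps(sfn_input),
        )

    return {"statusCode": 200}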
Now let’s see the protagonist of our post, the S3 Table. File: main.tf
module "s3_tables_bucket" {
source = "terraform-aws-modules/s3-bucket/aws//modules/table-bucket"
version = "~> 4.10"
table_bucket_name = local.s3_tables_bucket_name
encryption_configuration = {
kms_key_arn = module.kms.key_arn
sse_algorithm = "aws:kms"
}
maintenance_configuration = {
iceberg_unreferenced_file_removal = {
status = "enabled"
settings = {
non_current_days = 7
unreferenced_days = 3
}
}
}
create_table_bucket_policy = true
table_bucket_policy = data.aws_iam_policy_document.s3_tables_bucket_policy.json
}
# S3 Tables Namespace - requires a table bucket
resource "aws_s3tables_namespace" "iceberg_namespace" {
namespace = local.namespace_name
table_bucket_arn = module.s3_tables_bucket.s3_table_bucket_arn
}
An S3 table bucket is a different kind of S3 bucket that lives under the Table buckets section of the AWS console; once you have created it, you will find it there. You can see that I added some simple maintenance (data housekeeping), but more functionality can be configured if you want to.
Since we want to query our Iceberg table with Athena, we must enable the integration with the AWS analytics services. This integration uses Lake Formation for access management and cataloging.
To convert the CSV file from the S3 bucket to Iceberg, we will use the following code (part of csv_to_iceberg.py):
# Initialize spark session with integration to analytics services
spark = SparkSession.builder.appName("SparkIcebergSQL") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.defaultCatalog", "s3tables") \
    .config("spark.sql.catalog.s3tables", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.s3tables.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.s3tables.glue.id", f"{ACCOUNT_ID}:s3tablescatalog/{TABLE_BUCKET_NAME}") \
    .config("spark.sql.catalog.s3tables.warehouse", f"s3://{TABLE_BUCKET_NAME}/warehouse/") \
    .getOrCreate()
First, we initialize the Spark session with the integration to the analytics services.
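For context, the snippet above already references ACCOUNT_ID and TABLE_BUCKET_NAME, and the next ones use SOURCE_S3_PATH, TABLE_NAMESPACE, TABLE_NAME, and glueContext. Those come from earlier parts of the full script; a hedged sketch of how they would typically be resolved in a Glue job (not the exact code from the repo) looks like this:
import sys

import boto3
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

# Job arguments passed by the Step Function (see state_machine.tf).
args = getResolvedOptions(
    sys.argv,
    ["JOB_NAME", "source_s3_path", "table_namespace", "table_name", "table_bucket_arn"],
)
SOURCE_S3_PATH = args["source_s3_path"]
TABLE_NAMESPACE = args["table_namespace"]
TABLE_NAME = args["table_name"]

# The table bucket name is the last segment of the table bucket ARN, and the
# account ID can be looked up via STS; both feed the Spark catalog config above.
TABLE_BUCKET_NAME = args["table_bucket_arn"].split("/")[-1]
ACCOUNT_ID = boto3.client("sts").get_caller_identity()["Account"]

# The GlueContext wraps the Spark context; in the full script this would sit
# right after the SparkSession is created.
glueContext = GlueContext(spark.sparkContext)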
dynamic_frame = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": [SOURCE_S3_PATH],
        "recurse": True
    },
    transformation_ctx="read_csv"
)
Then we read the CSV file from the S3 bucket into a Glue DynamicFrame.
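The next snippet operates on a Spark DataFrame called df; presumably the full script converts the DynamicFrame first, along these lines:
# Assumed step (not shown in the excerpt): convert the Glue DynamicFrame
# into a plain Spark DataFrame so we can inspect its dtypes and run SQL on it.
df = dynamic_frame.toDF()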
columns = df.dtypes
columns_sql = ", ".join([f"{slugify(name)} {dtype.upper()}" for name, dtype in columns])
table_identifier = f"{TABLE_NAMESPACE}.{TABLE_NAME}"
create_table_sql = f"""
CREATE TABLE IF NOT EXISTS {table_identifier} (
{columns_sql}
)
"""
spark.sql(create_table_sql)
Then we infer the schema of the CSV and create the table in the Table Bucket.
df.createOrReplaceTempView("temp_data_to_insert")
insert_sql = f"""
INSERT INTO {table_identifier}
SELECT * FROM temp_data_to_insert
"""
spark.sql(insert_sql)
Last but not least, we insert the data into the table.
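If you want a quick sanity check from within the job (entirely optional, not part of the original script), counting the rows now visible in the Iceberg table is enough:
# Optional sanity check: count the rows the new snapshot exposes.
row_count = spark.sql(f"SELECT COUNT(*) AS cnt FROM {table_identifier}").collect()[0]["cnt"]
print(f"{table_identifier} now holds {row_count} rows")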
Set up the environment
Now, to put everything in place, we will use Terraform. As I said at the beginning, there will be some manual steps.
First of all, run:
terraform apply
This will create the resources, but you will not be able to run the pipeline yet, since we still need to enable Lake Formation and grant the necessary access to our Glue role.
Start by enabling the integration:
Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
In the left navigation pane, choose Table buckets.
Click Enable integration.
Once you do that, your table bucket will be registered as a catalog in Lake Formation under the name s3tablescatalog. Navigate to the Lake Formation page and do the following:
Go to Permissions, then Data permissions, and click Grant.
Under Principals, choose IAM users and roles and select your Glue role.
Then select Named Data Catalog resources and:
Select your catalog (the one with the table bucket name at the end).
Select your database.
At the bottom, check Super for both Database permissions and Grantable permissions (you can narrow the scope if you want by granting only Describe, Create table, and Alter).
Do the same thing again, but this time select All tables as well.
I’ve tried to do this with Terraform, but it seems there is a bug when referencing an S3 Tables catalog.
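If you would rather script those grants than click through the console, the equivalent calls via boto3 should look roughly like the sketch below. Treat it as an assumption-heavy sketch: the account-id:s3tablescatalog/table-bucket-name catalog ID format and the ALL permission (what the console calls Super) mirror the console steps above, but I have only verified the console path, and all names are placeholders.
import boto3

lakeformation = boto3.client("lakeformation")

# Placeholders: replace with your own account, bucket, namespace, and role.
ACCOUNT_ID = "123456789012"
TABLE_BUCKET_NAME = "my-table-bucket"
NAMESPACE = "my_namespace"
GLUE_ROLE_ARN = "arn:aws:iam::123456789012:role/my-glue-job-role"

# Assumption: the federated S3 Tables catalog is addressed with this ID format.
catalog_id = f"{ACCOUNT_ID}:s3tablescatalog/{TABLE_BUCKET_NAME}"
principal = {"DataLakePrincipalIdentifier": GLUE_ROLE_ARN}

# Database-level grant (the console's "Super" maps to ALL in the API).
lakeformation.grant_permissions(
    Principal=principal,
    Resource={"Database": {"CatalogId": catalog_id, "Name": NAMESPACE}},
    Permissions=["ALL"],
    PermissionsWithGrantOption=["ALL"],
)

# Table-level grant for all tables in the namespace.
lakeformation.grant_permissions(
    Principal=principal,
    Resource={"Table": {"CatalogId": catalog_id, "DatabaseName": NAMESPACE, "TableWildcard": {}}},
    Permissions=["ALL"],
    PermissionsWithGrantOption=["ALL"],
)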
Last but not least, you will need to uncomment the resource here.
resource "aws_lakeformation_permissions" "data_location" {
principal = aws_iam_role.glue_job_role.arn
permissions = ["DATA_LOCATION_ACCESS"]
data_location {
arn = module.s3_tables_bucket.s3_table_bucket_arn
}
}
And run again
terraform apply
And you are set. Upload a CSV file to your raw bucket under the input/ prefix and watch the pipeline run 😄
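For example, with the AWS CLI (the bucket name is whatever local.raw_data_bucket_name resolved to in your deployment):
aws s3 cp ./sample.csv s3://<your-raw-data-bucket>/input/sample.csv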
Conclusion
Exploring S3 Tables has changed how I think about storing and analyzing tabular data in the cloud. The performance, built-in Iceberg support, and “seamless integration” (Lake Formation aside) with AWS analytics tools made setting up a modern, scalable data lake easy.
If you’re curious about the next generation of data lake storage, I definitely recommend giving S3 Tables a try. It’s been a fun and eye-opening experience, and I’m excited to see how this new approach will shape future analytics projects!
To destroy what we have created today, simply run
terraform destroy
Feel free to reach out if you encounter any problems or have suggestions.
Till the next time, stay safe and have fun! ❤️