Here is a buzz-word that we hear all the time
But what is precisely a Data Lake? Is it an actual lake of data filled with file cabinets? Is it some magical place where all the data “sit”? Is it a big database that costs hundreds of thousands of dollars per month?
A Data Lake is just a place where all data are stored, structured, and unstructured. It is a centralized hub for all the data in every format. Wait, that was it? Well yes, but what is the point of having a data lake that is hard to access. So let’s add one more requirement. The data have to be accessible at a reasonable cost. In my opinion, this is a successful Data Lake.
In this post, we learn how to create a Serverless Data Lake with Amazon S3, Amazon Athena, and AWS Glue.
- Level 200
- Cost of the experiment: 0,72$
- AWS Glue Crawler: 0.15$
- Amazon S3: 0.07$ for ~100GB for one day
- Amazon Athena: 0.50$ for ~100GB of scan
Do not forget to delete the data from your bucket otherwise the 0.07$ will be 2.3$
Amazon S3 or Amazon Simple Storage Service is an object storage service that is Serverless therefore scalable. It is very durable (99.999999999%), available, and reliable. Amazon S3 is the place where we store all of our data.
AWS Glue is a fully managed ETL service that allows us to transform our files in a specified format to optimize reading time. We also map the schema of our table with AWS Glue.
Amazon Athena is an interactive querying service that allows us to query our data through SQL. Under the hood, Athena runs on-top of Apache Hive like system with Presto. It is serverless, that means no need to set up anything.
First, we need to store all of our files in S3. I usually archive all of my data that are not reproducible, in a bucket at Intelligent Tiering.
Considering Amazon Athena, there is a lovely post that lists all the best practices for that tool. We will do a quick review of the storing part for now.
- Partition your data. Make sure you partition your data correctly to avoid scanning unnecessary data. Amazon Athena is charging by GB per scan.
- Split your data into smaller parts, to parallelize the processing. Do not overdo it; each part should be about 300MB
- Format your data and compress them. Apache parquet not only lowers your costs because it is a column stored format, but it also makes the scanning faster. Compression will lower the price because Amazon Athena is charging per compressed GB and not the uncompressed (assuming you have compressed your data). SNAPPY is a lovely choice because it allows streaming while it is compressed, in contrast with GZIP. GZIP has excellent compression, but Athena has to uncompress and then search, which makes it a bit slower.
We use the Amazon Custom Review Dataset to populate Amazon Athena with data
I used the following command from the documentation to copy the data into my S3 bucket.
aws s3 cp s3://amazon-reviews-pds/tsv/ s3://<my-bucket>/archive/ --recursive
This command starts downloading the TSV files from the amazon reviews. It is around 30GB. We could have downloaded the Apache Parquet files where Athena would be using the optimal format, but I wanted to show you a new feature of Athena, the CREATE TABLE, from a query.
Once we download everything, we go and delete the index.txt file.
Now that we have all the data, we go to AWS Glue to run a crawler to define the schema of the table.
Following the steps below, we will create a crawler
We select the crawlers in AWS Glue, and we click the Add crawler button. Then, we see a wizard dialog asking for the crawler’s name.
Add a name, and click next
Select our bucket with the data. In our case, it should be something like s3://<mybucket>/archive
Create a new role for Glue to have access to our bucket.
Select the database, fill the prefixes (if we want to), and choose the Create a single schema for each S3 path, to combine different columns from different files into one table. Once we finish, run the crawler. After 1 minute or so, when the crawler ends the schema reading, we see a table in Amazon Athena.
Now that we have the data in Athena, let’s test them.
SELECT * FROM "datalake-raw"."test_data_archive" LIMIT 10;
We have a touchdown! Now we can read, query, and aggregate our data!
SELECT distinct product_category FROM "datalake-raw"."test_data_archive" (Run time: 1 minute 40 seconds, Data scanned: 30.14 GB)
OMG, that took some time!
We are going to improve the performance by creating a new table with the same data, but this time transform the data from TSV to Apache Parquet.
(Run time: 3 minutes 38 seconds, Data scanned: 30.15 GB)
Well, that took some time and some scanning, but now we have a new table in our default database that I just ensured you that is optimized. We have used Apache Parquet with compression SNAPPY (lines 2, 3), and used the columns marketplace and product_category as partitions (lines 4, 22, 23). Notice that I used both partitions at the end of the query; this is how it works. Let’s see now… is the claim of optimization valid?
SELECT distinct product_category FROM "default"."test_data_parquet" (Run time: 2.81 seconds, Data scanned: 0 KB)
Oh well, look at that! Instead of 155 seconds and 32GB of scanned data, it took only 3 seconds and… 0KB?? Is that right? Yes, it is! Since it is a partition we have keys in S3 in the following format:
That means, Athena only has to look at the distinct keys under product_category and not the actual data. Ok, that is technically cheating, we could have done that with the TSVs too. Let’s try another example.
SELECT distinct product_parent FROM "datalake-raw"."test_data_archive" (Run time: 1 minute 41 seconds, Data scanned: 30.15 GB)
Ok, that seems promising…
SELECT distinct product_parent FROM "datalake-raw"."test_data_parquet" (Run time: 16.49 seconds, Data scanned: 656.79 MB)
That is an improvement! From 101 seconds and 30GB of scanned data, we dropped at 17 seconds and 660MB of scanned data. But at what cost?
Let’s see the space occupied in S3:
- TSV – GZIP: 29.2GB on S3
- Parquet – SNAPPY: 43.5 GB on S3
That is not very much! Considering we dropped in time by a factor of 10 and in data scanned by a factor of 100!
As you can see, we have created a Serverless Data Lake that can scan 43.5 BG of data in seconds. I admit we had to make some improvements and transformations, but in the bottom line, we are querying raw data with minimal cost and effort. Let’s make some naive assumptions:
Let this be our Data Lake for one year, with no additions on files. We calculate the cost to the following:
- AWS Glue Crawler: 0.15$ – Only ran once
- S3 cost: 44GB + 40GB = 84GB * 12 months * 0.0245 per month per GB = ~25$
- Athena cost: 5$ per TB, our queries with the parquet files were at MB, let’s say we overused it and we scanned 2TB = 10$
It comes to a total of ~36$ for hosting 84GB of data that is immediately accessible by common SQL. I think that is pretty impressive!
That’s it for today! Let me know in the comments section below or at my Twitter @siaterliskonsta if you have any specific requests about what you want to see next on AWS. Till next time, have a beautiful day!