Watch Out For Unexpected S3 Cost When Using AWS Athena
Following up on my last blog post (Using Parquet on Athena to Save Money on AWS), I wanted to share another thought about AWS Athena, specifically how the S3 bucket is being used by Athena to store query results.
On the first use of Amazon Athena, AWS will automatically create a new bucket to store the query results (bucket name aws-athena-query-results--). Athena will store a raw result file (QueryId.csv) and a metadata file (QueryId.csv.metadata).
Because the results are stored under the QueryId, you can access a previous query’s results without re-running it (saving you money since you don’t need to rescan the data).
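As a minimal sketch of re-using an earlier result, the snippet below looks up where Athena wrote the output for a given QueryId, so the file can be downloaded instead of re-running the query. It assumes a boto3 Athena client is passed in; the function name `result_location` is hypothetical and not something from Athena itself.

```python
from urllib.parse import urlparse


def result_location(athena_client, query_id):
    """Return (bucket, key) of the result file for an existing query."""
    response = athena_client.get_query_execution(QueryExecutionId=query_id)
    # OutputLocation is an s3:// URI pointing at the stored result file
    output = response["QueryExecution"]["ResultConfiguration"]["OutputLocation"]
    parsed = urlparse(output)
    return parsed.netloc, parsed.path.lstrip("/")
```

With real credentials, usage would look roughly like `bucket, key = result_location(boto3.client("athena"), query_id)` followed by `boto3.client("s3").download_file(bucket, key, "result.csv")`. No data is rescanned in either call.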
However, you own the bucket and are therefore responsible for its storage costs. Here are a couple of reasons why it could cost you a LOT of money:
#1 All the queries are being stored! ALL OF THEM!
AWS Athena stores every query result in the bucket. Unless you delete them, the results just accumulate forever, costing you more and more money on AWS.
#2 Your data may be compressed but the results are not
Athena writes the results to S3 as raw CSV. Your data may be compressed (GZIP, Snappy, …), but the results are not. As an example, I accidentally ran SELECT * FROM flights.parquet_snappy_data on an 84M dataset stored as Apache Parquet, which resulted in a 977MB file on S3.
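To get a feel for how this adds up, here is a rough back-of-the-envelope sketch. The price per GB-month and the query volume are assumptions for illustration only (check the S3 pricing page for your region); only the 977MB result size comes from the example above.

```python
def accumulated_storage_cost(result_gb, queries_per_day, days, price_per_gb_month):
    """Estimate the total storage bill when no result is ever deleted.

    Each day adds `result_gb * queries_per_day` of data, and every GB
    is billed for each day it remains in the bucket.
    """
    daily_gb = result_gb * queries_per_day
    price_per_gb_day = price_per_gb_month / 30  # approximation: 30-day month
    total = 0.0
    for day in range(1, days + 1):
        stored_gb = daily_gb * day  # everything written so far
        total += stored_gb * price_per_gb_day
    return total


# Hypothetical workload: 50 queries/day of ~977MB each over one year,
# at an assumed 0.023 USD per GB-month.
cost = accumulated_storage_cost(0.977, 50, 365, 0.023)
print(f"Estimated storage cost over the year: ${cost:,.2f}")
```

Because nothing is ever removed, the amount stored grows linearly and the cumulative bill grows roughly quadratically with time.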
How to fix this?
It’s actually pretty easy. If (and only if) you don’t plan to re-use old query results, make sure to set up a Lifecycle configuration on your bucket using Transition or Expiration actions. For example, you could delete query results after 1 or 7 days. At CloudForecast, we don’t persist QueryIds since they’re not useful to us, so we expire the AWS S3 files after 1 day.
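Here is a minimal sketch of such an Expiration rule using boto3. The function names and the rule ID are hypothetical, and the rule applies to the whole bucket, so only use it on a bucket dedicated to Athena results. Note that `put_bucket_lifecycle_configuration` replaces any lifecycle configuration already on the bucket.

```python
def expiration_lifecycle(days, prefix=""):
    """Build an S3 lifecycle configuration that deletes objects after `days`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": f"expire-athena-results-{days}d",  # hypothetical rule name
                "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiration(bucket, days=1):
    """Attach the rule to `bucket`, replacing any existing lifecycle config."""
    import boto3  # imported here so the builder above works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=expiration_lifecycle(days),
    )
```

Calling `apply_expiration("<your results bucket>", days=1)` would mirror the 1-day expiry described above; pass `days=7` if you want a week of history instead.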
Feel free to reach out if you have any questions at [email protected] or on Twitter: @francoislagier. Also, follow our journey @cloudforecast.