Redshift vs Athena

Amazon Redshift vs Amazon Athena

“Big data” is a buzzword in today’s world, and many businesses are looking into how to handle their own big data. A common solution for many is cloud-based data services. Both products of Amazon, Redshift and Athena are tools that have helped build cloud-based data warehouse technologies into more interactive, current, and analytical solutions to big data problems.

While both are great means of analyzing data, each has its own advantages and disadvantages. In this tutorial, we’ll explain more about Amazon Redshift and Amazon Athena and do a comparison between the two.

Amazon Redshift

Redshift is a fully managed data warehouse that exists in the cloud. It’s based on PostgreSQL 8.0.2 and is designed to deliver fast query and I/O performance for any size dataset. Redshift first requires the user to set up collections of servers called clusters; each cluster runs an Amazon Redshift engine and holds one or more datasets. Users are then able to quickly run complicated queries and intelligently analyze the outcomes. Redshift is best used for large and structured datasets.

Amazon Athena

Athena is an interactive query service that allows you to conveniently analyze data stored in Amazon Simple Storage Service (S3) by using basic SQL. It’s completely serverless, meaning there’s no foundation that needs managing or set up, and it’s also fully portable. Athena can be used to analyze unstructured, semi-structured, and structured data stored in Amazon S3.

Now that you have a general understanding of both Redshift and Athena, let’s talk about some key differences between the two. In this tutorial, we’ll compare Amazon Redshift and Amazon Athena on basics, performance, management, and cost.

The basics

RedshiftAthena
Partitioning
  • Does not support direct partitioning by default
  • Uses predefined distribution keys to optimize tables for parallel processing
  • Poor manual partition key selection can dramatically impact query performance, so Redshift does it for you”
  • Can partition by any key with up to 20,000 per table
  • Supports several Serializer/Deserializer (SerDe) libraries for parsing data from different data formats: CSV, JSON, TSV, and Apache logs
  • Does not support arrays or object identifier types
User Defined Functions
  • Supports UDFs with scalar and aggregate functions
  • Does not support UDFs
Data Formats and Types
  • Supports several Serializer/Deserializer (SerDe) libraries for parsing data from different data formats: CSV, JSON, TSV, and Apache logs
  • Does not support arrays or object identifier types
  • Supports several Serializer/Deserializer (SerDe) libraries for parsing data from different data formats: CSV, JSON, TSV, Parquet, and ORC
  • Beneficial due to Athena’s convenient data to query structure
  • Supports complex data types like arrays, maps, and structs
Primary Key Constraint
  • Key not required
  • Can duplicate data multiple times
  • If needed, the key must be declared before data is loaded into the warehouse
  • Key not required
  • Duplication exists only if already contained in S3 datasets

Performance

For a detailed example of each product’s performance, check out this article from Panopoly.

RedshiftAthena
Startup
  • Takes several minutes
  • Requires the set-up of a cluster
  • Must manually load data into created tables
  • Can happen instantly
  • Requires no set-up
  • Can immediately begin queries on data in Amazon S3
Table Creation
  • Uses PostgreSQL
  • Faster than Athena
  • Uses Apache Hive Query Language (HQL)
  • Must specify S3 bucket location for data
  • Slower than Redshift
Query Speed
  • Built for running complex queries that can involve multiple data sources
  • Specific query performance:
    • Simple Read – Slower than Athena
    • Aggregated – Slightly slower than Athena
    • Join Query – Faster than Athena due to the ability to easily handle traditional joins and relational workloads
  • *Must use `COPY` command to move data into a table from data files or Amazon DynamoDB tables.
    • Copied files may reside in an S3 bucket, an EMR cluster, or on a remote host accessed
  • Built for running queries on a single data source, regardless of data organization
  • Specific query performance:
    • Simple Read – Faster than Redshift
    • Aggregated – Slightly faster than Redshift
    • Join Query – Slower than Redshift due to simpler focus

Management

RedshiftAthena
Security
  • A cluster security group is needed to give other users access to clusters
  • Can use Amazon’s Virtual Private Cloud to protect access to your cluster
  • Various encryption methods can be applied to protect clusters, connections, and data files
  • Use Amazon’s Identity Access Management (IAM)
  • Users must have permission to access the the S3 data locations
  • Can easily query encrypted data stored in S3 and write encrypted results back to your S3 bucket
Upgrading
  • Strictly node based, making upgrading simple: scale up a cluster by adding nodes
  • Strictly tied to S3 and operates as a managed system on top of data, therefore:
    • Limited to 100 S3 buckets per account
    • Limited to 100 databases
  • Must request higher limits for any restriction
Querying Tables
  • Must load data into warehouse first
  • Loading can be time consuming, but once loaded, queries are faster than Athena
  • Uses Presto to query tables
  • Partitioning tables helps with faster queries and performance
  • Converting data to columnar formats also helps with faster queries

Cost

RedshiftAthena
Pricing
  • Based on type and number of nodes in a cluster
  • Hourly rate for both dense compute nodes and dense storage nodes
  • Predictable price with no penalty on excess queries, but can increase overall cost with fixed compute (SSD) and storage (HDD)  
  • Price is per terabyte of data scanned during a query execution
  • Minimum of 10 megabytes per query execution
  • No charge for failed queries
  • Most cost effective when data is compressed, partitioned, or converted to a columnar format

 

Conclusion

After comparison, it’s clear to see that there’s no right or wrong answer when choosing between Amazon Redshift and Amazon Athena; the choice ultimately depends on the needs of your business. Both products provide different functions and take a different approach to cloud-based services. Redshift requires framework management and data preparation while Athena bypasses that and gets straight to querying data from Amazon S3.

Amazon Redshift excels when it comes to large, organized, and traditionally relational datasets- it does well with performing aggregations, complex joins, and inner queries. The foundation of Redshift is great for expanding data, and it’s just as simple as adding more clusters. Cost depends on data type and total usage which can create a beneficial predictability for businesses. Overall, Redshift works best for running high-performance complex queries that involve sizeable datasets.

Amazon Athena is noteworthy due to its simple yet efficient quality. No initial set up is required which makes ad hoc querying easy. It’s practical for simple read and aggregated queries and is relatively cost effective. Generally, Athena works best for quickly and conveniently running queries at a low cost without needing to set up a complex infrastructure.

About Bryn Burns

Hi! I'm Bryn Burns. I am a current senior at Virginia Tech pursuing degrees in Statistics and Mathematics. Data science and visualization are two things I'm very passionate about, as well as working with numbers and helping people learn. I'm thrilled to share my knowledge here at The Data School!