BigQuery vs Athena

Google BigQuery vs Amazon Athena

When it comes to storing data, serverless options are growing more and more popular among businesses every day. Going serverless reduces operational, developmental, and scaling costs, as well as eases management responsibility within your business. Google and Amazon have both done an outstanding job with their take on serverless operating, and have created two similar services in the process: Google BigQuery and Amazon Athena.

While both are great means of analyzing data, each has its own advantages and disadvantages. In this tutorial, we’ll explain more about BigQuery and Athena and do a comparison between the two.

Google BigQuery

BigQuery is a serverless data warehouse that supports super-fast SQL queries using the processing power of Google’s infrastructure. It combines a couple of other Google services, Dremel, Borg, Colossus, and Jupiter into a package that’s convenient for running ad hoc queries across very large databases. Simply load your data into BigQuery and let Google handle the hard work.

Amazon Athena

Athena is an interactive query service that allows you to conveniently analyze data stored in Amazon Simple Storage Service (S3) by using basic SQL. It’s completely serverless, meaning there’s no foundation that needs managing or set up, and it’s also fully portable. Athena is primarily used to analyze unstructured, semi-structured, and structured data stored in Amazon S3.

Now that you have a general understanding of both BigQuery and Athena, let’s talk about some key differences between the two. We’ll compare Google BigQuery and Amazon Athena on basics, performance, management, and cost.

The basics

BigQueryAthena
User Defined Functions-Supports UDFs as a JavaScript function that is called as part of a query-Does not support UDFs
Data Formats and Types-Supports several data formats: CSV, JSON, Avro, Parquet, and ORC

-Supports loading from Google Cloud Datastore backups

-Supports complex data types like arrays and structs

-Supports several Serializer/Deserializer (SerDe) libraries for parsing data from different data formats: CSV, JSON, TSV, Parquet, and ORC, Avro, Logstash log files, Apache log files, CloudTrail log files
  • Beneficial due to Athena’s convenient data to query structure

-Supports loading from Amazon S3 buckets

-Supports complex data types like arrays, maps, and structs

Partitioning-Two types of partitioning:
  • based on the data’s arrival date (ideal for timeseries data)
  • based on a TIMESTAMP or DATE column
-Can partition by any key with up to 20,000 per table
Query Results -All results are cached in temporary tables for approximately 24 hours with some exceptions
  • Beneficial if you refresh queries often — if refreshes hit the cache, you don’t waste any money or resources re-computing what’s already been done
-All results are stored in Amazon S3 and each query has:
  • A results file stored automatically in a CSV format (*.csv)
  • An Athena metadata file (*.csv.metadata)

Performance

For a detailed example of each product’s performance, check out this article from Logz.io.

BigQueryAthena
Start Up-Start up can happen instantly

-Requires loading data into BigQuery

-Can immediately begin queries on data already loaded in BigQuery

-Start up can happen instantly

-Requires no set-up

-Can immediately begin queries on data in Amazon S3

Creating Tables-Uses Dremel, Google’s interactive query system to create tables

-Supports the following table types:

  • Native- backed by native BigQuery storage
  • External- backed by external storage
  • Views- virtual tables defined by a SQL query
  • Wildcard- enable you to query multiple tables using concise SQL statements
    • represents a union of all the tables that match the wildcard expression
-Uses Apache Hive to create tables

-Supports the following table types:

  • External- backed by external storage
  • Views- virtual tables defined by a SQL query

-Stores newly created table schema in a data catalog and uses it when you run queries

  • Uses schema-on-read approach:
    • a schema is projected onto your data at the time you execute a query, eliminating the need for data loading or transformation
Query Speed-Uses Dremel to run queries

-Built for running queries on massive datasets that are natively stored in BigQuery

-Specific query performance:

  • Simple Select: Faster than Athena
  • Aggregated: Faster than Athena
    • Lightning speed due to use of native tables
-Uses Presto to run queries

-Built for running queries on a smaller, single data source, regardless of data organization

-Specific query performance:

  • Simple Select: Slower than BigQuery
  • Aggregated: Slower than BigQuery
    • Partitioning tables helps with faster queries and performance
    • Converting data to columnar formats also helps with faster queries

Management

BigQueryAthena
Security-Uses Google Cloud’s Identity and Access Management (IAM)

-Users must be granted access to resources

  • To grant access, assign roles to users, groups, or service accounts

-Encrypts customer data stored at rest by default- multiple encryption options:

  • Encryption by Default- no need for further configurations
  • Customer-Managed Encryption- store keys in the cloud, for direct use by cloud services
  • Customer-Supplied Encryption- store keys on premises and use them to encrypt cloud services
-Uses Amazon’s Identity Access Management (IAM)

-Users must have permission to access the the S3 data locations

-Can easily query encrypted data stored in S3 and write encrypted results back to your S3 bucket

Data Sources-Can query data without loading it if it’s already stored in BigQuery
  • Public datasets
  • Shared datasets
  • Stackdriver log files

-External sources can be queried without loading

  • Must create a table that references the external data source
-Can only query on data that is stored in an Amazon S3 bucket

Cost

BigQueryAthena
Pricing-Storage costs are based on the amount of data stored:
  • Active- monthly charge for data stored in tables that have been modified in the last 90 days
  • Long-term- lower monthly charge for data stored in tables that have not been modified within the last 90 days

-Query cost is per terabyte with two options:

  • On-Demand- based on usage; provides flexibility
    • Minimum of 10 megabytes per query executed
    • No charge for failed queries or queries that use the cache
  • Flat-Rate- cost of all bytes processed is included in monthly rate
    • Beneficial for high-volume customers- stable monthly cost
-Storage costs are based on the amount of data stored in Amazon S3
  • Cost dependent on region
  • Four categories of storage:
    • Standard
    • Standard-Infrequent Access
    • One Zone-Infrequent Access
    • Amazon Glacier

-Query cost is per terabyte of data scanned during a query execution

-Minimum of 10 megabytes per query execution

-No charge for failed queries

-Most cost effective when data is compressed, partitioned, or converted to columnar format

Conclusion

After the above comparison, it’s clear that there’s no right or wrong answer when choosing between Google BigQuery and Amazon Athena; the choice ultimately depends on the needs of your business. Both products provide different functions and take a different approach to cloud-based services. BigQuery allows you to run SQL-like queries on multiple terabytes of data in a matter of seconds, and Athena allows you to quickly run queries on data from Amazon S3.

Google BigQuery excels when it comes to querying on petabyte-scale datasets. It’s created to query structured and semi-structured data using standard SQL, and its lightning fast speed can almost entirely be credited to its ability to work with native tables. The warehouse is cloud-based and fully managed, so operational overhead is nonexistent. Overall, BigQuery works best for running interactive and ad-hoc queries that involve sizeable datasets at a very fast pace.

Amazon Athena is noteworthy due to its simple yet efficient quality. No initial set up is required which makes ad hoc querying easy. It’s practical for simple and aggregated queries and is relatively cost effective. Generally, Athena works best for quickly and conveniently running queries at a low cost without needing to set up a complex infrastructure.

About Bryn Burns

Hi! I'm Bryn Burns. I am a current senior at Virginia Tech pursuing degrees in Statistics and Mathematics. Data science and visualization are two things I'm very passionate about, as well as working with numbers and helping people learn. I'm thrilled to share my knowledge here at The Data School!