In order to build a Data Lake, we need to choose a database to host it on. Historically, and still today at massive (> 100GB/day) scale, the Lake was stored in a file system like S3 buckets.
Today, with storage being so cheap and warehouses being so scalable, we recommend putting your lake data directly into what is called a Warehouse Engine. This will make creating the Data Warehouse much simpler as we’ll cover once we get to that next stage.
What is a Warehouse Engine?
In 2005 a combined group from Brown University, Brandeis University, and MIT released a ground breaking paper know as the C-Store paper introducing a new column store architecture. The many developments in that paper led to a new class of cloud based databases that can very powerfully handle large sets of data.
These engines are geared toward analytic workloads that require larger, but less frequent queries than their transactional counterparts. Transactional databases like PostgreSQL are optimized to do quick reads and writes at incredibly high volumes in order to run the applications that they serve. Analytic use cases query data way less frequently, but their queries are usually more complex and over larger sets of data.
If these are vehicles, transactional databases are motorcycles capable of many quick trips while warehouse engines are semis doing fewer trips but hauling large loads.
The biggest decision to make when moving from production to a lake is what database you will use. Most people consider:
There are a variety of database pricing models, from being based on storage to being based on the amount of data queried. If your company is strictly using Amazon or Google as your software vendors, this can dictate your vendor choice as well.
The architecture of the Data Lake has implications on how it’ll help your operations scale. Differences in the many types of lakes entail columnar vs. row-oriented storage, and having storage and compute together or separated. If there are requirements for ongoing maintenance of your Data Lake you will want to know that as well.
When selecting the right data engine for your organization, you may also consider whether you want an on-premise or cloud solution. More and more businesses are moving to cloud solutions to take advantage of the “as a service” model and save on hardware costs so, we’ll focus on cloud databases in this section.
Modern Warehouse Engine Products
Today, there are three dominant choices for cloud based data warehouse engines: Amazon Redshift, Google BigQuery and Snowflake. Note - all of these are similar and based on the C-Store paper.
Redshift has the benefits of ease of use, speed, and cost. Being a part of AWS, there is full service integration for the wide range of AWS services such as S3 for storage and CloudWatch for infrastructure monitoring. Redshift is generally cheaper than Snowflake or BigQuery, with a couple of pricing options such as paying hourly per node or paying by number of bytes scanned with Redshift Spectrum. It’s simple to set up and scale by adding nodes to your cluster and increasing storage and performance.
Redshift is probably the most popular, although it is losing ground to Snowflake. It benefits from being similar in connection and SQL syntax to PostgreSQL.
Users can often run into concurrency issues with Redshift if it isn’t set up properly or if there are high volumes of queries from many users accessing the database. Ongoing maintenance may be required with Redshift to resize clusters, define sort keys, and vacuum data.
Like many AWS services there are ways to customize your configuration with workload management, compression, and partitioning. But these advanced features are not very out of the box. So although Redshift is powerful it may require a dedicated resource from your data engineering team.
BigQuery is not bound by cluster capacity of storage or compute resources, so it scales and performs very well with increasing demands for concurrency (e.g. more users and queries accessing the database). As a fully managed database, BigQuery handles vacuums and resizing on its own which can save time for your data engineers and makes it easy to use and maintain. For businesses using Google products, BigQuery integrates well with Google Drive and Google Analytics.
Cost is determined per query byte, making it difficult to budget or regulate if you have users running ad hoc queries against the Data Lake. To work around this, you can leverage BigQuery’s cost controls, but it can still restrict the amount of analysis you can perform because it limits the queries you can run.
Like BigQuery, Snowflake has an architecture that separates the compute query engine from data storage. As a result, it is highly scalable at any amount of volume and concurrency. Pricing is based on the storage and compute used on a time-basis with their virtual databases instead of per bytes scanned. Tuning, indexes, and distribution keys aren’t required for queries to be optimized and performant. Because of these reasons, it can be said that Snowflake has many of the benefits of both Redshift and Big Query.
Snowflake is a relatively new database in the market, so if you are familiar with SQL functions supported by databases like Redshift or Postgres you may find some inconsistencies in the SQL syntax. Snowflake is also generally more expensive.
Unlike the options above, PostgreSQL is an open source database that is free to download. It can easily be spun up on your local server or hosted on various cloud services such as AWS. Postgres also has an ANSI compliant SQL library and supports an extensive library of third-party and user-defined functions. As it’s a transactional database, it has very fast writes and also has fast reads below ~100M rows.
Postgres is a straightforward, flexible solution that’s different from Snowflake, Redshift, and BigQuery because it is a row-oriented database more suited for processing transactional data over analytical queries. It’s a single database connection not architected for parallel processing, so it generally doesn’t perform as well if you have a data volume of over 1 TB. Consequently, Postgres is great as a database, but is not a good choice for a Data Lake if you have a high volume of data (>1TB).
Selecting a Data Warehouse can be dependent on a number of factors that should be considered before making the investment. If you prefer a cheap, straightforward Data Warehouse you may be tempted to go with PostgreSQL, however it will have trouble scaling as a Data Warehouse.
Redshift is a good choice as a standard cloud Data Warehouse if you have the capacity for a dedicated DBA. BigQuery and Snowflake are both highly scalable solutions considering their architecture. However, if cost or concurrency limits will be an issue for you then Snowflake would be more suitable for your organization.
Remember all of these data warehouses are built on the same C-Store architecture so the differences will not be severe in performance. If you’d like a full benchmarking (though the same final recommendation) do checkout Fivetran’s awesome warehouse benchmark.