Why Build a Data Lake

Last modified: October 17, 2019 • Reading Time: 5 minutes

What is a Data Lake?

A Data Lake is a storage repository of multiple sources of raw data in a single location. In the cloud these are typically stored in cloud c-store data warehouses or in S3 buckets, the data can be in a variety of formats and can be structured, semi-structured, unstructured, or even binary.

The term Data Lake, after oil lakes (pre-refinery oil), was created to contrast the term Data Mart which described orderly, siloed, and refined data. Having all the data in one place made it easier to work with large data sets and to start getting out insights earlier in the data modeling process.

Top 3 reasons to build a Data Lake

  1. Unified - makes it easy to query and combine data from various sources to find valuable insights
  2. Performance - after getting data out of their respective tools you can optimize storage, schema, and queries to get the data you need fast.
  3. Progress - building towards a single source of truth

1) Unified

It can be challenging to do complex analysis since you only have access to the data within a single source at a time. Without being able to combine sources it limits your ability to find insights. Many tools that generate data provide fairly basic reporting and analytics functionality further restricting the types of questions that can be asked and making it difficult to find insights.

In a Data Lake, all data can be combined so it can be analyzed together. This makes finding insights easier and provides more depth for exploring the data. Often-times to use BI tools it is necessary to get all the data together first.

2) Performance

Source data might be from the actual production database which could affect the performance of the application that it is powering. Queries that demand a lot of data such as aggregations are not optimally run on transactional databases.

Data Lakes are built to handle these types of ad hoc analytical queries independently of the production environment. You can scale up resources on a Data Lake to be able to query data even faster.

3) Progress

Raw data comes in many formats that can be tricky to query. While your production database is likely in a SQL format, other tools will store data in more complex ways such as JSON.

JSON format:

{
	"firstName": "John",
	"lastName": "Smith",
	"age": 27,
	"phoneNumbers": [
		{
			"type": "home",
			"number": "212 555-1234"
		},
		{
			"type": "office",
			"number": "212 555-1234"
		},
		{
			"type": "mobile",
			"number": "123 456-7890"
		}
	]
}

Normalized SQL format:

User

id firstName lastName age
1 John Smith 27

PhoneNumbers

User_id type phoneNumber
1 home 212 555-1234
1 office 212 555-1234
1 mobile 123 456-7890

ELT tools move data into a lake and provide ways of getting your data into a SQL format so it can be queried easily. This is also a big step towards creating your single source of truth for your data

Example Data Lake

Multiple Schemas

We can see a variety of data sources and schemas we can query from.

Query Across Schemas

Combining datasets can be tricky. Having a flexible BI tool such as Chartio allows you to navigate this with relative ease.

Written by: Tim Miller, Matt David
Reviewed by: Dave Fowler

Next – Data Lake Architecture

Get new data chapters sent right to your Inbox