Now that we’ve established what type of changes we want to make and decided on what engine to use for our Data Warehouse, let’s go through the process of getting data from the Lake into the Warehouse. While this sounds complicated it is only comprised of using SQL to create Views.
We recommend using SQL to perform all transformations. It is the standard language for relational database management systems (which is what a Data Warehouse should be) and that is the environment you are probably using for your Data Lake. Working in a SQL based model is ideal because a variety of tools and platforms already exist to write and execute queries. Also, data engineers, analysts, and some business users already understand how to use it.
Views allow us to quickly reformat what the data looks like for people querying without needing to build a new Data Warehouse or incurring costs from storing any additional data. Unless you are dealing with massive amounts of data there are not significant performance gains in creating new tables or materializing the views.
Use a Modeling tool: dbt
Instead of writing the views directly on the database (which is an option) we recommend using dbt for creating your SQL views. dbt provides many features to help you keep a clean Data Warehouse such as version control, logging, and much more.
Data Lake to Data Warehouse View Examples
Here is an example of applying a transformation to move from Data Lake to Data Warehouse. First, we build a query to combine a couple of salesforce objects into a single table. For example, using information about an individual and their role within a client company can give you more insight into how you may want to interact with that person.
So, getting information on that person’s role into the same table as his/her contact along with some basic demographic information, will save the end user some time in querying the Data Warehouse.
That query might look like this:
We are choosing a subset of the total possible columns and rolling/denormalizing the table a bit to make it easier for others to query. To make this code into SQL that builds our Data Warehouse, we need to add CREATE VIEW. So the query would actually be:
CREATE VIEW salesforce_user AS SELECT u.id ,u.name ,u.email ,u.department ,u.phone ,u.phone ,u.created_date ,u.is_active ,u.last_modified_date ,ur.name as role_name ,ur.rollup_description as role_rollup FROM salesforce.user as u left join salesforce.user_role as ur on u.user_role_id = ur.id;
If we go back to the example first introduced in the Why Build a Data Warehouse article we can walk through all of the transformations described in one SQL query. So let’s look at that messy table with all of the hard to understand/query fields.
We then want to make all of the following changes to that data:
We can create this as a series of SQL statements in a dbt file of common table expressions with a final CREATE VIEW query at the bottom:
-- drop unused column External_id WITH t1 AS ( SELECT Id, Name, Display Name, Email, Location, Type, Info, Status FROM dl_table ), -- Add consistent column Email t2 AS ( SELECT Id, Name, Display Name, Email, Location, Type, Info, Status, is_deleted FROM t1 JOIN dl_email ON t1.Id = dl_email.Id ), --Standardize Location column t3 AS ( SELECT Id, Name, Display Name, Email, CASE WHEN Location = "US" THEN "USA" WHEN Location = "Texas" THEN "USA" WHEN Location = "Sao Paulo" THEN "Brazil" ELSE Location END AS "Location", Type, Info, Status, is_deleted FROM t2 ) --Make column names and values descriptive for Type t4 as ( SELECT Id, Name, Display Name, Email, Location, CASE WHEN Type = "1" THEN "Can view" WHEN Type = "2" THEN "Can edit" WHEN Type = "3" THEN "Can admin" END AS "Access Level", Info, Status, is_deleted FROM t3 ) --Parse relevant fields, drop original column for Info t5 as ( SELECT Id, Name, Display Name, Email, Location, Access Level, CASE WHEN Info = "%active" THEN "active" WHEN Info = "%inactive" THEN "inactive" END AS "Status", is_deleted FROM t4 ) -- filter row that was deprecated from is_deleted, and drop column t6 as ( SELECT Id, Name, Display Name, Email, Location, Access Level, Status FROM t5 WHERE is_deleted != True ) -- create view for Data Warehouse CREATE VIEW dw_table AS SELECT * FROM t6
For a given table we suggest managing all transformations step by step in common table expressions with notes describing what is happening at each step.
We now have a clean view of the original data
A Data Warehouse may still have a few issues in the data but the vast majority should be handled with obvious work arounds.
- Create Views for your Data Warehouse
- Lightly clean and denormalize your data so that it is easier to query
- Use a modeling tool such as dbt to manage these transformations