How should you get your various data sources into the data lake? Well there are two common paradigms for this.
- ETL is the legacy way, where transformations of your data happen on the way to the lake.
- ELT is the modern approach, where the transformation step is saved until after the data is in the lake. The transformations really happen when moving from the Data Lake to the Data Warehouse.
ETL was developed when there were no data lakes; the staging area for the data that was being transformed acted as a virtual data lake. Now that storage and compute is relatively cheap, we can have an actual data lake and a virtual data warehouse built on top of it.
We recommend ELT because
- We care more about Extracting and Loading the data into a common place at the Data Lake stage.
- Data on a lake will go through heavy transformations during the next stage, so there is no need for complex logic before the data is loaded. The transformation step of ELT happens in the data warehouse.
- We can end up with a much simpler architecture which means less problems and less maintenance.
- Data lineage becomes easier to track as complex transformations are not happening prior to loading the data into the lake.
Light transformation of the data before loading the data into the lake might still be necessary:
- Column Selection: Select the data that really matters. For example, not everything in Salesforce needs to be synced.
- Privacy reasons: for example, filtering out columns that contain PII (personally identifiable information). Instead of filtering, you might want to hash PII data so they can be used for your analytics.
The above transformation cases can be included in the ELT paradigm and are offered by most commercially available ELT vendors.