How should you get your various data sources into the data lake? There are two common paradigms for this:
- ETL is the legacy approach, where transformations of your data happen on the way to the lake.
- ELT is the modern approach, where the transformation step is deferred until after the data is in the lake. The transformations happen when moving data from the Data Lake to the Data Warehouse.
ETL was developed when there were no data lakes; the staging area for the data being transformed acted as a virtual data lake. Now that storage and compute are relatively cheap, we can have an actual data lake and a virtual data warehouse built on top of it.
We recommend ELT because:
- We care more about Extracting and Loading the data into a common place at the Data Lake stage.
- Data on a lake will go through heavy transformations during the next stage, so there is no need for complex logic before the data is loaded. The transformation step of ELT happens in the data warehouse.
- We end up with a much simpler architecture, which means fewer problems and less maintenance.
- Data lineage becomes easier to track as complex transformations are not happening prior to loading the data into the lake.
Light transformation of the data before loading the data into the lake might still be necessary:
- Column Selection: Select the data that really matters. For example, not everything in Salesforce needs to be synced.
- Privacy reasons: for example, filtering out columns that contain PII (personally identifiable information). Instead of filtering, you might want to hash PII data so it can still be used in your analytics.
The above transformation cases can be included in the ELT paradigm and are offered by most commercially available ELT vendors.
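As a sketch of what such light transformation can look like, the snippet below selects only the columns that matter and hashes a PII column instead of dropping it. The record shape, column names, and the `light_transform` helper are all hypothetical, used only for illustration:

```python
import hashlib

# Hypothetical raw records pulled from a source such as a CRM.
records = [
    {"id": 1, "email": "ada@example.com", "plan": "pro", "internal_notes": "call back"},
    {"id": 2, "email": "bob@example.com", "plan": "free", "internal_notes": "churn risk"},
]

KEEP_COLUMNS = {"id", "email", "plan"}   # column selection: sync only what matters
PII_COLUMNS = {"email"}                  # privacy: hash instead of dropping

def light_transform(record):
    out = {k: v for k, v in record.items() if k in KEEP_COLUMNS}
    for col in PII_COLUMNS:
        if col in out:
            # Hashing keeps the value usable as a join/grouping key
            # without storing the raw PII in the lake.
            out[col] = hashlib.sha256(out[col].encode()).hexdigest()
    return out

cleaned = [light_transform(r) for r in records]
```

Note that a plain, unsalted hash of low-entropy values like email addresses can still be reversed with a dictionary attack; a production setup would use a keyed or salted hash.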
Transforming for Analytics
While we do not recommend much transformation occurring at the Data Lake stage, there are a few scenarios where light changes are useful.
Using a file system
In the case of S3 as a data lake, you can use Amazon Redshift Spectrum to query and transform your data in S3 and load the results into Amazon Redshift using SQL. You will then be able to run analytical queries in your BI tool.
In the case where you use a data warehouse database such as Redshift as your data lake, you can use a tool like dbt to transform your data so it is more easily queried by your BI tool.
Keep in mind that you can use these same transformation tools to extract data from the Data Lake and load it into your Data Warehouse.
How to ELT to a Data Lake
There are many options for moving your data into a Data Lake, regardless of the source it comes from.
- Use an ELT vendor like Fivetran, Stitch Data or Blendo
- If you want to perform ETL, you can use something like Xplenty or AWS Glue
- DIY. If you go this direction, please at least use a framework like Apache Airflow. The last thing you want is a mess of scripts and cron jobs deployed haphazardly.
Don’t do DIY
DIY should be avoided: you will dedicate precious engineering resources to something that can be done at a fraction of the cost and time using a cloud solution for ELT. Your data engineers can then work on more important data projects related to your overall data infrastructure and product.
Use an ELT Vendor
All ELT providers mentioned above offer a simple experience for extracting and loading your data into your data lake. The process usually involves setting up a pipeline: you provide credentials for both the destination and the data source, plus some configuration for light transformation, e.g. selecting which tables and fields to sync, hiding some values for privacy reasons, etc. In most cases this requires minimal engineering effort.
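To make the setup concrete, a pipeline configuration typically captures the source, the destination, and the light transformations to apply. The structure below is a hypothetical sketch loosely modeled on what such vendors expose in their UIs; the field names are assumptions, not any vendor's real API:

```python
# Hypothetical ELT pipeline configuration; every key name here is
# illustrative, not taken from a real vendor's schema.
pipeline = {
    "source": {"type": "salesforce", "credentials": "stored-secret-ref"},
    "destination": {"type": "s3", "bucket": "my-data-lake"},
    "sync": {
        # Column selection: only these tables and fields are synced.
        "tables": {
            "Account": {"fields": ["Id", "Name", "Industry"]},
            "Opportunity": {"fields": ["Id", "Amount", "StageName"]},
        },
        # Light privacy transformation applied before loading.
        "hashed_fields": ["Account.Name"],
        "mode": "incremental",
    },
}
```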
Extract and Load Options
When we move data into a Data Lake, we have a few different strategies for both how we extract data from the sources and how we load it into the Lake.
Extraction is the phase of pulling data from a data source, through APIs or SQL. On each sync we can either do a complete extraction, pulling all the data available from the source, or an incremental extraction, pulling only the records that are new or have changed since the last sync.
A complete extraction is the easiest way since no configuration is required, but it has two big disadvantages:
- You end up with a lot of duplicate data in your data lake
- You increase the complexity of the next steps in your analytics stack
You will have to figure out what data you actually need in the data lake, which requires more complex logic and more processing.
The preferred alternative is to do incremental extractions. This is more challenging since you need to check for new or updated rows and account for changing schemas. However, it is typically preferred because much less data is processed and fewer updates need to be made in the data lake. All the cloud ELT vendors support incremental extractions from your sources.
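A common way to implement incremental extraction is a high-water mark: remember the latest `updated_at` seen on the previous sync and pull only rows newer than that. The snippet below is a minimal sketch, assuming a source whose rows carry an `updated_at` column; in practice the row list would come from an API call or SQL query:

```python
from datetime import datetime, timezone

# Hypothetical source rows; in a real pipeline these would be fetched
# from the source system via API or SQL.
source_rows = [
    {"id": 1, "updated_at": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2023, 6, 1, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2023, 9, 1, tzinfo=timezone.utc)},
]

def incremental_extract(rows, watermark):
    """Return only rows created or updated since the last sync."""
    return [r for r in rows if r["updated_at"] > watermark]

# The watermark is the high-water mark persisted after the previous sync.
last_sync = datetime(2023, 3, 1, tzinfo=timezone.utc)
changed = incremental_extract(source_rows, last_sync)

# After loading, save max(updated_at) as the watermark for the next run.
new_watermark = max(r["updated_at"] for r in changed)
```

Note what this sketch cannot see: rows deleted at the source simply stop appearing, which is exactly the deletion problem discussed next.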
The main downside of incremental extraction is handling deletions in a data source. Detecting and implementing deletions is not easy in the general case. Most ELT providers do not guarantee consistency under deletions; in some cases it can be done, or it is handled at the source, e.g. data is never deleted but flagged as deleted (a soft delete).
A complete dump guarantees that you always have an exact replica of the source state. Keep in mind that for analytics this usually does not matter, and keeping the deleted records might even be a requirement.
However you extract data from your data sources, you need to decide how these changes will be reflected in your destination. You can push changes through to existing data in the data lake, or you can store the new data separately from the existing data.
If you are using a database system as a data lake, you can update records in place with the pushed changes. You end up with a close replica of the source system in your Data Lake, and storage is optimized.
The other way is to save the changes without updating existing records. This is pretty much the only way to do it if you use a file system and don't want to add a lot of complexity to your data lake. The benefit is that you keep a history of all the changes that happened to your data.
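The two loading strategies can be sketched side by side. This is a toy illustration, assuming hypothetical row shapes: an in-place upsert keyed by `id` (the database-lake case) versus an append-only change log (the natural fit for a file-based lake like S3):

```python
# Strategy 1: update in place (works when the lake is a database).
# The table holds only the latest state: a close replica of the source.
lake_table = {1: {"id": 1, "plan": "free"}}

def upsert(table, row):
    table[row["id"]] = row

# Strategy 2: append-only (suits a file-based lake such as S3).
# Every version is kept, so the full history of changes is queryable.
change_log = []

def append_change(log, row, extracted_at):
    log.append({"extracted_at": extracted_at, "row": row})

new_row = {"id": 1, "plan": "pro"}
upsert(lake_table, new_row)
append_change(change_log, new_row, extracted_at="2023-06-01T00:00:00Z")
```

With strategy 1 the old `"free"` value is gone after the upsert; with strategy 2 each sync appends another record, trading storage for history.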