How to Consolidate Your Data Sources

If you work at a SaaS company, you likely have more data sources, and more data to work with, than you can comfortably manage. Your financial data lives in Stripe; your customer data in Salesforce, Intercom and Zendesk; your product analytics data in Jira and Google Analytics. Accessing each of these sources one by one, an arrangement known as the point-to-point model, quickly becomes frustrating. More importantly, if you haven’t consolidated your data, you risk being unable to scale as the amount of data in your company grows.

In this article, we’ll introduce some of the models for combining your data sources.

Point-to-Point Model

The point-to-point model is a simple layout where every tool communicates directly with every other tool to share information. Typically, this is what happens when you first start out. An example is using Salesforce to retrieve customer data, Google Analytics for your website traffic data, and HubSpot for your marketing data.

As you add more tools with increasingly diverse data, this layout becomes harder to manage, and it is cumbersome for each employee to request data from each source individually. At that point, you need to consolidate your data.

Hub-and-Spoke Model

The hub-and-spoke model is an approach where all your data sources connect to one central place, where the data is consolidated for multifaceted data management. This simplifies the connections and scales better as you add more data sources to your toolbox. The hub is typically an Online Analytical Processing (OLAP) system.
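As a rough illustration of what "consolidating into one place" can mean, the sketch below merges rows from several tools into one record per customer. The source names, the field names, and the choice of email as the shared key are all assumptions made for this example; a real hub would pull data through each tool's own export or API.

```python
from collections import defaultdict

# Hypothetical extracts from three tools; every field name here is illustrative.
stripe_rows = [{"email": "ana@example.com", "mrr_usd": 120}]
salesforce_rows = [{"email": "ana@example.com", "account_owner": "Lee"}]
zendesk_rows = [
    {"email": "ana@example.com", "open_tickets": 2},
    {"email": "bo@example.com", "open_tickets": 1},
]


def consolidate(sources):
    """Merge rows from every source into one record per customer email."""
    hub = defaultdict(dict)
    for source_name, rows in sources.items():
        for row in rows:
            key = row["email"]
            for field, value in row.items():
                if field != "email":
                    # Prefix with the source name so fields never collide.
                    hub[key][f"{source_name}.{field}"] = value
    return dict(hub)


hub = consolidate({
    "stripe": stripe_rows,
    "salesforce": salesforce_rows,
    "zendesk": zendesk_rows,
})
print(hub["ana@example.com"])
```

Each spoke only needs to know how to deliver its rows to the hub; no tool needs to know about any other tool.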

Benefits of a hub-and-spoke over a point-to-point model:

  • Data Visibility – Centralized data management gives you a view of all your data.
  • Save Time – Business users can quickly access data from multiple sources in one central location, so no time is wasted retrieving data from each source.
  • Scalable – The number of connections in the hub-and-spoke model grows linearly with the number of tools, whereas in the point-to-point model it grows quadratically.
  • Improve Insight – Having all the data in one central place gives you the awareness needed for big-picture insight.
  • Improve Security – Managing who has access to your data is much easier when there’s a centralized connection point.
  • Connect with your favorite BI tool – Create dashboards and reports by stitching different data sources together in real time. For example, you can try Chartio, which has a drag-and-drop interface that requires no SQL knowledge.
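To make the scalability point concrete, here is a minimal sketch that counts connections in each layout. It assumes, as in the description of the point-to-point model, that every tool connects to every other tool, and that in hub-and-spoke every tool connects only to the hub. The function names are made up for this example.

```python
def point_to_point_links(n_tools: int) -> int:
    """Every tool linked directly to every other tool: n * (n - 1) / 2."""
    return n_tools * (n_tools - 1) // 2


def hub_and_spoke_links(n_tools: int) -> int:
    """Every tool linked only to the central hub: one link per tool."""
    return n_tools


for n in (3, 6, 12):
    print(n, point_to_point_links(n), hub_and_spoke_links(n))
```

With 12 tools, the point-to-point layout needs 66 connections while the hub-and-spoke layout needs 12.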

Here are some common hub-and-spoke setups for consolidating your data sources. We will start with the different environments (on-premise, hybrid and public cloud) and then cover two data storage solutions: data lakes and data warehouses.

On-Premise Data Centers

An on-premise solution, otherwise known as a “private cloud”, is a good idea if you have very sensitive data and would like to control every aspect of the servers in your data center. You can run your own systems in an environment you control. An example is running your own Hadoop cluster on your own hardware, without exposing customer data to third-party cloud providers.

Advantages of on-premise:

  • More flexibility – you can customize the environment to meet your specific business needs.
  • Improved security – resources are not shared with other companies, so your data is less exposed.

If you decide on an on-premise solution, be aware that it can be expensive. Along with the upfront hardware costs, keep in mind the high maintenance and security costs. It is also risky to have your data sitting in one place, and managing the security and safety of these servers can be difficult.

Hybrid On-Prem with Cloud

This is a good solution for companies that have an existing on-premise setup and are looking to migrate to the cloud. It offers the best of both worlds: you keep control of the infrastructure you already know while gaining the benefits of consolidating in the cloud.

Advantages of hybrid clouds:

  • Control—you can maintain a private infrastructure for sensitive assets.
  • Flexibility—when you need additional resources, you can use the public cloud.
  • Cost-effectiveness—if you need to scale, you pay just for the extra computing power from the public cloud service.
  • Ease of transitioning—you can migrate in incremental steps towards the cloud.
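One way to picture the control and flexibility points is a simple placement rule that keeps sensitive datasets on-premise and sends everything else to the public cloud. This is only a sketch; the `Dataset` class, the `contains_sensitive_data` flag and the location labels are hypothetical, and a real policy would likely weigh more factors.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    contains_sensitive_data: bool


def choose_location(dataset: Dataset) -> str:
    """Keep sensitive datasets on-premise; send everything else to the public cloud."""
    return "on_premise" if dataset.contains_sensitive_data else "public_cloud"


datasets = [
    Dataset("customer_payment_details", contains_sensitive_data=True),
    Dataset("website_traffic", contains_sensitive_data=False),
]
placement = {d.name: choose_location(d) for d in datasets}
print(placement)
```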

Public Cloud

You can also consolidate your data with a vendor that provides cloud services, which avoids the hardware cost of setting up your own servers.

Advantages of public clouds:

  • Lower costs—no need to purchase hardware or software, and you pay only for the service you use. Note that they may get more expensive than running your own when you reach a certain scale.
  • No maintenance—your service provider provides the maintenance.
  • Near-unlimited scalability—on-demand resources are available to meet your business needs.
  • High reliability—a vast network of servers will mitigate against failure.
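The note on costs above can be worked through with some simple arithmetic. The sketch below assumes a deliberately simplified cost model: the public cloud charges a price per terabyte stored, while running your own servers has a fixed monthly cost plus a lower price per terabyte. All figures are invented for illustration and are not real vendor prices.

```python
def monthly_cost_public(tb_stored, price_per_tb):
    """Pay-as-you-go: cost is proportional to what you store."""
    return tb_stored * price_per_tb


def monthly_cost_on_prem(tb_stored, fixed_monthly, price_per_tb):
    """Own servers: a fixed monthly cost plus a smaller per-terabyte cost."""
    return fixed_monthly + tb_stored * price_per_tb


def break_even_tb(fixed_monthly, public_per_tb, on_prem_per_tb):
    """Volume at which both options cost the same.

    Returns None if the public cloud never becomes the more expensive option.
    """
    if public_per_tb <= on_prem_per_tb:
        return None
    return fixed_monthly / (public_per_tb - on_prem_per_tb)


# Made-up figures: 4000 fixed per month, 25 per TB public, 5 per TB on-premise.
threshold = break_even_tb(fixed_monthly=4000, public_per_tb=25, on_prem_per_tb=5)
print(threshold)
print(monthly_cost_public(300, 25), monthly_cost_on_prem(300, 4000, 5))
```

Under these assumed figures the two options cost the same at 200 TB; below that the public cloud is cheaper, and above it running your own servers is cheaper.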

Data Storage Solutions

Now that we have gone through the different cloud environments, let’s go through two popular data storage solutions: data lakes and data warehouses. Many of these exist in the public cloud, provided by companies like Google, Amazon and Microsoft, but there are on-premise solutions as well.

Data Lake

A data lake is a storage solution, often cloud-based, for all of your collected data. You can connect a data lake to all of your various tools and store their data without a predefined schema, then combine and aggregate the different data sets. Because it is schemaless, a data lake can store structured, semi-structured and unstructured data, and nobody needs to preset data types or model the data beforehand. You just load in the raw data and the lake stores it. This flexibility is especially useful for data that arrives at high velocity and may change its structure, which would otherwise require complex schema management. You then provide the schema when reading from the data lake (schema-on-read).
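A small sketch may help show what schema-on-read looks like in practice. Here the "lake" is just a list of raw JSON lines with differing fields, and the reader supplies the schema at query time. The event fields, the `purchase_schema` mapping and the `read_with_schema` helper are all hypothetical names chosen for this example.

```python
import json

# Raw events as they might land in a data lake: nothing is enforced on write,
# so records can have different fields and inconsistent types.
raw_lines = [
    '{"user": "ana", "event": "login", "ts": "2020-01-05T10:00:00"}',
    '{"user": "bo", "event": "purchase", "amount": "19.99"}',
    '{"user": "cy", "event": "purchase", "amount": 5, "coupon": "NEW"}',
]

# The reader supplies the schema: field name -> (type to cast to, default value).
purchase_schema = {"user": (str, ""), "amount": (float, 0.0)}


def read_with_schema(lines, schema, event_type):
    """Parse raw lines and apply the schema only at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if record.get("event") != event_type:
            continue
        rows.append({
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        })
    return rows


purchases = read_with_schema(raw_lines, purchase_schema, "purchase")
print(purchases)
```

The extra `coupon` field and the string-typed amount did not block the write; the reader decides which fields matter and how to interpret them.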

Storing data in a data lake also tends to be cheaper than in a data warehouse, particularly because data lake technologies like Hadoop are open source and can run on low-cost hardware.

Data Warehouse

Another way to consolidate your data sources is to use a data warehouse. It is a database system that stores the data from your various tools in a central location. An ETL (Extract, Transform, Load) tool processes this data into a structured state and loads it into the warehouse, where a BI (Business Intelligence) tool can query it. The major difference from a data lake is that a data warehouse needs a model of the schema beforehand (schema-on-write), which benefits the users downstream of the data warehouse, namely the BI users who create reports and dashboards. These benefits include better query performance, because structured data can be searched more efficiently. Data warehousing tools are also more mature, as they have been on the market for much longer. Thus, use a data warehouse where query performance, for example for real-time dashboards, is critical.
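For contrast with schema-on-read, the following sketch shows schema-on-write using Python's built-in `sqlite3` module. SQLite is only standing in for a data warehouse here, and the `purchases` table and its columns are assumptions made for the example. The point is that the schema is declared before any data is loaded, and rows that do not fit are rejected at load time.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")

# Schema-on-write: the table structure is defined before loading any data.
conn.execute(
    """
    CREATE TABLE purchases (
        customer TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)

# The second row has no amount, so it violates the declared schema.
rows = [("ana", 19.99), ("bo", None)]
loaded, rejected = 0, 0
for row in rows:
    try:
        conn.execute("INSERT INTO purchases (customer, amount) VALUES (?, ?)", row)
        loaded += 1
    except sqlite3.IntegrityError:
        rejected += 1

# Downstream queries can rely on every stored row matching the schema.
total = conn.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]
print(loaded, rejected, total)
```

Because bad rows are caught on the way in, a BI user querying the table does not need to handle missing or malformed amounts.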

To learn more about data warehouses, have a look at our tutorial going deeper into what makes a good data warehouse. For our comparisons between different data warehouses, look at:

Conclusion

Moving from a point-to-point model to a hub-and-spoke model helps you deal with increasing data volume and variety. There are many options for consolidating your data sources under the hub-and-spoke model, whether you use your own servers, a cloud provider, or a hybrid of both. Whichever option you choose, consolidating your data sources will help you scale as more data arrives from all your different sources.

Jonathan Kurniawan

About Jonathan Kurniawan

Hi! I'm Jonathan Kurniawan. I have 4 years of experience working as a software engineer at Dolby on various different products. I'm currently pursuing my MBA from Hult International Business School and received my Bachelor in Computer Science from University of New South Wales, Australia. I'm excited to share my knowledge at the Data School by Chartio.