The data science workflow is a non-linear, iterative process that requires many skills and tools. From framing your business problem to generating actionable insights, this tutorial gives a high-level overview of the steps that data science projects follow.
Before delving into the complexities of the data, take the time to define the business problem. Companies should be rigorous in defining the problems they’re attempting to solve and in articulating why those problems are important. Without that rigor, organizations may end up pursuing data science initiatives that aren’t aligned with their strategies.
Stages of a Data Science Project
- Business Understanding – First, we need to understand the business objectives. For example, if you are a retail store owner, your goal may be to improve sales by identifying what drives them. To achieve that goal, you may need to answer sub-questions such as: Which products are the most profitable? How are the in-store promotions working? Are the product placements effective?
The goal of this stage is to uncover the important factors that could influence the outcome of the project. A project plan should also be put in place, listing the stages of the project, the duration of each stage, the resources required, and the outputs at the end of every stage. The project plan is a living document and should be reviewed frequently.
- Data Collection – This is the stage of the project where you decide on the data that you’re going to use for analysis. For a successful data collection process, identify the factors that affect your business objective defined in the previous stage.
Continuing our previous example, if your objective is to boost the sales of your retail store, the factors affecting sales may include promotions, product placement, store location, store staff, store hours, competitor locations and promotions, and product pricing. Listing these factors clarifies which data needs to be procured for the analysis. At the end of this stage, collect the data that covers all of the factors listed.
- Data Preparation – Data preparation involves raising the data quality to the level required by the analysis techniques that you’ve selected. This may involve selecting clean subsets of the data, inserting suitable defaults, or using more complex methods such as estimating missing values by modeling. After the data is cleaned, the various data sources need to be integrated to create the final dataset to be analyzed. Integrating data involves merging two or more tables that hold different information about the same objects, and summarizing fields in a table by aggregation.
- Exploratory Data Analysis – During the exploration phase, we try to understand what patterns and values our data contains. We use different types of visualizations and sometimes use statistical tests to back up our findings. Most EDA techniques are graphical in nature, with only a few quantitative techniques. The main role of EDA is to explore the data open-mindedly with visualizations in order to gain new and unexpected insights.
- Modeling – In this stage, you’ll select the actual modeling technique that you’ll be using. Although you may have already selected a tool during the business understanding phase, at this stage you’ll be selecting the specific modeling technique, e.g. linear regression, k-means clustering, or a decision tree. Often a single model cannot generate satisfactory results. In such a case, an ensemble of models can be used for the task. After assessing the model, revise the parameter settings and tune them for the next modeling run. Iterate on model building and assessment until you are confident that you have found the best model.
- Model Evaluation – This step helps you find the model that best represents your data and estimate how well the chosen model will work in the future. Predictive modeling works on a constructive feedback principle: you build a model, get feedback from metrics, make improvements, and continue until you achieve the desired accuracy. Model evaluation explains the performance of a model. The aim is not simply to build a predictive model, but to create and select a model that gives high accuracy on test data. Hence, it is crucial to check the accuracy of the model before using it to compute predicted values.
- Deployment – In this stage, you’ll take your evaluation results and determine a strategy for their deployment. Monitoring and maintenance of reports are key issues to consider if the data mining result becomes part of the day-to-day business and its environment.
At the end of the project, you will write up a final report. Depending on the deployment plan, this report may be only a summary of the project and its experiences, or it may be a tool or dashboard that is updated on a regular cadence.
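As a rough illustration, the data preparation, modeling, and model evaluation stages above can be sketched on a toy version of the retail example. Everything in this sketch is an assumption made for illustration: the records, the field names (`promo_spend`, `units_sold`, `location`), the helper functions, the choice of a one-variable linear regression, and the use of mean absolute error as the metric. It uses only the Python standard library to stay self-contained; real projects usually rely on dedicated data and modeling libraries.

```python
from statistics import mean

# Hypothetical weekly sales records; None marks a missing value.
sales = [
    {"store_id": 1, "week": 1, "promo_spend": 100.0, "units_sold": 220},
    {"store_id": 1, "week": 2, "promo_spend": 150.0, "units_sold": 320},
    {"store_id": 1, "week": 3, "promo_spend": None, "units_sold": 270},
    {"store_id": 2, "week": 1, "promo_spend": 50.0, "units_sold": 120},
    {"store_id": 2, "week": 2, "promo_spend": 200.0, "units_sold": 420},
    {"store_id": 2, "week": 3, "promo_spend": 120.0, "units_sold": 260},
]

# Hypothetical second data source with different information about the same stores.
stores = {1: {"location": "mall"}, 2: {"location": "street"}}


def impute_mean(rows, field):
    """Return copies of rows with missing `field` values replaced by the observed mean."""
    default = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: default if r[field] is None else r[field]} for r in rows]


def merge_on_store(rows, store_table):
    """Attach store attributes to each sales row (an inner join on store_id)."""
    return [
        {**r, **store_table[r["store_id"]]}
        for r in rows
        if r["store_id"] in store_table
    ]


def total_by(rows, key, value):
    """Summarize `value` by summing it within each distinct `key`."""
    totals = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0) + r[value]
    return totals


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Data preparation: clean (impute), integrate (merge), summarize (aggregate).
prepared = merge_on_store(impute_mean(sales, "promo_spend"), stores)
units_by_location = total_by(prepared, "location", "units_sold")

# Modeling: fit on the first four rows and hold out the last two.
train, test = prepared[:4], prepared[4:]
slope, intercept = fit_line(
    [r["promo_spend"] for r in train],
    [r["units_sold"] for r in train],
)

# Model evaluation: measure the error on rows the model has not seen.
predictions = [slope * r["promo_spend"] + intercept for r in test]
mae = mean_absolute_error([r["units_sold"] for r in test], predictions)

print(units_by_location)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, held-out MAE={mae:.2f}")
```

Holding out the last two rows is the simplest possible stand-in for a test set. With so little data the numbers themselves mean nothing, but the loop of preparing, fitting, measuring, and revising is the same one described in the stages above.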
When working with big data, it is advantageous for data scientists to follow a well-defined data science workflow. Whether a data scientist wants to convey a story through data visualization or build a data model, the workflow matters.
Having a standard workflow for data science projects keeps the various teams within an organization in sync, so that delays can be avoided. The stages described in this tutorial are an idealized sequence of events. In practice, many of the tasks can be performed in a different order, and it may be necessary to backtrack to previous tasks and repeat certain actions.