Machine Learning Project Checklist

3 min read · Apr 28, 2019

As a Data Scientist or Machine Learning Engineer working on a real-world Machine Learning project, you should go through this checklist to make sure that everything goes well throughout the whole pipeline.

There are eight main steps:

Frame the problem
Get the data
Explore the data
Prepare the data
Model the data
Fine-tune the models
Present the solution
Launch the ML system

1. Frame the problem

The very first step is where the (business) objective is defined and assumptions about the solution are made. The following questions need to be answered:

What are the current solutions/workarounds (if any)?
How should the problem be framed (supervised/unsupervised, online/offline, etc.)?
How should performance be measured (metrics)?
What is the minimum performance needed to reach the business objective?
How could the problem be solved manually?
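
As a minimal sketch of the last three questions, a metric can be written down as a function and the current workaround scored against the minimum acceptable performance. The numbers, the choice of RMSE, and the threshold `MAX_ACCEPTABLE_RMSE` are assumptions made up for illustration; a real project would pick the metric and threshold from the business objective.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical values, for illustration only.
y_true = [3.0, 5.0, 2.0, 7.0]
baseline_pred = [4.0, 4.0, 4.0, 4.0]  # stand-in for the current workaround
MAX_ACCEPTABLE_RMSE = 1.0             # assumed business threshold

baseline_rmse = rmse(y_true, baseline_pred)
meets_objective = baseline_rmse <= MAX_ACCEPTABLE_RMSE
print(f"baseline RMSE = {baseline_rmse:.3f}, meets objective: {meets_objective}")
```

Scoring the existing workaround first gives every later model a reference point to beat.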

2. Get the data

In this step, the required data is gathered and split into a training set and a test set. The following steps must be considered:

List the types and quantity of data required
Find appropriate sources for the data
Check how much storage space it will take
Check legal obligations, and get authorization if necessary
Get the data
Convert the data into the required format
Ensure sensitive information is deleted or protected
Split the data into a training set and a test set
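
The final split can be sketched with the standard library alone. The function name `train_test_split`, the 80/20 ratio, and the seed are assumptions for illustration; a fixed seed keeps the split reproducible, so the same rows stay in the test set between runs.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of `rows` and return (train, test)."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(rows)                  # never reorder the original data
    random.Random(seed).shuffle(shuffled)  # local RNG, reproducible
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # placeholder for real records
train_set, test_set = train_test_split(rows)
print(len(train_set), len(test_set))
```

Once created, the test set is put aside and not inspected until the final evaluation.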

3. Explore the data

The gathered data is explored to gain insights and to understand it more deeply. The following sub-steps should be considered in this step:

Make a copy of the data to explore
Visualize the data
Study each attribute (name, data type, missing values)
Study the correlations between attributes
Document everything learned about the data
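
A rough sketch of the attribute and correlation checks, using only the standard library. The tiny `data` table and the helper names `summarize` and `pearson` are hypothetical; in practice a data-frame library would usually do this work.

```python
import math

def summarize(rows, column):
    """Count values, missing values, and observed types for one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
    }

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data, for illustration only.
data = [
    {"rooms": 2, "price": 100.0},
    {"rooms": 3, "price": 150.0},
    {"rooms": 4, "price": None},
    {"rooms": 5, "price": 250.0},
]
exploration = [dict(row) for row in data]  # explore a copy, not the original

price_summary = summarize(exploration, "price")
complete = [row for row in exploration if row["price"] is not None]
corr = pearson([r["rooms"] for r in complete], [r["price"] for r in complete])
print(price_summary, round(corr, 3))
```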

4. Prepare the data

Once the data is understood, it needs to be cleaned and pre-processed before training the models.

- Data Cleaning:

Fix or remove outliers (binning & clipping)
Fill in missing values or drop rows/columns (scrubbing)
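
The two cleaning steps could look like the sketch below. Filling with the median and clipping to the range 0 to 10 are assumptions chosen for illustration; the right strategy and bounds depend on the data.

```python
import statistics

def fill_missing(values):
    """Replace None with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]

def clip(values, lower, upper):
    """Clamp every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

raw = [1.0, 2.0, None, 3.0, 100.0]
filled = fill_missing(raw)                     # median of [1, 2, 3, 100] is 2.5
cleaned = clip(filled, lower=0.0, upper=10.0)  # 100.0 is clipped to 10.0
print(cleaned)
```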

- Feature engineering:

Select appropriate features
Discretize continuous features
Feature crossing: create a new feature by combining two or more features
Feature scaling: standardize or normalize features
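
The last three feature engineering steps can be sketched as small functions. The bin edges, the example values, and the helper names `discretize`, `cross`, and `standardize` are assumptions for illustration.

```python
import bisect
import statistics

def discretize(value, edges):
    """Return the index of the bin that `value` falls into (edges must be sorted)."""
    return bisect.bisect_right(edges, value)

def cross(a, b):
    """Feature cross: combine two categorical values into a single category."""
    return f"{a}_x_{b}"

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

ages = [15, 30, 45, 70]
age_bins = [discretize(a, edges=[18, 40, 65]) for a in ages]  # [0, 1, 2, 3]
crossed = cross(age_bins[1], "urban")                         # "1_x_urban"
scaled = standardize([2.0, 4.0, 6.0])
print(age_bins, crossed, scaled)
```

In a real pipeline the mean and standard deviation would be computed on the training set only and then reused for the test set.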

5. Model the data

Now it's time to train various models (algorithms) and short-list the most promising ones. The following sub-steps need to be considered:

Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters
Measure and compare their performance (use N-fold cross-validation)
Analyze the most significant variables for each algorithm
Analyze the types of errors the models make
Short-list the top three to five most promising models
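
A minimal sketch of comparing quick models with N-fold cross-validation follows. The two "models" (a constant mean predictor and a one-variable least-squares line), the synthetic data, and all function names are assumptions for illustration, not a recommended model set. The folds are contiguous, so real data should be shuffled first.

```python
import math

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, val
        start = stop

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """One-variable least-squares linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def cv_rmse(fit, xs, ys, k=5):
    """Average validation RMSE of `fit` over k folds."""
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq = [(model(xs[i]) - ys[i]) ** 2 for i in val]
        scores.append(math.sqrt(sum(sq) / len(sq)))
    return sum(scores) / len(scores)

# Synthetic, noiseless data for illustration only.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: cv_rmse(fit, xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```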

6. Fine-tune the models

From the few short-listed models, we need to find the best-performing one. The major tasks to be considered are:

- Fine-tune the hyper-parameters using cross-validation

Grid Search / Random Search
Bayesian Optimization

- Try Ensemble Methods (combining best models)

- Once the best model is chosen, measure its performance on the test set to estimate the generalization error.
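
Grid search with cross-validation can be sketched as a loop over candidate hyper-parameter values. The model here (one-variable ridge regression without intercept), the candidate `alpha` values, and the synthetic data are assumptions for illustration; random search would sample `alpha` at random instead of walking a fixed grid.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Closed-form 1-D ridge regression without intercept: w = sum(xy) / (sum(x^2) + alpha)."""
    w = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)
    return lambda x: w * x

def cv_mse(xs, ys, alpha, k=3):
    """Mean validation MSE over k interleaved folds (index i goes to fold i % k)."""
    total = 0.0
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        val = [i for i in range(len(xs)) if i % k == fold]
        model = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        total += sum((model(xs[i]) - ys[i]) ** 2 for i in val) / len(val)
    return total / k

# Synthetic, noiseless training data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0 * x for x in xs]

alpha_grid = [0.0, 0.1, 1.0, 10.0]  # hypothetical candidate values
cv_scores = {alpha: cv_mse(xs, ys, alpha) for alpha in alpha_grid}
best_alpha = min(cv_scores, key=cv_scores.get)
print("best alpha:", best_alpha, "CV MSE:", cv_scores[best_alpha])
```

The search uses only training data; the test set stays untouched until the single final evaluation.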

7. Present the solution

This step is much less technical. Everything done so far should be properly documented and presented as clearly as possible, highlighting the big picture. Remember the following steps:

Document everything that has been done so far
Create a nice presentation
Explain why this solution achieves the business objective
Don’t forget to present interesting points noticed along the way
Describe what worked and what did not (list the assumptions and the system’s limitations)
Ensure that key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).

8. Launch the ML system

Finally, it's time to launch and deploy the ML system on the desired platform. As it is a software solution, testing and maintenance also fall under this step. The following things need to be done:

Get your solution ready for production (plug in production data, write unit tests, etc.)
Monitor the performance at regular intervals
Retrain the models on a regular basis on fresh data
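
Monitoring can be as simple as tracking a rolling error and flagging when it crosses a limit. The class name `PerformanceMonitor`, the window size, and the error threshold are hypothetical; a production system would also need alerting and logging, which are not shown.

```python
from collections import deque

class PerformanceMonitor:
    """Track a rolling mean absolute error and flag degradation."""

    def __init__(self, window, max_mean_error):
        self.errors = deque(maxlen=window)  # keeps only the latest `window` errors
        self.max_mean_error = max_mean_error

    def record(self, y_true, y_pred):
        self.errors.append(abs(y_true - y_pred))

    def needs_retraining(self):
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough observations yet
        return sum(self.errors) / len(self.errors) > self.max_mean_error

monitor = PerformanceMonitor(window=3, max_mean_error=1.0)

# Hypothetical (actual, predicted) pairs while the model still performs well.
for y_true, y_pred in [(10, 10.5), (12, 11.5), (9, 9.2)]:
    monitor.record(y_true, y_pred)
flag_before = monitor.needs_retraining()

# Hypothetical pairs after the data has drifted.
for y_true, y_pred in [(10, 13), (12, 15), (9, 5)]:
    monitor.record(y_true, y_pred)
flag_after = monitor.needs_retraining()
print(flag_before, flag_after)
```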

Extra Tips:

Always work on copies of the data
Never look at the test set (use it only for final model testing)
Try to automate as many of the above steps as possible
Write functions for all data transformations (for reuse)
Feel free to adapt the checklist to your needs
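
The tip about writing functions for data transformations can be sketched as small steps chained into one reusable pipeline. The helper `compose` and the three steps are hypothetical names for illustration.

```python
from functools import reduce

def compose(*steps):
    """Chain transformation functions left to right into a single callable."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

def drop_missing(values):
    return [v for v in values if v is not None]

def to_float(values):
    return [float(v) for v in values]

def min_max_scale(values):
    """Scale values linearly into [0, 1]; assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

prepare = compose(drop_missing, to_float, min_max_scale)
prepared = prepare([1, None, 3, 5])  # [0.0, 0.5, 1.0]
print(prepared)
```

This version recomputes the scaling bounds on whatever data it receives. A fuller design would learn them on the training set and reuse them on new data.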

Originally published at https://exploreml.blogspot.com on April 28, 2019.