
How to structure Machine Learning projects (inspired by the deeplearning.ai Deep Learning Specialization)

Machine Learning projects, like any other, can benefit from proper frameworks.

The heavy engineering design principles behind Deep Learning make it especially suited for that, so you can ask yourself the following as a starting point:

  • Is my model fitting my training data well without overfitting it?
  • Are my predictions reliable in the test set?
  • Can I be certain that my model will work in the real world?

Block 1 - Core Principles, what we all need

  1. Single number evaluation metric: ideally we need one single number to agree on whether we are making progress on our long backlog of experiments. For classification problems this could be the F1 score (https://towardsdatascience.com/the-f1-score-bec2bbc38aa6), and for regression problems the Mean Squared Error (https://medium.com/nothingaholic/understanding-the-mean-squared-error-df41e2c87958). Note that this does not include computation cost and business metrics; a score considering ML, business and computation/complexity would be ideal. If the metric does not guide continuous improvement, change it! (See the first sketch after this list.)
    • Define a baseline, whether it is a random guess or human-level performance. No project should start without a baseline based on the current quality and uncertainty of decision making.
    • That baseline is not the target, it is the status quo.
  2. Establishing useful and valid data sets: data is scarce, sometimes inaccurate and always imperfect. Bear in mind:
    • The distributions of the training and test sets should be similar, if not the same, which requires regular checks for distribution changes (drift, structural changes...) and random sampling
    • have as little bias as possible, i.e. random stratified sampling! https://medium.com/@itbodhi/handling-imbalanced-data-sets-in-machine-learning-5e5f33c70163
    • Data size needs grow exponentially with the complexity of the model and the number of variables/features. Data quality becomes essential below ~10k observations and less so above ~100k. https://landing.ai/data-centric-ai/
  3. Assess what the most productive next task is in the continuous improvement cycle (see the second sketch after this list):
    • If your training error is much lower than your validation error, you need to regularize and reduce overfitting (reduce the number of parameters or increase the penalty on parameter variance).
    • If your training and validation errors are both still far from your target, you can:
      • increase the training data set
      • increase model depth or complexity
      • look at new state-of-the-art or pretrained models
      • do error analysis
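
As a minimal sketch of the single-number metric plus baseline idea, assuming scikit-learn; the synthetic data, the DummyClassifier baseline and the LogisticRegression model are illustrative stand-ins, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data standing in for a real problem.
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The baseline is the status quo, not the target; a naive classifier stands in for it here.
baseline = DummyClassifier(strategy="stratified", random_state=0).fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline F1:", round(f1_score(y_test, baseline.predict(X_test)), 3))
print("model F1:   ", round(f1_score(y_test, model.predict(X_test)), 3))
```

And a second, equally rough sketch of the "what to work on next" decision in point 3; the error values and the 0.02 gap tolerance are made-up assumptions, not a formal recipe:

```python
def next_step(train_error: float, val_error: float, target_error: float,
              gap_tolerance: float = 0.02) -> str:
    """Toy heuristic mirroring point 3: decide whether bias or variance dominates."""
    if val_error - train_error > gap_tolerance:
        return "High variance: regularize, simplify the model, or get more data."
    if train_error > target_error + gap_tolerance:
        return "High bias: increase model capacity, train longer, or try pretrained models."
    return "Close to target: run error analysis to find the remaining failure modes."

print(next_step(train_error=0.02, val_error=0.15, target_error=0.05))  # variance problem
print(next_step(train_error=0.20, val_error=0.22, target_error=0.05))  # bias problem
```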
Although the capacity of ML and DL continues to amaze practitioners and researchers alike, bear in mind:

Surpassing human-level performance has been shown in advertising, medicine, product recommendation, logistics and scoring problems, but it will not always be achievable, or it could become very challenging.

Block 2: Continuous improvement, what to look at and how to improve performance

There are several techniques to identify the most promising tweak for your algorithm.

Probably, a good place to start is error analysis, which basically consists of trying to find patterns in the samples that your model is not able to classify or predict well.

Errors can have multiple causes, but the following list covers the majority of them:

  • too little data on certain classes/products/objects...
  • inconsistency in the target variable (bad labelling)
  • wrong model for the given data problem
  • lack of regularization or tricks to avoid overfitting
  • wrong training, validation and test data split
  • lack of proper data preprocessing to harmonize the data input
  • too asymmetric distributions between training and test sets
How can you evaluate and solve each case?

1) Check whether there is a reasonable balance in each class or key group you are analyzing. If not, consider stratified sampling, putting more weight on the underrepresented cases in the cost function, or creating augmentations of the cases with little presence.
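
A minimal sketch of the class-weighting option, assuming scikit-learn; the toy labels and random features are placeholders for your own data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 950 + [1] * 50)           # imbalanced toy labels
X_train = np.random.RandomState(0).randn(1000, 5)  # placeholder features

# "balanced" weights are inversely proportional to class frequency.
weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
print(dict(zip(np.unique(y_train), weights)))      # the minority class gets a larger weight

# Most estimators accept class_weight directly and do this computation for you.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
```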

2) A random sample of 100 misclassified cases can help you detect target variable issues. If there is a pattern, you can estimate how many cases are affected and apply corrections or filter them out of the evaluation.
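
A small sketch of that sampling step, assuming pandas and NumPy; the labels and predictions below are synthetic stand-ins for your own model's validation output:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_pred = np.where(rng.random(1000) < 0.85, y_true, 1 - y_true)  # ~15% synthetic errors

df = pd.DataFrame({"y_true": y_true, "y_pred": y_pred})
wrong = df[df.y_true != df.y_pred]

# Review up to 100 misclassified rows by hand and tag the suspected cause of each;
# counting the tags afterwards tells you which error source dominates.
sample = wrong.sample(n=min(100, len(wrong)), random_state=0)
print(f"{len(wrong)} errors out of {len(df)}; reviewing {len(sample)} of them")
```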

3) There is no model that fits all problems, and even within neural networks choosing the right architecture can decide the failure or success of a project. You need to understand the data generating process, the data volumes/formats to expect, and the level of interpretability required in order to choose the most appropriate model.

4) In deep learning, regularization, hold-out sets and residual layers help you avoid overfitting the data. This is particularly important for very deep neural nets or flexible models like GAMs.

5) As a rule of thumb, with a small data set a train/val/test split of 60/20/20 works well. For large data sets (>100k), 95/2.5/2.5 should work too. Make sure the split is random and each class is represented in each split.
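
A minimal sketch of a stratified 60/20/20 split, assuming scikit-learn; two calls are used because train_test_split only produces two partitions at a time:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)

# First carve out 40%, then split that 40% evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0
)
print(len(X_train), len(X_val), len(X_test))  # 3000 1000 1000
```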

6) Feature engineering and data preprocessing are a must for any successful model, unless you are working with toy/textbook data. For real applications, get your hands dirty and make the data as harmonized and consistent as possible for the algorithm.
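
One way to keep that harmonization consistent between training and serving is to wrap it in a pipeline; a minimal sketch assuming scikit-learn and pandas, with made-up columns (age, income, country):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),                  # scale numeric features
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),  # encode categoricals
])

# The same fitted pipeline is applied to training, validation and production data,
# so the preprocessing step cannot silently diverge between them.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

toy = pd.DataFrame({"age": [25, 40, 31, 58], "income": [30e3, 52e3, 41e3, 75e3],
                    "country": ["ES", "DE", "ES", "FR"]})
model.fit(toy, [0, 1, 0, 1])
```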

7) If your application faces very different data in the test set or in real life, you should gather that data as part of the training data set. Aim as much as possible for distribution similarity and use moving windows on data usage if change is a constant.
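
A rough sketch of a per-feature drift check, assuming SciPy's two-sample Kolmogorov-Smirnov test; the 0.05 threshold and the synthetic shifted distribution are assumptions for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.3, scale=1.0, size=5000)  # deliberately shifted

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.05:
    print(f"Drift detected (KS statistic={stat:.3f}): consider retraining on recent data.")
else:
    print("No significant drift detected for this feature.")
```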

This is incredibly to the point, and probably too generic, but it is a great starting point as a framework for iterating on your ML projects. It helps me prioritize my team's efforts, and I hope it helps yours!





