
ML in Production Specialization: Course 1 (DeepLearning.AI)

 The world is full of models in notebooks, but how many manage to make it to production and add value?

In order to make sure we succeed, let's understand the main challenges:

  • concept and data drift: in a dynamic world, the data distribution or the input-output mapping can change over time. We need to be able to detect this (by checking performance on an updated test set and comparing the distribution of recent data against the training data; see the sketch after this list) and to have a plan of action (retrain, fine-tune, clean, rescope).
  • software engineering issues: design decisions are key and create a legacy: real-time or batch prediction, where the service runs (cloud, edge, or browser) and its dependence on the internet (edge can keep working without internet or during connection problems), and the amount of compute resources required. Other important considerations are latency (time to respond) and throughput (how many queries per second we need to handle). It is also important to decide what data to log for later review and training, as well as the security and privacy requirements.
  • trust: at the beginning, running the app in shadow mode may be ideal to compare what the machine and human judgment lead to. Once results are promising, a small percentage of the traffic is handled by the algorithm and tracked before being scaled up. When a new version is released, the old version keeps running for some time to confirm the new one is significantly better, so we can roll back if necessary.
  • automation level: it is important to understand both the desired and the achievable level of automation. In some cases partial automation is a very good choice: when accuracy is not good enough to run unsupervised, the model is still useful for prioritizing manual effort.
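As a concrete illustration of the distribution check mentioned above, here is a minimal drift-detection sketch (my example, not from the course): it compares a training-time reference sample of one numeric feature against recent production values with a two-sample Kolmogorov-Smirnov test; the feature values and the 0.01 threshold are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)  # feature values seen at training time
recent = rng.normal(loc=0.4, scale=1.0, size=1000)     # recent production values (shifted)

# A small p-value suggests the recent distribution differs from the
# reference distribution, i.e. possible data drift worth investigating.
statistic, p_value = ks_2samp(reference, recent)
if p_value < 0.01:
    print(f"Possible drift: KS statistic={statistic:.3f}, p={p_value:.4f}")
```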

A good design and a good implementation do not mean we are done; monitoring then comes into play:

  • software metrics: memory, compute, latency, throughput, server load
  • input metrics (evaluate the input data): length, distribution, volume, missing data
  • output metrics: length, distribution, quality, consistency, missing output...
We need to define thresholds for these metrics in order to trigger alarms and prompt checks, as in the sketch below.
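A minimal sketch of such threshold-based alerting; the metric names and limits below are hypothetical examples, not values from the course:

```python
# Current metric readings, e.g. gathered from a monitoring system.
metrics = {"latency_ms": 420.0, "queries_per_sec": 35.0, "missing_input_rate": 0.07}

# One acceptance rule per metric; tune these limits to your service.
thresholds = {
    "latency_ms": lambda v: v <= 500,           # respond within 500 ms
    "queries_per_sec": lambda v: v >= 20,       # sustain expected throughput
    "missing_input_rate": lambda v: v <= 0.05,  # at most 5% missing inputs
}

for name, value in metrics.items():
    if not thresholds[name](value):
        print(f"ALERT: {name}={value} outside the acceptable range")
```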

Next, we will take a deep dive into what to take into account when training or building a model:

  1. We need to train a model that does well on the training set AND the test set
  2. Overall metrics need to account for key data points, users, and also biases...
  3. It is also important to account for rare classes when calculating metrics and performance
  4. Establish a baseline, based on a naive or previous model, or on human performance on the task (see the baseline sketch below)
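As a minimal illustration of the naive-model baseline, here is a sketch using scikit-learn's DummyClassifier on synthetic data (my example rather than the course's):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary classification data.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Naive baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The model only adds value if it clearly beats this baseline.
print("baseline accuracy:", baseline.score(X_test, y_test))
print("model accuracy:   ", model.score(X_test, y_test))
```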
In order to get started:

  • A simple, proven framework described in blogs and already used in practice will suffice; it does not need to be the most cutting-edge algorithm in the literature
  • It is important to understand our constraints when selecting candidate models, including compute resources, data requirements, and inference latency
Once you have a model, error analysis and tracking are key:

  • An error analysis allows you to decide and prioritize what to do next
  • for classification problems it is essential to review the confusion matrices
  • you can decide to prioritize the categories with higher potential improvement with respect to the baseline, or those covering a larger share of the data
  • try to understand errors in subsets of data, and how common different types of errors are (FP, FN...); see the sketch below
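A minimal error-analysis sketch with scikit-learn, using made-up labels: the confusion matrix shows where the model confuses classes, and the classification report breaks precision and recall down per class (roughly the FP versus FN view).

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 1, 2, 0])

print(confusion_matrix(y_true, y_pred))       # rows: true class, columns: predicted class
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
```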
ML in production should be data-centric, not model-centric:
  • that means focusing on maximizing data quality
  • keep the code fixed and see how better data quality drives model improvements
  • consider data augmentation to increase model robustness: adding noise to the data helps to increase sample size and generalization (see the sketch after this list)
  • the chosen augmentation should not over-represent some samples, or produce examples too close to other classes (e.g., 1s that become indistinguishable from Is)
  • for structured data, augmentation is difficult, but we can create or select features out of it. If manual features are to be avoided or not required, embeddings can be used instead.
  • focus on high-quality data instead of more data: covering key cases, consistent, with a representative distribution, and properly sized.
  • past performance is no guarantee of future results; we need to keep an automated eye on model performance on new data. Retraining with both past and recent data, with more weight on recent observations, is generally recommended.
  • to further improve model performance, improving the quality of 10% of the data can serve as much as gathering far more data, while being much cheaper
  • good data is consistent in its labeling, covers key cases, is properly sized, and is fairly distributed
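A minimal augmentation sketch for numeric data, assuming small Gaussian noise keeps the samples realistic (the 0.05 noise scale is a hypothetical choice to tune per dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))                 # original feature matrix

# Noisy copies: same labels as the originals, slightly perturbed features.
noise = rng.normal(scale=0.05, size=X.shape)
X_augmented = np.vstack([X, X + noise])       # originals plus noisy copies
print(X_augmented.shape)                      # (200, 5)
```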

Data labeling is challenging; here is some general guidance to improve data consistency:
  • we need very clear definitions of the labels, otherwise the model will struggle to find a pattern
  • What is the input x: what should it contain?
  • data cleaning strategies diverge depending on data size and structure. For small samples, ensuring every sample is correct, via manual or automatic inspection, is a must; for larger and unstructured data sets, labeling a small sample or data augmentation techniques will do the work. For large structured data sets, it is important to ensure lean data collection processes, as detecting all errors afterwards will be challenging.
  • In case of disagreement between labelers, or between labelers and ML algorithms, have experts assign the right label, or filter out observations where there is not enough information to make a decision.
  • In case the granularity of the labels is too fine and hard for ML models to separate, one strategy is to merge them (e.g., deep and shallow scratch into a single class).
  • It is more important to have good labeling instructions than to collect many label versions for a particular sample and then vote for the most common.
Having a baseline and a target is essential to measure the value of ML; one key reference is HLP (human-level performance):
  • HLP is normally close to the irreducible error we can aim for with the model, and it should not be used as the baseline: in many cases it is the ceiling rather than the floor.
  • It is very important to understand the extent of agreement on, and the clarity of, y, the target variable, in order to properly measure HLP (see the agreement sketch after this list). Beating or reaching HLP on a blurry definition of y is arbitrary and tells little about the value given to the user.
  • The ground truth used to define HLP should be based on objective labels, ideally neutral measurements, and not on a particular individual's input (a manual diagnosis being inferior to a biopsy, perceived exertion versus metabolic indicators, an individual's estimate versus an actual measurement of the soil).
  • A rather low HLP normally means the definition of y is ambiguous.
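One simple way to quantify the labeler agreement mentioned above is Cohen's kappa; a minimal sketch with hypothetical labels from two labelers (a low kappa hints that the definition of y is ambiguous):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels assigned by two labelers to the same 10 samples.
labeler_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
labeler_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(labeler_a, labeler_b)
print(f"Cohen's kappa: {kappa:.2f}")  # ~1.0 = strong agreement, ~0.0 = chance level
```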
Having a clear strategy when gathering data is essential to ensure good model performance and minimize the cost of provisioning:

  • Ensure data gathering is done in small batches so that the iterative loop of training a model and doing error analysis can start as soon as possible, ideally in the first week of work
  • It is important to make an inventory of existing or planned data sources, and to estimate the amount of data required as well as the cost and time derived from it
  • Consider also the quality, privacy, and regulatory constraints

Good Data Pipelines are essential for replicability and stability in the ML process:
  • for the POC phase, this is not as essential, as long as everything is properly documented in the code, notebook, or repo; the goal here is to test whether the process is worth bringing to production
  • It is essential to document where the data comes from and which steps it goes through before being passed to the model, the so-called provenance and lineage
  • Having clear requirements for the metadata is also important, even if it is not used directly by the model, as it can help to trace the sources of data and errors later
  • It is very important to balance the train, dev, and test sets used to assess the performance of the model. This is particularly important for small samples or when some classes are underrepresented (for example, with only 20% positive samples, it is important that this proportion holds across all sets; see the stratified-split sketch below).
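A minimal sketch of such a balanced split using scikit-learn's stratify option, on synthetic data with a 20% positive class (the split sizes are hypothetical choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 200 + [0] * 800)  # 20% positive samples

# Carve out the test set, then split the rest into train and dev,
# stratifying on y at each step so the 20/80 ratio holds everywhere.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(y_train.mean(), y_dev.mean(), y_test.mean())  # each ≈ 0.20
```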
There are a lot of things that can be done, but not all are equally important or urgent, and because of that, we need to prioritize and limit the scope of work:

  • First, translate the business problem to solve into concrete data questions: what should I measure? how consistent is the data I have collected? what impact can I measure with that data? what is the goal of the application (get more financing, increase sales, reduce costs, increase transparency...)?
  • Once the business problems are clearly stated, it has to be confirmed that machine learning can solve them, or contribute significantly to improving the indicators related to those problems
  • Assess the feasibility in terms of time, budget, and resources...
  • The feasibility estimate should be based either on valid experience or on external research: literature, case studies, blogs...
  • HLP indicators are a good initial target for calculating expected value, although in many applications the goal will be to improve on HLP, and in some cases it will be very hard to reach HLP (as with self-driving cars). A useful question: can humans with the same data perform as desired?
  • An inventory of the available predictive features is important to confirm that the expected performance can be reached
  • When defining metrics, it is important to reach a consensus: avoid ML KPIs that are valid but too far removed from business impact, and business metrics that depend on many factors beyond what the actual ML engine can influence or measure
  • Consider the net social and environmental value of the project, and dedicate your time to projects that clearly help society as a whole, not just a few
  • Note down key specifications:
    • metrics
    • software performance and availability
    • business metrics and expectations
    • resources needed: computational, personnel, stakeholder involvement, data...




Interesting notebooks and links to:


