
FastAI Deep Learning Journey Part 10: Tabular modeling, and the rise of categorical embeddings



Tabular data is the most common learning problem in industry: whether you are trying to predict churn, build a recommender system, or classify consumers, you are likely going to use as input a matrix where columns are variables and rows are observations.

In these problems, we normally work with both categorical and continuous variables, and there is a very common challenge that is overlooked in most machine learning courses and also among practitioners. I will phrase it as a question:

If models require inputs in the form of numbers, and continuous variables tend to help the learning process more and overfit less than discrete levels, how can we map categorical variables into a continuous space?

The answer to this essential question can be found in the fastAI course or in the paper Entity Embeddings of Categorical Variables. In this post I will summarize the key takeaways from the fastAI course and the paper, as well as confirm that the results generalize well to the house price data set I chose from the Kaggle Housing Price Competition.

As I have spent quite some time commenting the code and curating the text of the notebook, I encourage you to go to the notebook to see how it is implemented; here I will cover only the main ideas and results. MyCode

Idea 1: for tabular data you do not need to know all available models; for supervised modeling, random forests will get you close to the state of the art

I was surprised and relieved that this course nicely covers the whole tabular data topic in a single lesson, making a point that is in line with my experience, both in industry and in most competitions I have participated in. We may not like the "randomness" of random forests, but they work very well and there are a lot of interpretable features we can extract, with little compromise on performance with respect to other methods.

There are two conditions where one should consider deep learning instead of random forests:
  1. When using categorical variables with many levels or high cardinality
  2. When using raw text, images and unstructured data sources
In the course data set and in many competitions, the data is mostly fine for random forests, but it contains a little bit of 1 and 2. In the next idea I will cover how to overcome condition 1 and still use random forests, for people who want to keep the following capabilities (supported by random forests but not by neural nets; a code sketch follows the list):

  • Detection of the most important variables
  • Clear impact of each variable on the prediction (partial dependence and waterfall plots)
  • Uncertainty of the prediction (using the standard deviation of the predictions across the individual trees)
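
As a minimal sketch (not from the original notebook), here is how these three diagnostics can be pulled out of a scikit-learn RandomForestRegressor; the frame names xs, valid_xs and y are assumed placeholders for the training features, validation features and target.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

# Assumes xs / valid_xs are numeric DataFrames and y is the training target.
m = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, n_jobs=-1)
m.fit(xs, y)

# 1. Most important variables
importances = pd.Series(m.feature_importances_, index=xs.columns).sort_values(ascending=False)
print(importances.head(10))

# 2. Impact of a single variable on the prediction (partial dependence)
PartialDependenceDisplay.from_estimator(m, valid_xs, [importances.index[0]])

# 3. Uncertainty of each prediction: standard deviation across the individual trees
tree_preds = np.stack([t.predict(valid_xs.values) for t in m.estimators_])
pred_std = tree_preds.std(axis=0)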


Idea 2: to make the best out of high-cardinality variables, we can use neural networks that concatenate continuous variables with categorical embeddings, learning a continuous representation of each category. This benefits both performance and the interpretability of the categorical variables.

In the original paper Entity Embeddings of Categorical Variables, in the fastai course, and in my own experience with the housing data set, using neural nets on tabular data that contains both continuous and categorical variables provides a performance boost over random forests. In addition, after some preprocessing, one can extract interpretable features showing how the levels of the categorical variables relate to each other, allowing for extrapolation outside the training domain. The following example comes from the Rossmann sales data set competition used in the paper.

[Figure: learned category embeddings from the Rossmann sales data set, as shown in the entity embeddings paper]
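
As a minimal sketch of the setup (assuming df is the housing DataFrame, cat_names / cont_names the column splits, splits the train/validation indices, and SalePrice the target; the layer sizes are illustrative, not taken from my notebook), a tabular learner with entity embeddings can be built with fastai like this:

from fastai.tabular.all import *

# Assumes `df`, `cat_names`, `cont_names` and `splits` are already defined.
to_nn = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                      cat_names=cat_names, cont_names=cont_names,
                      y_names='SalePrice', splits=splits)
dls = to_nn.dataloaders(bs=64)

# fastai creates one embedding layer per categorical variable and concatenates the
# embedding outputs with the normalized continuous inputs before the fully connected layers.
learn = tabular_learner(dls, layers=[200, 100], metrics=rmse)
learn.fit_one_cycle(5, 1e-2)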
Idea 3: if one wants to keep the interpretability of random forests, but with a proper continuous encoding of categorical variables, one can use neural nets as a generic feature engineering step.


This is very important. Instead of performing one-hot encoding or aggregations, one can first train a neural net on both the categorical and continuous variables, extract the embeddings of the categorical variables, and use them in a random forest.

Here is the code to do that:

import pandas as pd
import torch
from fastai.tabular.all import *   # provides the `tensor` helper used below

def embed_features(learner, xs):
    # Replace each categorical column in `xs` with the columns of its learned embedding.
    xs = xs.copy()
    for i, feature in enumerate(learner.dls.cat_names):
        # Embedding layer for this categorical variable (move the model to the CPU
        # first with learn.model.cpu() if it was trained on a GPU).
        emb = learner.model.embeds[i]
        new_feat = pd.DataFrame(
            emb(tensor(xs[feature], dtype=torch.int64)).detach().numpy(),
            index=xs.index,
            columns=[f'{feature}_{j}' for j in range(emb.embedding_dim)])
        xs.drop(columns=feature, inplace=True)
        xs = xs.join(new_feat)
    return xs


# Swap the integer-coded categorical columns for their embedding columns,
# then train and evaluate the random forest on the enriched feature set.
emb_xs = embed_features(learn, to_nn.train.xs)
emb_valid_xs = embed_features(learn, to_nn.valid.xs)

m = rf(emb_xs, y)
m_rmse(m, emb_valid_xs, valid_y)
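
Note that the same trained embedding layers are applied to both the training and the validation frames, so the random forest sees a consistent continuous representation of each category in both sets.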


Idea 4: random forests and neural nets are likely to be wrong in different ways, so it is good to ensemble the models to maximize overall performance (if that is the primary goal).


We can simply average the predictions of the best random forest we can get with those of our tabular deep learner, and we will likely get better overall results than with either one separately (a sketch of this averaging step follows the results below). Here is the summary of the model results on the Housing Price Competition:

  1. Random Forest without categorical embeddings (RMSE): 27.821
  2. Tabular Learner (NN) (RMSE): 25.690
  3. Random Forest with categorical embeddings (RMSE): 27.336
  4. Ensemble of 1 and 2 (RMSE): 24.267
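
For reference, a minimal sketch of the averaging behind result 4, assuming m is the fitted random forest from result 1, learn the tabular learner, and valid_xs the random forest's validation frame, row-aligned with the learner's validation set:

rf_preds = m.predict(valid_xs)                      # random-forest predictions on the validation set
nn_preds = learn.get_preds()[0].squeeze().numpy()   # neural-net predictions on the (aligned) validation set

ens_preds = (rf_preds + nn_preds) / 2               # simple average of the two models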

Conclusion

Supervised tabular problems are probably the most common among practitioners, and two methods deliver most of the benefit: Random Forests and Neural Nets.

  • The first is robust to overfitting, accurate, and highly interpretable, but it cannot extrapolate well to out-of-domain data such as time series.
  • The second is normally the most accurate and the best at extrapolating, but at the expense of interpretability and with greater sensitivity to hyperparameters.

For those seeking to maximize a metric related to predictive power, an ensemble of both seems like the best idea.

For those looking for a balance between accuracy and interpretability, a random forest with categorical embeddings is your best option.

I found these ideas very powerful, not only for their practical implications, but because they hold across hundreds of data sets while remaining overlooked by many.



