
Why the present and future of image representation is self-supervised (Part 1)

It is fair to say that there are two main challenges in applying computer vision, both in industry and in research groups. The first is the complexity of the topic and of the methods used so far. The second is the immense amount of labeled data needed to get sufficiently good results, particularly when transfer learning is not an option.

While the first problem has multiple solutions, for many the second has a much less clear roadmap. How can we properly extract features and maximize information retrieval from an image if we do not have labels? How can we ensure we are building a generalized and human-aligned encoding of an image if we have no labels to "bias" or direct the learning of our neural net? In the following posts I focus on how we can create high-quality image representations that are useful, unbiased and generalizable, and that do not require an insane amount of training data or GPU compute. Let's get started.


Key definitions and concepts before we get started

It is important that we first establish a common ground of definitions and concepts, so we are clear about what we are talking about and what we are pursuing. In the previous posts about convolutional neural nets I showed how the last layers of the network can be used as features to classify an object and to compute the similarity between two images. The reason we can do that is that the last layers of a network such as resnet34, or of another network we have built, contain a condensed (lower-dimensional) representation of high levels of abstraction of the image (which object is in the image, the coloring, the contrast, the main silhouettes, the background), and that representation can be used for downstream classification or regression tasks.
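To make this concrete, here is a minimal sketch (not the exact code from this project) of pulling the penultimate layer of a torchvision resnet34 and comparing two images with cosine similarity. The file names are placeholders, and the snippet assumes torchvision >= 0.13 for the weights argument.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    model = models.resnet34(weights="IMAGENET1K_V1")
    model.fc = torch.nn.Identity()   # drop the classification head, keep the 512-d features
    model.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def embed(path):
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            return model(x).squeeze(0)

    # "shoe_a.jpg" / "shoe_b.jpg" are placeholder file names
    emb_a, emb_b = embed("shoe_a.jpg"), embed("shoe_b.jpg")
    similarity = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)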

If we already have a clear problem at hand, for example predicting product sales based on article features or detecting a tumor in a medical image, we probably want to label hundreds of thousands of those images so the model can find the aspects of the image that help us solve the problem. In this setting we could say the neural network's learning is supervised and biased toward a KPI or a goal (product sales or tumor classification in our examples). Assuming you have defined the right problem, have sufficient time and resources to get enough data, and use the proper architecture, you will get very good results, probably at or above human level.

But what happens when you do not have labels, or the distribution of labels changes too often, or it is not even clear yet what the downstream task will be? Those cases are very common, as budgets are limited, scope changes and the focus of a project can shift rapidly.

Based on my limited knowledge, these are the most likely options to get embeddings without labels that could work for any downstream task:
  1. Use a pretrained model directly for your downstream tasks (pretrained models)
  2. Develop an autoencoder that compresses as much information as possible from the image into a lower-dimensional vector (unsupervised embeddings)
  3. Develop a model where you can automatically create positive and negative samples of your data and, using a loss function like the triplet loss (sketched just below), pull out the last layers of the network as embeddings (self-supervised embeddings)
We explore each option in detail in the following sections.
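As a concrete illustration of option 3's main ingredient, here is a minimal, hypothetical sketch of the triplet loss using PyTorch's built-in TripletMarginLoss. The tensors are random placeholders standing in for the output of an encoder.

    import torch
    import torch.nn as nn

    triplet = nn.TripletMarginLoss(margin=1.0)

    # Random placeholders standing in for embeddings produced by an encoder:
    anchor   = torch.randn(32, 128)   # 32 images, 128-d embeddings
    positive = torch.randn(32, 128)   # augmented views of the same 32 images
    negative = torch.randn(32, 128)   # embeddings of 32 different images

    # Pushes anchor-positive distances below anchor-negative distances by at least the margin.
    loss = triplet(anchor, positive, negative)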

Using pretrained models

Using and testing pretrained models on our downstream tasks should not be a problem. We need to download the model object, pass our images through it (preprocessed as the model expects), and extract the embeddings so we can use them later for our regression or classification task.
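Once the embeddings are extracted with a frozen pre-trained model (as in the resnet34 snippet earlier), the downstream step can be as simple as fitting an off-the-shelf classifier on top of them. A hedged sketch, with random placeholder data instead of real features and labels:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    embeddings = np.random.rand(1000, 512)      # placeholder: 1000 images, 512-d features
    labels     = np.random.randint(0, 2, 1000)  # placeholder: binary downstream labels

    X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, test_size=0.2)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))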

Before you jump for joy, please consider the following caveats. First of all, pre-trained models are not necessarily trained on data similar to the data in your problem. The context (single object versus multiple objects, background noise...) can be different, and the bigger the gap between the two data sets, the more likely performance will be poor. This is why most state-of-the-art models that use transfer learning end up adding layers or updating multiple layers based on their own data.

The most common data set used in pre-trained models is ImageNet, which contains ~20k classes. Check whether your industry or topic is covered and, if not, search a bit for others; as the archive of pretrained models and open data sets is growing, you may find one that works for your problem. At the time of writing, it seems that for many sectors, such as health or the environment, there are not as many diverse pre-trained models to be leveraged openly as we would all wish.

Last but not least, even if you share the same domain and image types, it could also be that the classes labelled in the pre-trained model's data are not in line with what you have in mind (for example, a data set of people labelled by age may not be the best one to predict whether they smile or not).

Without labels for at least some of your data, pre-trained models are unlikely to be the solution to your problem, as little customization can be done for your images and task.


Using autoencoders or any of their variations

Luckily, not having labels is not a problem for autoencoders. These models, heavily used to compress image information for efficient processing in web applications and phone apps, can be leveraged to create an embedding space that is unsupervised in its construction.
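For illustration, here is a minimal, hypothetical convolutional autoencoder in PyTorch. The layer sizes, the 64x64 input and the 64-dimensional bottleneck are arbitrary choices for the sketch; the bottleneck vector is what would be reused as the unsupervised embedding.

    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, latent_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
                nn.Flatten(),
                nn.Linear(64 * 16 * 16, latent_dim),                   # the embedding
            )
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
                nn.Unflatten(1, (64, 16, 16)),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            z = self.encoder(x)          # embedding for downstream tasks
            return self.decoder(z), z

    model = AutoEncoder()
    images = torch.rand(8, 3, 64, 64)             # placeholder batch
    recon, embeddings = model(images)
    loss = nn.functional.mse_loss(recon, images)  # reconstruction objective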

While they are a great starting point, there are some problems, particularly if we are interested in assessing object similarity between two images. Our limited research suggests the following:

  • AE/VAE/DVAE models tend not to change the embeddings sufficiently for aspects that are infrequent in the data (in our case, big logos), and therefore produce embeddings that are too similar for articles with different design elements.
  • They tend to put too much weight on changes in a limited area of the object (in our data, small changes in the inserts changed the whole embedding encoding by ~50%).
The papers we reviewed point out that self-supervised learning does not depend as much on the distributional constraints of models like the AE, VAE or disentangled VAE, and is therefore more robust in how it encodes information about the objects in an image. For a deep dive please go to:


Bear in mind that comparisons of methods from a theoretical perspective are rare, and many academics keep relying on empirical tests to validate which framework is most robust. The following paper shows why downstream-task testing is the best we have so far...


It is therefore fair to say that, while the theoretical and empirical exercises are in line with our experience, we cannot state that this method is always flawed for image feature extraction or object similarity.


Self-supervised learning: SimCLR and MoCo

You may guess that the last option is our preferred one, and you are right. But please let me convince you : ) . Since neither waiting for a state-of-the-art pre-trained model exactly fitting our problem, nor using a model such as the VAE, absent from current benchmarks and with theoretical and empirical caveats, was to our taste, we searched for self-supervised image representation frameworks.

As a starting point we found SimCLR. The idea is simple. Take your unlabeled data set and, for each image, create augmentations of it. The model's goal is to find a vector space where the similarity between an image and its augmentations is high, while being low in comparison to other images that are not too easy to tell apart (for example, comparing blue running shoes with other running shoes, rather than running shoes with basketball shoes).
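To make this concrete, below is a simplified sketch of the contrastive (NT-Xent) loss that SimCLR optimizes. It is not the official implementation, and the embedding tensors are random placeholders standing in for the output of the encoder and projection head.

    import torch
    import torch.nn.functional as F

    def nt_xent_loss(z1, z2, temperature=0.5):
        """z1, z2: (N, D) embeddings of two augmented views of the same N images."""
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        z = torch.cat([z1, z2], dim=0)                 # (2N, D)
        sim = z @ z.t() / temperature                  # (2N, 2N) scaled cosine similarities
        sim.fill_diagonal_(float("-inf"))              # an image is not its own negative
        n = z1.size(0)
        # the positive for row i is its other augmented view: i+N for i<N, i-N otherwise
        targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
        return F.cross_entropy(sim, targets)

    # Placeholder embeddings for a batch of 32 images with two views each
    z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
    loss = nt_xent_loss(z1, z2)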

The model achieves state-of-the-art self-supervised performance on ImageNet, coming very close to a supervised ResNet-50 [1], and its framework can be leveraged on any data set, assuming a sufficiently large amount of images and computing power is available. Having enough images without labels is not really a problem, given the vast amount of augmentations available, but the computational cost and time required for such a model left us in despair.

Luckily, we are not the only ones, and Facebook AI developed the MoCo framework [2], which allows us to build an architecture similar to SimCLR; by leveraging dynamic image dictionaries and the learning trick called "momentum", we can massively reduce the computational burden, the number of GPUs and the batch size, and therefore experiment widely on a moderate budget [3].
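As a rough illustration only, loosely following the pseudo-code in [2] and with placeholder sizes, the two ingredients look like this: a key encoder updated purely by momentum, and a fixed-size queue of past keys acting as the dynamic dictionary of negatives.

    import copy
    import torch
    import torchvision.models as models

    encoder_q = models.resnet50()         # query encoder, trained by backprop
    encoder_k = copy.deepcopy(encoder_q)  # key encoder, updated by momentum only
    for p in encoder_k.parameters():
        p.requires_grad = False

    m = 0.999                             # momentum coefficient
    queue = torch.randn(65536, 1000)      # dictionary of past keys (placeholder sizes)

    @torch.no_grad()
    def momentum_update():
        # the key encoder slowly follows the query encoder: k = m*k + (1-m)*q
        for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
            pk.data = m * pk.data + (1.0 - m) * pq.data

    @torch.no_grad()
    def enqueue(new_keys):
        # drop the oldest keys and append the newest batch (simplified ring buffer)
        global queue
        queue = torch.cat([queue[new_keys.size(0):], new_keys], dim=0)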

In the next post I will deep dive into SimCLR and why it is so computationally expensive, so that later we can deep dive into the MoCo architecture, its results and the remaining challenges. As our applications require unbiased, interpretable and lower-dimensional image representations, we will need to work a little on the layer we pool from MoCo (~1000-dimensional) to use it as a feature for our downstream tasks. This post series will finish by explaining how we plan to solve that.


[1] https://arxiv.org/pdf/2002.05709.pdf

[2] https://arxiv.org/pdf/1911.05722v3.pdf

[3] https://arxiv.org/pdf/2003.04297v1.pdf










