
Why the present and future of image representation is self supervised (Part 3)

Momentum Contrast for Unsupervised Visual Representation Learning (MoCo v2)

In the previous post, we presented SimCLR as one of the most promising self-supervised image representation methods. Since its caveat of long training runs and large batch sizes can be too much for many applications, MoCo v2 and its subsequent versions are worth exploring.

Introduction

Self-supervised methods are the norm in NLP, and are largely how its models have been built. Language tasks have been made more and more efficient through the use of dictionaries and keys in attention models, where memory is managed very efficiently over very long texts.

The continuous, high-dimensional nature of image data should not be a blocker for bringing these learnings from NLP to self-supervised tasks and representations in computer vision. MoCo is a great proof of how it can be done.

If we reframe keys and tokens as groups of images that are similar within the same query and dissimilar to other queries, we are doing exactly that.



Momentum contrast allows us to build large dictionaries that encode queries and properly model the similarity between keys. Think of the dictionary as a queue of samples that is dynamically updated. By adapting the key encoder slowly, we ensure that learning and memory persist (old key encodings do not fade away).
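A minimal sketch of that slow adaptation, assuming PyTorch; the names `encoder_q`, `encoder_k` and the momentum value `m=0.999` are illustrative, not taken from the post:

```python
import copy
import torch

def build_key_encoder(encoder_q):
    # The key encoder starts as a copy of the query encoder and is never
    # updated by gradients, only by the momentum rule below.
    encoder_k = copy.deepcopy(encoder_q)
    for p in encoder_k.parameters():
        p.requires_grad = False
    return encoder_k

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    # A large m makes the key encoder change slowly, so keys already in the
    # dictionary stay consistent with freshly encoded ones.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```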

MoCo learns representations in which augmentations of the same anchor image stay close while negative samples (other images and their augmentations) are pushed apart. Using a pretext task very similar to the one in SimCLR, we are able to get competitive results on ImageNet and also on real-world data sets like Instagram data.


Method

The contrastive learning task at hand can be understood as an image dictionary lookup task. The contrastive loss function will be low for the positive key (the same image, augmented) and high for the other keys (other images, with their augmentations).
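A minimal sketch of this lookup-style contrastive (InfoNCE) loss, assuming PyTorch and L2-normalized embeddings; the argument names and the temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k_pos, queue, temperature=0.07):
    # q:      (N, C) query embeddings
    # k_pos:  (N, C) positive key embeddings (augmented views of the same images)
    # queue:  (C, K) dictionary of K negative keys from previous batches
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)   # (N, 1)
    l_neg = torch.einsum("nc,ck->nk", q, queue)                # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature    # (N, 1+K)
    # The "correct entry" of the lookup is always index 0 (the positive key),
    # so the loss is low when q matches its positive and high otherwise.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```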

The expectation here is that a good representation comes from processing large dictionaries of negative samples, where the encoder is updated slowly as samples pass through. Older mini-batches are not stored; their contribution is kept in the encoder, which changes slowly with new queries.

This has implications for the amount of memory required: since we do not store previous samples in a memory-bank fashion, we can train on very large data sets with less memory than other dictionary management approaches.
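A minimal sketch of the dictionary as a queue, assuming PyTorch; the dimension and queue size are illustrative defaults, not values quoted in the post:

```python
import torch
import torch.nn.functional as F

class KeyQueue:
    def __init__(self, dim=128, size=65536):
        # Start with random normalized keys; they are overwritten as training runs.
        self.queue = F.normalize(torch.randn(dim, size), dim=0)
        self.ptr = 0
        self.size = size

    @torch.no_grad()
    def enqueue_dequeue(self, keys):
        # keys: (N, C) key embeddings from the current mini-batch.
        # New keys overwrite the oldest entries, so only the current batch
        # needs to live in memory; older batches persist only through the
        # slowly moving key encoder.
        n = keys.size(0)
        assert self.size % n == 0  # simplifying assumption for this sketch
        self.queue[:, self.ptr:self.ptr + n] = keys.t()
        self.ptr = (self.ptr + n) % self.size
```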




As in SimCLR, augmentations are performed as part of the algorithm and a ResNet model is used as the encoder, with a final global pooling layer followed by a projection to a 128-dimensional embedding that encodes the image in a smaller form. Avoiding information leakage during batch creation is as relevant as in SimCLR, and batch normalization is performed independently for each GPU.
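A minimal sketch of such an encoder, assuming torchvision's ResNet-50; replacing the classifier head with a 2-layer MLP projection to 128 dimensions follows the MoCo v2 variant, and the function name is illustrative:

```python
import torch.nn as nn
import torchvision.models as models

def build_encoder(feature_dim=128):
    # ResNet-50 backbone; its global average pooling feeds the final `fc` layer.
    resnet = models.resnet50()
    hidden_dim = resnet.fc.in_features          # 2048 for ResNet-50
    # Replace the classification head with a small projection head
    # that outputs the 128-dimensional embedding used for contrastive learning.
    resnet.fc = nn.Sequential(
        nn.Linear(hidden_dim, hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, feature_dim),
    )
    return resnet
```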

Results


The MoCo v2 paper provides a direct comparison with SimCLR, including memory requirements:

MoCo provides similar or better results with smaller batch sizes, fewer epochs, and lower GPU memory requirements. This is why we go with MoCo moving forward.


