Computer Vision: Visual Representation with Sketches (Part 2) Pix2Pix Paper summary

This post summarizes the pix2pix paper, a potential solution for creating valid representations from sketches that are comparable to those obtained with images as input.

1611.07004.pdf (arxiv.org)

Image-to-Image Translation with Conditional Adversarial Networks

Conditional adversarial networks are a general-purpose solution to image-to-image translation. They allow reconstructing objects from sketches or edge maps and colorizing images, among many other tasks. The generic architecture of conditional adversarial networks seems to work well on a wide set of image translation problems.

Introduction

The method converts an input image of one type into an output image of the desired type, much as language models use a single architecture for translation across multiple languages and for other tasks.

The main challenge with such image tasks is the definition of the loss function, which is not trivial for a given problem and has a strong effect on the results obtained. Instead of using CNNs and searching for the optimal loss function for our problem, we can use GANs and frame it simply as: "I want to create a representation from the input that is as similar as possible to the target image".

GANs learn a loss that tries to classify whether the output image is real or fake, while simultaneously training a generative model to minimize this loss. Blurry images are not tolerated, since they look obviously fake. Because GANs learn a loss that adapts to the data, they can be applied to a multitude of tasks that traditionally would require very different kinds of loss functions.

This paper shows how conditional GANs can be a general-purpose solution for image-to-image translation. A wide variety of problems have been tested to show how general this architecture is. The pix2pix framework is used to evaluate this approach on several datasets and tasks.

Related Work

Unlike earlier approaches that treat each output pixel as conditionally independent of all the others given the input, conditional GANs learn a structured loss that can penalize any structure that differs between the output and the target.

The model discussed in the paper also differs from previous work in its architectural choices: the generator is based on a U-Net, and the discriminator is a PatchGAN classifier, which penalizes structure only at the scale of image patches.

Method

GANs are generative models that learn a mapping from a random noise vector z to an output image y. Conditional GANs learn a mapping from an observed image x and a random noise vector z to y. The generator G is trained to produce outputs that cannot be distinguished from real images by an adversarially trained discriminator D, which is trained to detect the generator's "fakes".

There is a tension in the loss function between the generator (trying to fool the discriminator) and the discriminator (trying to catch the generator's fake creations). In addition, an L1 distance to the target image is added to the generator objective and minimized along with it; L1 is preferred over L2 because it encourages less blurring.
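The two competing terms and the added L1 term can be illustrated with a minimal sketch on toy numbers. It assumes the discriminator outputs probabilities in (0, 1) and that images are flat lists of pixel values; the function names are hypothetical, and the weight of 100 for the L1 term is treated here as a tunable assumption rather than a fixed part of the method.

```python
import math

LAMBDA_L1 = 100.0  # assumed weight of the L1 term; tune per task


def discriminator_loss(d_real, d_fake):
    """Negative of log D(x, y) + log(1 - D(x, G(x, z))); D minimizes this."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def l1_distance(target, generated):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(t - g) for t, g in zip(target, generated)) / len(target)


def generator_loss(d_fake, target, generated, lam=LAMBDA_L1):
    """Adversarial term plus weighted L1 term; G minimizes this."""
    adversarial = math.log(1.0 - d_fake)  # lower when D is fooled (d_fake near 1)
    return adversarial + lam * l1_distance(target, generated)


if __name__ == "__main__":
    target = [0.0, 1.0, 0.0, 1.0]
    sharp = [0.0, 1.0, 0.0, 1.0]   # matches the target exactly
    blurry = [0.5, 0.5, 0.5, 0.5]  # averaged-out guess
    print(generator_loss(0.5, target, sharp))
    print(generator_loss(0.5, target, blurry))
```

With the same discriminator response, the blurry guess is penalized through the L1 term, while a generator that fools the discriminator more often lowers the adversarial term.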

Since the generator tends to learn to ignore the random noise z, the paper provides noise only in the form of dropout, applied on several layers of the generator. Even so, the outputs show only minor stochasticity. How to capture the full entropy of the conditional distributions remains an open question for researchers.

The generator architecture assumes a similar structure between the input and the target image. A U-Net with skip connections is used so that low-level information shared between input and output, such as the location of prominent edges, can bypass the bottleneck instead of having to pass through every layer of the network.
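How skip connections in a U-Net carry encoder features across to the decoder can be sketched with simple shape bookkeeping. This is a toy illustration, not the paper's exact layer list: it assumes every encoder layer halves the spatial size, every decoder layer doubles it, and the decoder mirrors the encoder's channel counts before concatenating the matching encoder output. The function name and the channel values are hypothetical.

```python
def unet_shapes(size, enc_channels):
    """Return (encoder_shapes, decoder_shapes) as (channels, side_length) tuples.

    Each decoder entry is the shape after concatenating the skip connection.
    """
    encoder = []
    side = size
    for channels in enc_channels:
        side //= 2  # each encoder layer downsamples by 2
        encoder.append((channels, side))

    decoder = []
    # Walk back up from the bottleneck, pairing decoder step i with encoder layer n - i.
    for i in range(len(encoder) - 2, -1, -1):
        skip_channels, skip_side = encoder[i]
        side *= 2  # each decoder layer upsamples by 2
        assert side == skip_side, "skip connection needs matching spatial size"
        upsampled_channels = skip_channels  # assumption: decoder mirrors encoder
        decoder.append((upsampled_channels + skip_channels, side))
    return encoder, decoder


if __name__ == "__main__":
    enc, dec = unet_shapes(256, [64, 128, 256, 512])
    print("encoder:", enc)
    print("decoder (after skip concat):", dec)
```

The concatenation doubles the channel count at each decoder level, which is how edge-level detail from the early encoder layers remains available near the output.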

The PatchGAN discriminator only models high-frequency structure: it classifies each patch of the image as real or fake and averages the responses. This is sufficient because the L1 term already enforces correctness at low frequencies. Patches much smaller than the full image still give good results while using fewer parameters and covering less of the image per decision.
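The patch size seen by each discriminator output follows from the convolution stack and can be computed with standard receptive-field arithmetic. The sketch below assumes a hypothetical stack of five 4x4 convolutions with strides 2, 2, 2, 1, 1; the layer list and function names are illustrative assumptions, not a definitive description of the paper's discriminator.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of one output unit.

    `layers` is a list of (kernel_size, stride) pairs ordered from input to output.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Hypothetical patch discriminator: five 4x4 convolutions, strides 2, 2, 2, 1, 1.
PATCH_DISCRIMINATOR = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]


def patch_decision(patch_scores):
    """Average a 2-D grid of per-patch real/fake scores into one image score."""
    flat = [score for row in patch_scores for score in row]
    return sum(flat) / len(flat)


if __name__ == "__main__":
    print(receptive_field(PATCH_DISCRIMINATOR))
    print(patch_decision([[0.9, 0.8], [0.2, 0.1]]))
```

Under these assumptions each output unit judges a 70x70 region of the input, so the same small network can be slid over images of any size.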

The details of the optimization setup can be found in the paper and are skipped here.

Experiments

The following image illustrates the multiple experiments run in the paper:

It is worth noting that good results were obtained with as few as 100-400 training samples and a few hours of training on a single GPU, with inference running in well under a second on the same GPU.

It is hard to measure the quality of the generated representation automatically, without human validation. One option is to run a segmentation model on both the real and the generated images and check how realistic or useful the generation is at preserving the information.
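One simple way to compare segmentation outputs is per-pixel accuracy. The sketch below assumes the segmentation model's predictions are already available as 2-D grids of integer class ids; the function name and the toy label maps are hypothetical, and no particular segmentation model is implied.

```python
def pixel_accuracy(predicted_labels, reference_labels):
    """Fraction of pixels whose predicted class id matches the reference."""
    matches = 0
    total = 0
    for pred_row, ref_row in zip(predicted_labels, reference_labels):
        for pred, ref in zip(pred_row, ref_row):
            matches += int(pred == ref)
            total += 1
    return matches / total


if __name__ == "__main__":
    # Toy 2x2 label maps: segmentation of a generated image vs. the reference labels.
    generated_seg = [[0, 1], [1, 1]]
    reference_seg = [[0, 1], [0, 1]]
    print(pixel_accuracy(generated_seg, reference_seg))
```

A higher score suggests the generated image kept enough structure for the segmentation model to recover the same classes as in the reference.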

The following images show the importance of the loss function and the architecture for a good representation (the first image shows the importance of using L1+GAN, the second the importance of adding skip connections). Last but not least, the patch size affects the coloring, edges and overall realism of the generation.

It is remarkable that the generator was able to fool around 20% of the participants in some questionnaires. In contrast, for semantic segmentation, conditional GANs did not outperform other approaches, as shown below:

In the next post, we will explore pix2pixHD, which is interesting for our use case as we are working with high-fidelity images.

The repo of the pix2pix model can be found here:

GitHub - junyanz/pytorch-CycleGAN-and-pix2pix: Image-to-Image Translation in PyTorch

Alan Fortuny Sicart
