Computer Vision: Visual Representation with Sketches (Part 3) Pix2PixHD

When the images one is working with are of high resolution, or when the application allows very little margin of error (autonomous driving, medical analysis), it makes sense to consider high-resolution image translation. This post aims to summarize the following paper:

arXiv:1711.11585 (arxiv.org)


High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs 

Conditional GANs have proven to provide fairly good translations from sketches to images, but their applications have been limited to lower-resolution images. The following paper develops an algorithm that performs high-resolution image translation at 2048 × 1024 with very realistic results.

Introduction

Creating realistic representations of the world is computationally quite expensive if every dimension and detail has to be modeled explicitly. It becomes necessary to find lightweight approaches that can render reality realistically from a simple abstraction as input.

The paper uses semantic label maps as inputs and creates high-definition images.




The method addresses the main issues of conditional GANs: the difficulty of generating high-resolution images and the lack of detail and realistic textures. To do so, it changes the adversarial learning objective and introduces new multi-scale generator and discriminator architectures.

An interesting feature of this network is that it allows changing certain labels or objects of the input, which changes the generated image accordingly. That could be an interesting application for product design.




Related Work

Generative adversarial networks model the natural image distribution by forcing the generated samples to be indistinguishable from natural images. Inspired by their success, this paper proposes new coarse-to-fine generator and multi-scale discriminator architectures suitable for conditional image generation at a much higher resolution.

In image-to-image translation, the goal is to translate an input image from one domain (sketches, label maps, segmentation maps) to another, given input-output image pairs as training data. The adversarial loss has been preferred over the L1 loss, which often leads to blurry images. The reason is that the discriminator can learn a trainable loss function and automatically adapt to the differences between the generated and real images in the target domain.

Conditional GANs struggle to generate high-resolution images due to training instability and optimization issues. Changes in the loss function can overcome this problem and provide high-resolution generated images.

There have been some successes in allowing users to interact with image creation, but existing models either do not allow a clear disentanglement of objects (style transfer) or do not generate high-resolution images. The proposed framework overcomes these two limitations.

Instance-Level Image Synthesis

The framework uses the pix2pix model, explained in Part 2 of this post series, as its baseline. We will jump right into how it increases realism and resolution.

We decompose the generator into two sub-networks: G1 and G2. We term G1 as the global generator network and G2 as the local enhancer network. The generator is then given by the tuple G = {G1, G2} as visualized in Fig. 3. The global generator network operates at a resolution of 1024 × 512, and the local enhancer network outputs an image with a resolution that is 4× the output size of the previous one (2× along each image dimension). 
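
To make the coarse-to-fine design concrete, here is a minimal PyTorch sketch of the two-scale generator. The real G1 and G2 use convolutional front-ends, residual blocks, and transposed-convolution back-ends; the simple layers below are stand-ins that only illustrate the key structural idea, the element-wise sum of the enhancer's downsampled features with G1's last feature map. Layer sizes and names are my own.

```python
import torch.nn as nn
import torch.nn.functional as F

class GlobalGenerator(nn.Module):
    """G1: operates on the 2x-downsampled input (e.g. 1024 x 512)."""
    def __init__(self, in_ch, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(                      # stand-in for G1's front-end,
            nn.Conv2d(in_ch, feat_ch, 7, padding=3),   # residual blocks and back-end
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
        )
        self.to_rgb = nn.Conv2d(feat_ch, 3, 7, padding=3)

    def forward(self, x):
        feat = self.net(x)
        return self.to_rgb(feat), feat                 # keep features for fusion with G2

class LocalEnhancer(nn.Module):
    """G2: full-resolution enhancer fused with G1's last feature map."""
    def __init__(self, in_ch, feat_ch=64):
        super().__init__()
        self.g1 = GlobalGenerator(in_ch, feat_ch)
        self.down = nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(feat_ch, feat_ch, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, 3, 7, padding=3),
            nn.Tanh(),
        )

    def forward(self, label_map):
        x_low = F.avg_pool2d(label_map, 3, stride=2, padding=1)  # half-resolution input for G1
        _, g1_feat = self.g1(x_low)
        fused = self.down(label_map) + g1_feat                   # element-wise sum, as in the paper
        return self.up(fused)
```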


During training, we first train the global generator and then the local enhancer, in the order of their resolutions. We then jointly fine-tune all the networks together. This generator design effectively aggregates global and local information for the image synthesis task. Such a multi-resolution pipeline is a well-established practice in computer vision, and two scales are often enough.

Multi-scale discriminators High-resolution image synthesis poses a significant challenge to the GAN discriminator design. To differentiate high-resolution real and synthesized images, the discriminator needs to have a large receptive field. This would require either a deeper network or larger convolutional kernels, both of which would increase the network capacity and potentially cause overfitting. Also, both choices demand a larger memory footprint for training, which is already a scarce resource for high resolution image generation. To address the issue, we propose using multi-scale discriminators. We use 3 discriminators that have an identical network structure but operate at different image scales.

The discriminators D1, D2 and D3 are then trained to differentiate real and synthesized images at the 3 different scales, respectively. Although the discriminators have an identical architecture, the one that operates at the coarsest scale has the largest receptive field. It has a more global view of the image and can guide the generator to generate globally consistent images. On the other hand, the discriminator at the finest scale encourages the generator to produce finer details. This also makes training the coarse-to-fine generator easier, since extending a low resolution model to a higher resolution only requires adding a discriminator at the finest level, rather than retraining it from scratch.
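
As a sketch of how this can be implemented, the three identical discriminators can be applied to an image pyramid built with average pooling. The PatchGAN-style body below is a simplified stand-in for the paper's discriminator, not its exact architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

def make_patch_d(in_ch, feat=64):
    """A small PatchGAN-style body (a stand-in, not the paper's exact D)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, feat, 4, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(feat, feat * 2, 4, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(feat * 2, 1, 4, padding=1),   # per-patch real/fake logits
    )

class MultiScaleDiscriminator(nn.Module):
    """Three discriminators with identical structure, one per image scale."""
    def __init__(self, in_ch, num_scales=3):
        super().__init__()
        self.nets = nn.ModuleList([make_patch_d(in_ch) for _ in range(num_scales)])

    def forward(self, x):
        outs = []
        for net in self.nets:
            outs.append(net(x))                          # judge the current scale
            x = F.avg_pool2d(x, 3, stride=2, padding=1)  # downsample for the next one
        return outs                                      # [D1(x), D2(x/2), D3(x/4)]
```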

We improve the GAN loss by incorporating a feature matching loss based on the discriminator. This loss stabilizes the training, as the generator has to produce natural statistics at multiple scales. Specifically, we extract features from multiple layers of the discriminator and learn to match these intermediate representations between the real and the synthesized image. Our GAN discriminator feature matching loss is related to the perceptual loss, which has been shown to be useful for image super-resolution and style transfer. Our full objective combines both the GAN loss and the feature matching loss.
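
A hedged sketch of the feature matching term, assuming the discriminators are modified to return their intermediate activations (how the features are exposed, and the weight between the two loss terms, are implementation choices):

```python
import torch.nn.functional as F

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between intermediate discriminator activations on the real
    and on the synthesized image, averaged over the matched layers."""
    loss = 0.0
    for f_real, f_fake in zip(feats_real, feats_fake):
        loss = loss + F.l1_loss(f_fake, f_real.detach())  # no gradient through the real path
    return loss / len(feats_real)

# Full objective, schematically:
#   min_G max_{D1,D2,D3}  sum_k L_GAN(G, D_k)  +  lambda * sum_k L_FM(G, D_k)
```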

For high-resolution generation we need both semantic segmentation maps and object boundaries. The reason is that when objects of the same class are next to one another, the semantic label map alone cannot tell them apart. To extract this information, we first compute the instance boundary map (Fig. 4b). The instance boundary map is then concatenated with the one-hot vector representation of the semantic label map and fed into the generator network. Similarly, the input to the discriminator is the channel-wise concatenation of the instance boundary map, the semantic label map, and the real/synthesized image. Figure 5b shows an example demonstrating the improvement obtained by using object boundaries.
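
The boundary map itself is simple to compute: a pixel is marked as boundary if its instance ID differs from any of its four neighbours. A minimal NumPy sketch (the one_hot helper in the final comment is assumed, not shown):

```python
import numpy as np

def boundary_map(inst):
    """1 where a pixel's instance ID differs from any 4-neighbour, 0 elsewhere.
    inst: (H, W) integer instance map."""
    edge = np.zeros(inst.shape, dtype=np.float32)
    edge[:, 1:]  += inst[:, 1:]  != inst[:, :-1]   # left neighbour
    edge[:, :-1] += inst[:, :-1] != inst[:, 1:]    # right neighbour
    edge[1:, :]  += inst[1:, :]  != inst[:-1, :]   # top neighbour
    edge[:-1, :] += inst[:-1, :] != inst[1:, :]    # bottom neighbour
    return (edge > 0).astype(np.float32)

# generator input: channel-wise concatenation of the one-hot label map
# and the boundary map, e.g.
#   x = np.concatenate([one_hot(label_map), boundary_map(inst)[None]], axis=0)
```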




To facilitate manipulation of the generator at the object level, it is proposed to add low-dimensional feature channels as input to the generator network. Manipulating these features gives flexible control over the image synthesis process. Furthermore, since the feature channels are continuous quantities, the model is, in principle, capable of generating infinitely many images. To ensure the features are consistent within each instance, an instance-wise average pooling layer is added to the output of the encoder to compute the average feature for each object instance.
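
A small sketch of what instance-wise average pooling does, given a (C, H, W) feature map from the encoder and an (H, W) integer instance map (a plain loop over instances for clarity):

```python
import torch

def instance_avg_pool(feat, inst):
    """Replace each feature vector by the mean over its object instance,
    so the encoded features are uniform within every instance.
    feat: (C, H, W) float tensor, inst: (H, W) integer instance map."""
    out = torch.zeros_like(feat)
    for inst_id in inst.unique():
        mask = inst == inst_id                            # (H, W) boolean mask
        region = feat[:, mask]                            # (C, N) features of this instance
        out[:, mask] = region.mean(dim=1, keepdim=True)   # broadcast the mean back
    return out
```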






The encoder is jointly trained with the generators and discriminators. After the encoder is trained, we run it on all instances in the training images and record the obtained features. Then we perform K-means clustering on these features for each semantic category. Each cluster thus encodes the features of a specific style, for example, the asphalt or cobblestone texture of a road. At inference time, we randomly pick one of the cluster centers and use it as the encoded features. These features are concatenated with the label map and used as the input to our generator.
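
A sketch of this clustering step with scikit-learn, assuming the per-class features have already been recorded as described (the dictionary layout and the cluster count are my own choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_style_clusters(feats_per_class, k=10):
    """feats_per_class: {class_id: (num_instances, C) array of recorded
    encoder features}. Returns one set of cluster centers per class."""
    centers = {}
    for cls, feats in feats_per_class.items():
        km = KMeans(n_clusters=min(k, len(feats)), n_init=10).fit(feats)
        centers[cls] = km.cluster_centers_   # each row is one "style", e.g. a road texture
    return centers

def sample_style(centers, cls, rng=np.random):
    """At inference time, pick a random cluster center as the encoded feature."""
    return centers[cls][rng.randint(len(centers[cls]))]
```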

Results

To quantify the quality of the results, we perform semantic segmentation on the synthesized images and compare how well the predicted segments match the input. The intuition is that if we can produce realistic images that correspond to the input label map, an off-the-shelf semantic segmentation model should be able to predict the ground-truth labels. On both pixel-wise accuracy and mean intersection-over-union (IoU), the method outperforms the other methods by a large margin. Moreover, it is very close to the result obtained on the original images, the theoretical "upper bound" of the realism we can achieve.
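
For reference, this is how the two segmentation scores can be computed from a predicted and a ground-truth label map (a plain confusion-matrix implementation, not the paper's evaluation code):

```python
import numpy as np

def segmentation_scores(pred, gt, num_classes):
    """Pixel accuracy and mean IoU for (H, W) integer label maps."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)  # confusion matrix: rows = gt, cols = pred
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0                             # skip classes absent from both maps
    pixel_acc = tp.sum() / conf.sum()
    mean_iou = (tp[present] / union[present]).mean()
    return pixel_acc, mean_iou
```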





Subjective studies have also been performed, in which the pix2pixHD model fools the human raters more often than the other methods do.

The reader is encouraged to look at Figures 10-13 of the paper for quite impressive generated images.

All in all, this approach is shown to be better than the pix2pix baseline for high-resolution image creation and translation. What remains to be compared is the computational burden relative to the baseline.









