Photo-realistic image rendering using standard graphics techniques requires realistic simulation of geometry and light. The algorithms which we use currently for the task are effective but expensive. If we were able to render photo-realistic images using a model learned from data, we could turn the process of graphics rendering into a model learning and inference problem. This task can be modeled as an image-to-image translation model that can be used to create vivid virtual worlds by simply training it on new datasets.

A simple example of an image-to-image translation task where the edges of an object are fed as an input to the model and the model generates the actual object.

The current state of art image-to-image translation Deep learning models are not able to generate high-resolution images because of various limitations. These include increased computational complexity for higher-dimensional images as well as a lack of ability to generalize as image size increases. 

This paper addresses the 2 main concerns of vanilla adversarial learning i.e. 1) Difficulty in the generation of high-resolution images 2) Lack of details and realistic textures in generated high-resolution pictures.

Understanding image to image translation task

Before we jump into learning more about what an image-to-image translation task is, we must first understand how machine learning problems are formulated. It is virtually impossible to come up with a Deep learning model without formally defining the task it is supposed to perform. For image generation applications, researchers have formulated the task in such a way that the model is supposed to generate an image given another image. 

This may sound daunting at first but if you abstract the entire definition of image-to-image translation as much as possible, it all boils down to a simple mathematical function.

In this image, X is the image that is input into our Deep learning model. The function F(x) that takes as input an image is nothing but our deep learning model. Furthermore, Y is the generated image which is the output of our deep learning model. Since X and Y can be completely different domains, we often say that our model maps images from one domain to the other. For example, in the image below, you can see that the input and output images are completely different. Hence we can say that they belong to completely different domains.

How are these functions/models designed?

The most popular approach to image-to-image translation problems these days is using adversarial learning. Adversarial learning involves 2 deep learning models brawling out in the open using all the data and making themselves better in the process. Standard adversarial learning consists of two networks that compete against each other. One of the networks is called a generator and the other is called a discriminator.

For image-to-image translation problems, the competition is framed in such a way that the generator tries to fool the discriminator whereas the discriminator attempts to tell apart the generated images versus real images. I will explain this with the help of an image as an example.

The input to the generator network is an image that needs to be mapped to its corresponding image of the other domain. Once the input image is fed into the generator network, we get generated images from the other end. After we have the generated images available with us, we shuffle the real and fake images and feed them into the discriminator. The discriminator then tries to tell apart the real from the fake images. 

What is Pix2Pix then?

Pix2Pix can be thought of as the function that translates/transforms/maps a given image X to its corresponding image Y. Technically, it consists of the generator as well as discriminator networks but you would often find that the literature uses Pix2Pix and its generator i.e. UNet interchangeably.

Like any other GAN, Pix2Pix consists of a discriminator network (PatchGAN classifier) and a generator network (slightly modified UNet). You can find the PyTorch implementation of Pix2Pix in this repository.


A U-Net consists of an encoder (downsampler) and decoder (upsampler). In the image below, the encoder is responsible for downsampling (reducing dimension and increasing channels) of the image whereas the upsampler performs the opposite functionality.

  • Each block in the encoder is: Convolution -> Batch normalization -> Leaky ReLU
  • Each block in the decoder is: Transposed convolution -> Batch normalization -> Dropout (applied to the first 3 blocks) -> ReLU
  • There are skip connections between the encoder and decoder (as in the U-Net).





The PatchGAN network tries to classify if each image patch is real or not real, as described in the pix2pix paper.

  • Each block in the discriminator is: Convolution -> Batch normalization -> Leaky ReLU.
  • The shape of the output after the last layer is (batch_size, 30, 30, 1).
  • Each 30 x 30 image patch of the output classifies a 70 x 70 portion of the input image.
  • The discriminator receives 2 inputs:
    • The input image and the target image, which it should classify as real.
    • The input image and the generated image (the output of the generator), which it should classify as fake.

Merits and Demerits of the standard pix2pix model

The standard pix2pix model is highly effective in performing image-to-image translation for images with smaller resolution ratios. The model was designed to perform translations across various unrelated domains. The model is highly robust and can generalize on the most difficult tasks given enough data. Furthermore, many researchers have gone ahead and created 3D models of the same to be trained on spatial data. 

However, this model has its own limitations as discussed at the starting of this blog. 

1) Difficulty in the generation of high-resolution images 2) Lack of details and realistic textures in generated high-resolution pictures. This arises due to the insufficient data existing in the lower layers of the encoder-decoder architecture. The lack of data in the encoded layers leads to problems in decoding the features. The decoder is not able to upsample the encoded features to a high-resolution image. 

The Nvidia Pix2PixHD paper addresses these papers by introducing a novel generator architecture, multi-scale discriminator architecture, and an additional loss term in the pre-existing loss function.

Contribution of Pix2PixHD

  1. Coarse-to-fine generator: The paper breaks down the current Unet into 2 different sub-networks to increase resolution: G1 and G2. At the heart of the generator lies G1 which is wrapped around by G2. The global generator network operates at a resolution of 1024 × 512, and the local enhancer network outputs an image with a resolution that is 4× the output size of the previous one (2× along each image dimension). For synthesizing images at an even higher resolution, additional local enhancer networks could be utilized. For example, the output image resolution of the generator G = {G1, G2} is 2048 × 1024, and the output image resolution of G = {G1, G2, G3} is 4096 × 2048.

The figure may seem confusing at the first glance but all it does is virtually decomposes a single Unet into 2 different generators.

A semantic label map of resolution 1024×512 is passed through G1 to output an image of resolution 1024 × 512. The local enhancer network consists of 3 components: a convolutional front-end, a set of residual blocks, and a transposed convolutional back-end G. The resolution of the input label map to G2 is 2048 × 1024. Different from the global generator network, the input to the residual block is the element-wise sum of two feature maps: the output feature map of G2’s convolutional frontend, and the last feature map of the back-end of the global generator network G1. This helps in integrating the global information from G1 to G2. 

During training, we first train the global generator and then train the local enhancer in the order of their resolutions. We then jointly fine-tune all the networks together. We use this generator design to effectively aggregate global and local information for the image synthesis task.

2. Multi-scale discriminators: High-resolution image synthesis requires the discriminator to have a large receptive field. This would require either a deeper network or larger convolutional kernels, both of which would increase the network capacity and potentially cause overfitting. 

Hence, we use 3 discriminators that have an identical network structure but operate at different image scales. We will refer to the discriminators as D1, D2, and D3. Specifically, we downsample the real and synthesized high-resolution images by a factor of 2 and 4 to create an image pyramid of 3 scales. The discriminators D1, D2 and D3 are then trained to differentiate real and synthesized images at the 3 different scales, respectively. Creating discriminators that look at an image with different levels of receptive field allows the generator to learn about the larger structures as well as the finer details of an image. 

Adding 3 discriminators would also change the loss function of the model. 

This function ensures that outputs of all 3 discriminators are considered before parameter update.

3. Improved adversarial loss: The standard GAN loss consists of a discriminator-induced loss as well as an MAE loss (calculated wrt to the target image). 

This function however doesn’t focus on the intermediate representations of the generator network. Allowing intermediate representations of the generator network to be induced into the loss functions enables the network to focus more on the intricate details during downsampling. This is done by extracting the intermediate representations of the discriminator and generator network and using a slightly modified MAE loss on the extractions.

T is the total number of layers and Ni denotes the number of elements in each layer.

The final loss function then becomes:

Code (check more from the official implementation)

  • Generator architecture


  • Enhanced loss function       

This code is added to the forward pass of the Pix2PixHD model. You can check out the complete code here.


Multiple discriminators

Working example

Please check this colab notebook for a working example.


The modified network was able to generate insanely realistic high resolution images. These images contained proper structures in the image and also included fine details that was absent earlier when training a vanilla pix2pix model. 

These are the images that were generated using the custom pix2pix model. The bottom left shows the instance map which was provided as an input to the pix2pix model and it was mapped to a realistic image of a face.

This is the synthetic image of a car’s view after providing the labeled instance map. As you might have noticed, the model is able to generate highly realistic images. The authors also propose to use this model to generate unique training examples by providing labeled data. This means that the authors plan on generating fake data. Availability of such a dataset would help researchers make giant leaps in domains where a huge amount of labeled training data is needed.