DALL-E – Creating images from text

Code – openai/DALL-E: PyTorch package for the discrete VAE used for DALL·E.

Paper – https://arxiv.org/pdf/2102.12092.pdf


What is DALL-E?

On 5 January 2021, OpenAI unveiled DALL-E, its text-to-image generation model. The model has 12 billion parameters; it takes a textual description as input and generates images that match its understanding of that description.

For example,

“an illustration of an avocado in a helmet walking a dog.”

OpenAI has not yet open-sourced the entire code base because it is still studying the societal implications of releasing the model. The model could disrupt journalism, graphic design, content writing, and stock photography, with all the good and bad that entails.

What does DALL-E achieve?

As discussed previously, DALL-E generates images from a textual description of the scene. The network can create plausible images from scratch by reading compositional sentences. A few notable properties of the model are discussed below.

  1. Ability to produce robust results even when peculiar sentences are fed

One common source of error in machine learning is the memorization of training data by a model. When a model memorizes the training data to improve its training score, it cannot generalize well to unseen data during testing/inference. Upon deeper inspection of the DALL-E model, researchers found that it could do exceptionally well even on bizarre composite sentences. It is improbable that the model had seen such sentences elsewhere. DALL-E can combine unrelated concepts to come up with something totally new.

For example,  

“A snail made of harp. A snail with the texture of a harp” was fed to the model.

  2. Improvising contextual details

When you listen to a sentence and form an image, there are many minor properties of the image you make up yourself. This is because a description of an image misses out on many tiny details in the picture. For example, if I say “a boy playing tennis”, I have not specified the boy’s t-shirt color. However, our brain comes up with an imaginary color to fill in the T-shirt. Thus, any contextual description of an image can have an infinite number of corresponding images. The researchers tested the model on such underspecified sentences and reported that the model could “imagine” realistic details without prior knowledge.

“A road sign with an image of a blue strawberry”

Notice that the model was not given any information about the background, the shape and color of the sign, or the time of day, yet it still produced a plausible visualization of the text. The next example is similar: the researchers did not specify details such as the color of the sign, the background, the font style, or the font size.

“a storefront that has OpenAI written on it.”

  3. Zero-Shot visual reasoning

A machine learning model is trained by giving it access to a massive amount of training data and optimizing its parameters to learn the distribution and generalize on unseen data. The primary goal of an ML model is to be reasonably robust and accurate on unseen examples. Even after trying extensively, models sometimes fail to generalize and produce poor test results.

The achievement of DALL-E, on the other hand, is astonishing. Zero-shot reasoning is the ability of a model to perform tasks for which it was not explicitly designed or trained. There is no particular training method that instills this ability. Like GPT-3, the DALL-E model can perform tasks that it was not built to do, with reasonably good results. When a colossal model like DALL-E is trained on a large dataset, one can never be sure exactly what it is learning.

DALL-E can perform several kinds of image-to-image translation when prompted with the right input. Note that the researchers made no changes to the DALL-E model to improve image-to-image translation. These abilities emerged on their own during training.

“The exact same teapot on the top with a heart on the bottom”

  4. Knowledge of time

When we think about objects from the past or future, we do not imagine them as entirely identical to day-to-day things. When we think about cars in the future, we envision electric vehicles with autonomous driving and fast transport. This is, however, far away from the current scenario. On the other hand, when we think of cars in the past, we imagine slow, inefficient, and noisy machines that emit many pollutants. 

Similarly, the DALL-E model understands the concepts of time and can imagine the differences between objects that arise due to the passage of time.

“A photo of a phone from ‘XYZ’”, where ‘XYZ’ is the caption shown in each of the above images.

Model Architecture

DALL-E is an extension of the game-changing GPT-3 model introduced by the OpenAI team in May 2020. GPT-3 has a mind-boggling 175 billion parameters and was trained on hundreds of billions of words.

From the preprint released by the team, we know that the backbone of this network is a simple decoder-only transformer that receives the text and image as a single stream of data. Since their introduction, transformers have been revolutionary in the field of natural language processing.

However, training the network is a challenge in itself. If the transformer is trained directly on pixels, it tends to spend its capacity on short-range dependencies between pixels rather than on the low-frequency structure of the image. This low-frequency structure is what makes the objects in an image recognizable to the human eye.

To overcome this challenge, the researchers used a two-phase training process that maximizes the evidence lower bound on the joint likelihood of the model’s distribution over images, captions, and the encoded image tokens.
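For reference, the bound can be written out as below. This follows my reading of equation (1) in the paper linked at the top, so treat the notation as a sketch and check the paper for the exact statement. Here x is the image, y is the caption, z is the grid of image tokens, q_φ is the dVAE encoder, p_θ is the dVAE decoder, p_ψ is the transformer's joint distribution over text and image tokens, and β is a weight on the KL term.

```latex
\ln p_{\theta,\psi}(x, y) \;\ge\;
\mathbb{E}_{z \sim q_\phi(z \mid x)}
\Big( \ln p_\theta(x \mid y, z)
      \;-\; \beta \, D_{\mathrm{KL}}\big( q_\phi(y, z \mid x),\; p_\psi(y, z) \big) \Big)
```

Roughly speaking, the first training phase learns φ and θ (the dVAE), and the second phase fixes them and learns ψ (the transformer).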

First training phase: 

In the first training phase, the researchers trained a discrete variational autoencoder (dVAE) that is responsible for generating the image tokens. Using every pixel of a 256×256 image as a token is impractical because it would make the sequence that the transformer has to model enormous. Instead, the autoencoder compresses the image into a 32×32 grid in which each element takes one of 8192 possible values. This grid is then used as the image token data.
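To get a feel for how much the dVAE shortens the sequence, here is a quick back-of-the-envelope calculation. The grid and codebook sizes come from the text above; the three color channels are my assumption (an RGB image).

```python
import math

IMAGE_SIDE = 256       # the input image is 256x256
CHANNELS = 3           # assumption: RGB image
GRID_SIDE = 32         # the dVAE compresses the image to a 32x32 token grid
CODEBOOK_SIZE = 8192   # each token takes one of 8192 values

pixel_values = IMAGE_SIDE * IMAGE_SIDE * CHANNELS   # raw numbers per image
image_tokens = GRID_SIDE * GRID_SIDE                # tokens per image
reduction = pixel_values // image_tokens            # how much shorter the sequence is
bits_per_token = math.log2(CODEBOOK_SIZE)           # information carried by one token

print(pixel_values)    # 196608
print(image_tokens)    # 1024
print(reduction)       # 192
print(bits_per_token)  # 13.0
```

Under these assumptions the transformer sees a sequence that is 192 times shorter than the raw pixel values.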

Second training phase:

This phase is where the huge 12-billion-parameter transformer model comes into play. The 32×32 = 1024 image tokens are concatenated with up to 256 text tokens. The image tokens are obtained by argmax sampling from the dVAE encoder’s output logits, that is, by picking the highest-scoring codebook entry at each grid position. An autoregressive transformer is then trained to model the joint distribution over this combined token stream.

After the model is trained, it is used to generate images from captions. For each caption, the model is asked to generate several candidate images. A separate CLIP model then ranks the candidates, and the images that best match the caption are kept.
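The way the token stream is put together can be sketched in a few lines of plain Python. This is only an illustration: the random logits stand in for a real dVAE encoder, and the tiny vocabulary sizes (TOY_TEXT_VOCAB, TOY_CODEBOOK) are hypothetical values chosen to keep the demo fast.

```python
import random

TEXT_SEQ_LEN = 256        # text tokens per caption (from the text above)
IMAGE_SEQ_LEN = 32 * 32   # image tokens per image (from the text above)
TOY_TEXT_VOCAB = 100      # hypothetical, small for the demo
TOY_CODEBOOK = 16         # hypothetical; the real codebook has 8192 entries


def argmax(values):
    """Return the index of the largest entry in a list of numbers."""
    return max(range(len(values)), key=lambda i: values[i])


rng = random.Random(0)

# Stand-in for the dVAE encoder output: one logit vector per grid position.
encoder_logits = [
    [rng.gauss(0.0, 1.0) for _ in range(TOY_CODEBOOK)]
    for _ in range(IMAGE_SEQ_LEN)
]

# Argmax turns each logit vector into a single discrete image token.
image_tokens = [argmax(logits) for logits in encoder_logits]

# Stand-in for a tokenized caption.
text_tokens = [rng.randrange(TOY_TEXT_VOCAB) for _ in range(TEXT_SEQ_LEN)]

# The transformer sees text and image tokens as one stream, text first.
stream = text_tokens + image_tokens
print(len(stream))  # 1280
```

The transformer is then trained to predict each token in this stream from the tokens before it.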
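The reranking step can be sketched as follows. CLIP scores a caption and an image by comparing their embeddings, so this toy version ranks candidates by cosine similarity. The three-dimensional embeddings are made up by hand for illustration; a real setup would get them from CLIP's text and image encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(caption_embedding, image_embeddings, top_k):
    """Return the indices of the top_k best-matching images, best first."""
    scores = [cosine_similarity(caption_embedding, e) for e in image_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


# Hypothetical embeddings: one caption and four generated candidates.
caption = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],    # unrelated to the caption
    [0.9, 0.1, 0.0],    # close match
    [0.5, 0.5, 0.0],    # partial match
    [-1.0, 0.0, 0.0],   # opposite direction
]

best = rerank(caption, candidates, top_k=2)
print(best)  # [1, 2]
```

Generating many candidates and keeping only the best-scoring few trades extra compute for better final images.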

Set up your own model

OpenAI has not yet open-sourced its code base for DALL-E because it is still studying the social implications of the new model. Other people, however, have attempted to replicate the work and have open-sourced their code. These replications achieve much of what OpenAI’s model does, but because they are far smaller than the original DALL-E, their results are not as realistic.

You can find the open-source implementation of DALL-E here. In the following section, we will discuss the code of the DALL-E model and how the various important functions work. You can read more about the discrete VAE here.

DALL-E Model Code

The author of this work has implemented the DALL-E model in PyTorch. If you are not familiar with PyTorch, then you may check out its documentation.

  • The Constructor (init function)

Like any other init function, the init function of the DALL-E class initializes and assigns the essential parameters of the model. First, it checks that the vae argument is an instance of the DiscreteVAE class. Then it creates embedding layers for the text and image tokens.

Second, it adds the number of distinct text tokens to the number of distinct image tokens to get the total number of tokens. It then initializes a Transformer object and sets the number of outputs to this total, so the model can predict any token in the combined text and image stream.
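The bookkeeping done in the constructor can be sketched like this. The class and field names are hypothetical and chosen for illustration; they are not taken from the repository. I am also assuming that "number of tokens" means vocabulary size, and the text vocabulary of 10000 is a made-up value.

```python
from dataclasses import dataclass


@dataclass
class ToyDalleConfig:
    """Hypothetical config; names are illustrative, not the repo's API."""
    num_text_tokens: int    # size of the text vocabulary
    text_seq_len: int       # maximum caption length in tokens
    num_image_tokens: int   # size of the dVAE codebook
    image_grid_side: int    # the dVAE emits a side x side grid of tokens

    @property
    def image_seq_len(self):
        # Number of image tokens per image.
        return self.image_grid_side ** 2

    @property
    def total_tokens(self):
        # Combined vocabulary: one output logit per possible token.
        return self.num_text_tokens + self.num_image_tokens

    @property
    def total_seq_len(self):
        # Length of the combined text + image stream.
        return self.text_seq_len + self.image_seq_len


config = ToyDalleConfig(
    num_text_tokens=10000,   # made-up value
    text_seq_len=256,
    num_image_tokens=8192,
    image_grid_side=32,
)
print(config.total_tokens)   # 18192
print(config.total_seq_len)  # 1280
```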

  • Generating images

This member function is decorated with torch.no_grad() to disable gradient computation, because the model does not need to be updated while it is generating images. The function is used at test time to generate images.

The function first truncates the text tokens to text_seq_len, the maximum text sequence length that can be fed into the DALL-E model.

Next, it checks whether we have passed an image to the model to be used as a starting point. Earlier in this blog, I showed you the results for a “teapot with a heart”. To generate this example, the researchers had fed the model a black mug so that it had an idea of what they were expecting. This img argument is optional and can be used to generate more realistic images if needed.

If such an image is provided, the dVAE encodes it and the resulting image tokens are appended to the text tokens.

The final step is to feed the token data through the forward pass of the DALL-E model. This is done inside a for loop because the model is autoregressive and produces the remaining image tokens one at a time. The image tokens are then sliced from the generated sequence and fed into the dVAE’s decoder, which returns the decoded image.

Optionally, CLIP is used to rank the images by how well they match the caption.
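The generation steps described above can be sketched in plain Python. This is a simplified illustration, not the repository's code: toy_next_token_logits is a hypothetical stand-in for the transformer, the optional primer is assumed to be already encoded into image tokens, and the loop picks the highest-scoring token greedily, whereas a real implementation would typically sample.

```python
import random


def toy_next_token_logits(tokens, vocab_size):
    """Hypothetical stand-in for the transformer: logits for the next token."""
    rng = random.Random(len(tokens))  # deterministic toy "model"
    return [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]


def generate_image_tokens(text_tokens, text_seq_len, image_seq_len,
                          vocab_size, primer=None):
    """Autoregressively fill in image tokens after the caption."""
    # Step 1: truncate the caption to the maximum text length.
    text_tokens = text_tokens[:text_seq_len]

    # Step 2: optionally start from the tokens of a priming image.
    image_tokens = list(primer or [])[:image_seq_len]

    # Step 3: produce the remaining image tokens one at a time.
    while len(image_tokens) < image_seq_len:
        logits = toy_next_token_logits(text_tokens + image_tokens, vocab_size)
        next_token = max(range(vocab_size), key=lambda i: logits[i])  # greedy
        image_tokens.append(next_token)

    # In the real model these tokens would now go to the dVAE decoder.
    return image_tokens


tokens = generate_image_tokens(
    text_tokens=list(range(300)),  # deliberately longer than text_seq_len
    text_seq_len=256,
    image_seq_len=16,              # tiny grid for the demo
    vocab_size=8,
)
print(len(tokens))  # 16
```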

  • Forward pass of the model

A forward pass of the DALL-E model takes text, image, mask, and return_loss as arguments. Only the text argument is required because, at test time, we do not necessarily feed an image into the model.

If an image has been provided as an argument to the function, then we run our pretrained dVAE on it and generate the encoded token data from the image. The image token data is then concatenated with our text token data.

Once the token sequence is ready, we feed it into the transformer object (the one we initialized in the init function), which models the data autoregressively.

When we are not training, return_loss is false and the function returns the logits output by the transformer. During training, return_loss is true and the function computes and returns the total loss for one forward pass.
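The two return paths can be sketched as below. This is a minimal illustration under my own assumptions: the logits are passed in directly instead of coming from a transformer, and the loss is a plain average of next-token cross-entropy. A real implementation may weight text and image tokens differently.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def toy_forward(logits_per_position, tokens, return_loss=False):
    """Return logits at inference time, or the mean next-token loss in training.

    logits_per_position[i] is the prediction for tokens[i + 1].
    """
    if not return_loss:
        return logits_per_position

    losses = [
        cross_entropy(logits_per_position[i], tokens[i + 1])
        for i in range(len(tokens) - 1)
    ]
    return sum(losses) / len(losses)


tokens = [0, 1, 2]
# A "model" that is completely unsure: equal logits for all three tokens.
uniform_logits = [[0.0, 0.0, 0.0] for _ in tokens]

logits = toy_forward(uniform_logits, tokens)                  # inference path
loss = toy_forward(uniform_logits, tokens, return_loss=True)  # training path
print(round(loss, 4))  # 1.0986
```

With uniform predictions over three tokens the loss is ln 3, which is what the sketch prints.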

A few unique examples

These are a few examples from a pretrained model that Kobiso trained in an attempt to replicate the original DALL-E.

  1. This bird has a very dull color.

  2. This bird is very shiny and red in color

  3. “A pink colored bird” 

Note that the ranking here is given by the CLIP model.

How does DALL-E change the game?

Researchers have previously worked on text-to-image models and have produced reasonably good ones. However, these earlier models have issues with both image quality and imagination [2]. Because they try to model the short-range dependencies between pixels, they cannot generate objects correctly; the objects they produce tend to be a big blob of pixels in the middle of the image.

DALL-E models objects correctly by rethinking the training process and splitting it into two phases, which lets it focus on the low-frequency structure within an image. In addition, because DALL-E has been trained on a massive amount of data, it can imagine and invent the necessary details even for bizarre sentences.