Technical overview of Image Synthesis: Stable Diffusion

Text-to-image models like DALL-E, Imagen, and Stable Diffusion have recently drawn a lot of attention to image synthesis. These models can generate impressive images from simple-looking prompts. Here are a few typical examples of images from Stable Diffusion:

Cat wearing a hat in Van Gogh style

Dog wearing bermuda in Pablo Picasso Style

Looking under the hood of the neural network, we find a familiar structure. The system is made up of three components:

Stable Diffusion Components

Text Encoder (ClipText): Produces embeddings of the input text.
Image Information Creator (UNet + Scheduler): This component is where much of the performance gain over previous models is achieved. It runs for multiple steps to generate image information.
It works entirely in the image information space (or latent space), which makes it faster than previous diffusion models that worked in pixel space. In technical terms, it is made up of a UNet neural network and a scheduling algorithm.
The word “diffusion” describes what happens in this component.
Image Decoder (Autoencoder Decoder): Creates the final picture from the information it receives from the image information creator.

We will describe each of these components below:

Text Encoder

The text encoder generates token embeddings from text prompts. Stable Diffusion V2 uses OpenCLIP for its text encoder. To understand these text embeddings, let us look at how the CLIP model is trained. The training setup is shown in the figure below:

CLIP model training

The model is trained on a dataset of images and their captions. CLIP trains an image encoder and a text encoder together so that an image of a dog and the sentence “a picture of a dog” have similar embeddings. This property is essential when we later synthesize an image in the decoder from the latent representation that the UNet outputs.
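
To make "similar embeddings" concrete, here is a minimal sketch of the kind of symmetric contrastive objective used in CLIP-style training, written in plain Python. The toy embedding values, the `temperature` value, and all function names are illustrative assumptions and are not taken from the CLIP or OpenCLIP code.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct entry."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the i-th image should match the i-th caption."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and caption j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    # Each image must pick out its own caption ...
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # ... and each caption must pick out its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy 2-D embeddings (made up): row 0 is a "dog" pair, row 1 is a "cat" pair.
image_embs = [[1.0, 0.1], [0.1, 1.0]]
aligned_text = [[0.9, 0.2], [0.2, 0.9]]
swapped_text = [[0.2, 0.9], [0.9, 0.2]]

loss_aligned = clip_loss(image_embs, aligned_text)
loss_swapped = clip_loss(image_embs, swapped_text)
print(loss_aligned, loss_swapped)
```

The loss is low when each caption sits closest to its own image and high when the pairs are swapped, which is the pressure that pulls matching images and captions together in embedding space.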

Image Information Creator

Image Generation from Diffusion Models

Diffusion models approach image generation by framing the problem as a noise prediction problem. Fundamentally, Diffusion Models work by destroying training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process. After training, we can use the Diffusion Model to generate data by simply passing randomly sampled noise through the learned denoising process.
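
The forward (noising) half of this process can be sketched in a few lines. The sketch assumes the common DDPM-style formulation with a linear variance schedule, in which a noisy sample at any step can be drawn directly from the clean data. The schedule endpoints, the number of steps, and the function names are illustrative assumptions, not Stable Diffusion's actual settings.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to step t: the fraction of signal kept."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    xt = [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]
    return xt, noise  # the noise is what the network is trained to predict

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [0.5, -0.25, 1.0, 0.0]  # stand-in for a flattened image
x_early, _ = add_noise(x0, 10, alpha_bars, rng)    # still mostly signal
x_late, _ = add_noise(x0, 999, alpha_bars, rng)    # almost pure noise
print(alpha_bars[10], alpha_bars[999])
```

Training then amounts to showing the network `xt` and `t` and asking it to predict `noise`. Early steps keep almost all of the signal, while the last step keeps almost none.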

UNet Training

The Stable Diffusion paper runs the diffusion process not on the pixel images themselves, but on a compressed version of the image. The paper calls this “Departure to Latent Space”. This compression (and later decompression) is done via an autoencoder. The autoencoder's encoder compresses the image into the latent space, and its decoder reconstructs the image using only that compressed information. The process can be described with the following figure:
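
To get a feel for how much the autoencoder compresses, the arithmetic can be written out. The sketch below assumes the configuration used by Stable Diffusion v1 (a spatial downsampling factor of 8 and 4 latent channels); the function names are illustrative.

```python
def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for an image of the given size."""
    if height % downsample_factor or width % downsample_factor:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (latent_channels, height // downsample_factor, width // downsample_factor)

def num_values(shape):
    """Total number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

pixel_shape = (3, 512, 512)          # RGB image in pixel space
latent = latent_shape(512, 512)      # the same image in latent space
ratio = num_values(pixel_shape) / num_values(latent)
print(latent, ratio)
```

Under these assumptions a 512×512 RGB image becomes a 4×64×64 latent, so the UNet handles 48 times fewer values per step than it would in pixel space.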

Training for Stable Diffusion without Text Prompt

UNet Noise Predictor with Text Conditioning

Note that the diffusion process described so far generates images without using any text data. If we deployed this model, it would generate great-looking images, but we would have no way of controlling which image is generated. Feeding the text prompt into the UNet as an additional input is called text conditioning. This is done by adding attention layers to the UNet that attend to the text embeddings, as shown in the figure below:
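
The core of such an attention layer can be sketched as single-head cross-attention, where queries come from the image latents and keys and values come from the text embeddings. This is a simplified illustration in plain Python: a real UNet uses learned projection matrices, multiple heads, and residual connections, and the identity weights and toy vectors below are assumptions made for readability.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Each latent position gathers information from the text tokens it attends to."""
    queries = matmul(latent_tokens, w_q)   # from the image latents
    keys = matmul(text_tokens, w_k)        # from the text embeddings
    values = matmul(text_tokens, w_v)      # from the text embeddings
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values), weights

# Toy inputs: 3 latent positions and 2 text tokens, all of dimension 2.
latent_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections

output, weights = cross_attention(latent_tokens, text_tokens, identity, identity, identity)
print(weights)
```

Each row of `weights` says how strongly one latent position attends to each text token, so the prompt can influence every spatial location of the image being denoised.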

UNet with Text Conditioning

So during training the model looks as follows:

During inference, the image encoder is removed and the UNet is fed random noise as input, together with the text embeddings of the prompt. Below we will see how to install and run the Stable Diffusion model.
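
Conceptually, the inference loop starts from random noise and repeatedly removes the noise that the network predicts. The toy sketch below uses a deterministic DDIM-style update as one possible scheduler. It leaves out text conditioning and replaces the trained UNet with a hypothetical `oracle_noise_predictor` that already knows the target, so it only illustrates the shape of the loop, not real image generation. The schedule values and all names are assumptions.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction per step for an assumed linear schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def oracle_noise_predictor(x_t, t, alpha_bars, target):
    """Hypothetical stand-in for the UNet: the exact noise separating x_t from a known target."""
    a = alpha_bars[t]
    return [(x - math.sqrt(a) * x0) / math.sqrt(1.0 - a) for x, x0 in zip(x_t, target)]

def sample(target, num_steps=1000, seed=0):
    """Start from pure noise and denoise step by step (deterministic update)."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for t in range(num_steps - 1, -1, -1):
        eps = oracle_noise_predictor(x, t, alpha_bars, target)
        a_t = alpha_bars[t]
        # Estimate the clean latent implied by the predicted noise.
        x0_pred = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                   for xi, e in zip(x, eps)]
        # Re-noise the estimate to the previous, less noisy level.
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        x = [math.sqrt(a_prev) * p + math.sqrt(1.0 - a_prev) * e
             for p, e in zip(x0_pred, eps)]
    return x

target = [0.5, -0.25, 1.0, 0.0]  # stand-in for a clean latent
result = sample(target)
print(result)
```

In the real model the noise prediction comes from the text-conditioned UNet, and the final latent is passed to the autoencoder's decoder to produce the image.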

Running Stable Diffusion V1

Clone the git repository at https://github.com/CompVis/stable-diffusion. Then, in the main directory, find the environment.yaml file and create a conda environment as follows:


conda env create -f environment.yaml
conda activate ldm

Inside the environment, run the following code to generate an image for a given prompt:


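from torch import autocast
from diffusers import StableDiffusionPipeline

# Load the v1.4 weights from the Hugging Face Hub and move the pipeline to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", use_auth_token=False
).to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"

# Run the pipeline under automatic mixed precision on the GPU.
with autocast("cuda"):
    image = pipe(prompt).images[0]  # first generated PIL image
image.save("astronaut_rides_horse.png")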