Project 5: Diffusion Models

Part A: Using DeepFloyd

Sampling from the Model

Using random seed 42, we explore DeepFloyd IF's sampling capabilities with varying numbers of inference steps. Here we show results for the prompt "a rocket ship" with 20 vs. 50 inference steps. As the number of inference steps increases, the image becomes more refined and shows more detail, even in the background.

20 Steps

20 inference steps

50 Steps

50 inference steps

Forward Diffusion

This part includes the forward diffusion process that progressively adds noise to images. It follows the equation:

\[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0,1) \]
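
A minimal sketch of this forward process is shown below; it assumes a precomputed tensor `alphas_cumprod` of the ᾱ values (as provided by the DeepFloyd scheduler), and the helper name is my own.

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image x_0 to timestep t (sketch; alphas_cumprod assumed given)."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)                          # epsilon ~ N(0, 1)
    return a_bar.sqrt() * im + (1 - a_bar).sqrt() * eps
```
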
t=250

Noise level t=250

t=500

Noise level t=500

t=750

Noise level t=750

Classical Denoising

In this part, I tried denoising with classical Gaussian blur filtering; the results show its limitations at high noise levels.
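
For reference, a minimal sketch of this classical baseline using torchvision's `gaussian_blur`; the kernel size and sigma here are illustrative choices, not the exact values used for the figures.

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(noisy_im, kernel_size=5, sigma=2.0):
    """Classical 'denoising': simply blur away the high-frequency noise."""
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```
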

Original t=250 Denoised t=250

Original vs Denoised (t=250)

Original t=500 Denoised t=500

Original vs Denoised (t=500)

Original t=750 Denoised t=750

Original vs Denoised (t=750)

One Step Denoising

In this part, I used a pretrained UNet to perform one-step denoising with the prompt "a high quality photo". For each timestep, we visualize the original image, noisy version, and denoised result.
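
A minimal sketch of the one-step estimate: given the UNet's predicted noise ε̂ (here `eps_hat`, assumed to be already extracted from the model output), we solve the forward equation for x₀.

```python
def one_step_denoise(x_t, t, eps_hat, alphas_cumprod):
    """Recover an x_0 estimate from x_t and the UNet's predicted noise."""
    a_bar = alphas_cumprod[t]
    return (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
```
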

Original t=250 Noisy t=250 UNet Denoised t=250

t=250: Original → Noisy → Denoised

Original t=500 Noisy t=500 UNet Denoised t=500

t=500: Original → Noisy → Denoised

Original t=750 Noisy t=750 UNet Denoised t=750

t=750: Original → Noisy → Denoised

Iterative Denoising

While one-step denoising is an improvement over Gaussian blur, the quality still degrades at higher noise levels. Diffusion models are designed to work iteratively, in principle starting from pure noise x₁₀₀₀ and progressively denoising to x₀. However, running 1000 denoising steps is computationally expensive.

We implement a more efficient approach using strided timesteps (step size 30, from 990 to 0), where each step follows the interpolation formula:

\[ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t + v_\sigma \]

Here, x_t is the current noisy image, x_t' is the less noisy image at the next (smaller) strided timestep t', α_t = ᾱ_t/ᾱ_t', and β_t = 1 − α_t. The update effectively interpolates between the clean-image estimate and the current noisy image, with the variance term v_σ predicted by the model. Starting from the 10th strided timestep (t = 690), we show the iterative denoising results and compare them with the one-step and Gaussian-blur approaches.
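
Below is a minimal sketch of one such update, reusing the `forward`/`one_step_denoise` helpers sketched earlier; passing the variance term v_σ in as an argument rather than computing it from the model output is a simplifying assumption.

```python
def iterative_denoise_step(x_t, t, t_prime, eps_hat, alphas_cumprod, v_sigma):
    """One strided denoising step from timestep t down to t' < t."""
    a_bar_t, a_bar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = a_bar_t / a_bar_tp                                 # alpha_t = abar_t / abar_t'
    beta_t = 1 - alpha_t
    x0_hat = one_step_denoise(x_t, t, eps_hat, alphas_cumprod)   # clean-image estimate
    return (a_bar_tp.sqrt() * beta_t / (1 - a_bar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t \
         + v_sigma
```
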

Iterative Denoising Progress

Noisy Campanile at t=690 Noisy Campanile at t=540 Noisy Campanile at t=390 Noisy Campanile at t=240 Noisy Campanile at t=90

Progressive denoising steps t = (690, 540, 390, 240, 90)

Method Comparison

Iterative Denoised One-step Denoised Gaussian Blurred

Iterative vs One-step vs Gaussian

Diffusion Model Sampling

While previous sections focused on denoising given images, diffusion models can also generate images from scratch. Starting with pure random noise (i_start = 0), we can apply our iterative denoising process to gradually form new images. Using the prompt "a high quality photo", we demonstrate the model's ability to generate new images from random initialization.
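
As a short sketch, generation reduces to the same loop run from pure noise; this assumes an `iterative_denoise(x, i_start, prompt)` helper built from the step above and the 64×64 stage-1 resolution.

```python
import torch

x = torch.randn(1, 3, 64, 64)     # pure noise at the stage-1 resolution (assumed 64x64)
sample = iterative_denoise(x, i_start=0, prompt="a high quality photo")
```
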

Sample 1

Sample 1

Sample 2

Sample 2

Sample 3

Sample 3

Sample 4

Sample 4

Sample 5

Sample 5

Classifier-Free Guidance

To improve generation quality, we implement Classifier-Free Guidance (CFG), which combines conditional and unconditional noise estimates. The noise estimate is computed as:

\[ \epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u) \]

Here, ϵᵤ is the unconditional noise estimate (using an empty prompt ""), ϵ_c is the conditional estimate (using our desired prompt), and γ controls the guidance strength. When γ > 1, the model produces higher quality but less diverse images. We demonstrate results using γ = 7 with the conditional prompt "a high quality photo".
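
A minimal sketch of the CFG combination; the UNet call and the prompt embeddings (`cond_emb` for the target prompt, `uncond_emb` for the empty prompt) are simplified stand-ins.

```python
def cfg_noise_estimate(unet, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    """Combine conditional and unconditional noise estimates with guidance scale gamma."""
    eps_c = unet(x_t, t, cond_emb)      # conditional estimate (desired prompt)
    eps_u = unet(x_t, t, uncond_emb)    # unconditional estimate (empty prompt "")
    return eps_u + gamma * (eps_c - eps_u)
```
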

CFG Sample 1

CFG Sample 1

CFG Sample 2

CFG Sample 2

CFG Sample 3

CFG Sample 3

CFG Sample 4

CFG Sample 4

CFG Sample 5

CFG Sample 5

Image-to-image Translation

In this part, we use the diffusion model to edit existing images through controlled noising and denoising. When we add noise to an image and then denoise it, the model must "hallucinate" details during reconstruction, effectively projecting the image back onto the natural image manifold. The amount of noise added controls how far the output can deviate from the input.

Following the SDEdit algorithm, we apply varying levels of noise and use the prompt "a high quality photo" to guide the reconstruction. A higher i_start corresponds to a smaller starting timestep (less added noise), so more of the original image is preserved; a lower i_start allows the output to deviate further from the input.
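
A sketch of the SDEdit loop under the same assumptions as the earlier snippets (`forward`, `iterative_denoise`, and the `strided_timesteps` list are assumed helpers):

```python
def sdedit(im, i_start, prompt, strided_timesteps, alphas_cumprod):
    """Noise the input to the chosen starting timestep, then denoise it back."""
    t_start = strided_timesteps[i_start]            # larger i_start -> smaller t -> less noise
    x_t = forward(im, t_start, alphas_cumprod)      # forward-noise the input image
    return iterative_denoise(x_t, i_start=i_start, prompt=prompt)
```
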

Test Image Edits

Step 1

i_start = 1

Step 3

i_start = 3

Step 5

i_start = 5

Step 7

i_start = 7

Step 10

i_start = 10

Step 20

i_start = 20

Apple Image Edits

Custom 1 Step 1

i_start = 1

Custom 1 Step 3

i_start = 3

Custom 1 Step 5

i_start = 5

Custom 1 Step 7

i_start = 7

Custom 1 Step 10

i_start = 10

Custom 1 Step 20

i_start = 20

Orange Image Edits

Custom 2 Step 1

i_start = 1

Custom 2 Step 3

i_start = 3

Custom 2 Step 5

i_start = 5

Custom 2 Step 7

i_start = 7

Custom 2 Step 10

i_start = 10

Custom 2 Step 20

i_start = 20

Editing Hand-Drawn and Web Images

The image-to-image translation technique is particularly effective when starting with non-realistic images like sketches or paintings. The diffusion model helps project these abstract representations onto the natural image manifold, creating photorealistic interpretations while maintaining key elements.

Web Image Edit

Web Original

Original

Web Step 1

i_start = 1

Web Step 3

i_start = 3

Web Step 5

i_start = 5

Web Step 7

i_start = 7

Web Step 10

i_start = 10

Web Step 20

i_start = 20

Hand-Drawn Image 1

Hand 1 Original

Original

Hand 1 Step 1

i_start = 1

Hand 1 Step 3

i_start = 3

Hand 1 Step 5

i_start = 5

Hand 1 Step 7

i_start = 7

Hand 1 Step 10

i_start = 10

Hand 1 Step 20

i_start = 20

Hand-Drawn Image 2

Hand 2 Original

Original

Hand 2 Step 1

i_start = 1

Hand 2 Step 3

i_start = 3

Hand 2 Step 5

i_start = 5

Hand 2 Step 7

i_start = 7

Hand 2 Step 10

i_start = 10

Hand 2 Step 20

i_start = 20

Inpainting

We implement inpainting following the RePaint paper, where we selectively regenerate masked regions while preserving the rest of the image. Given an image x_orig and a binary mask m, at each denoising step we combine:

\[ x_t \leftarrow \mathbf{m}x_t + (1-\mathbf{m})\text{forward}(x_{\text{orig}}, t) \]

This ensures that unmasked regions (m=0) maintain their original content with appropriate noise levels, while masked regions (m=1) are generated through the diffusion process.
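
A sketch of this per-step combination, reusing the `forward` helper from before; `mask` is the binary tensor m, with 1 over the region to regenerate.

```python
def repaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep generated content inside the mask; outside it, paste the re-noised original."""
    return mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```
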

Test Image Inpainting

Original

Original Image

Mask

Mask

Result

Inpainted Result

Custom Image 1

Custom 1 Original

Original Image

Custom 1 Mask

Mask

Custom 1 Result

Inpainted Result

Custom Image 2

Custom 2 Original

Original Image

Custom 2 Mask

Mask

Custom 2 Result

Inpainted Result

Text-Conditioned Image-to-image Translation

Building on our previous image-to-image translation, we now add text conditioning to guide the projection process. Instead of using a neutral prompt, we use specific text prompts (in this example "a rocket ship") to influence the reconstruction direction.

Test Image

Step 1

i_start = 1

Step 3

i_start = 3

Step 5

i_start = 5

Step 7

i_start = 7

Step 10

i_start = 10

Step 20

i_start = 20

Custom Image 1

Custom 1 Step 1

i_start = 1

Custom 1 Step 3

i_start = 3

Custom 1 Step 5

i_start = 5

Custom 1 Step 7

i_start = 7

Custom 1 Step 10

i_start = 10

Custom 1 Step 20

i_start = 20

Custom Image 2

Custom 2 Step 1

i_start = 1

Custom 2 Step 3

i_start = 3

Custom 2 Step 5

i_start = 5

Custom 2 Step 7

i_start = 7

Custom 2 Step 10

i_start = 10

Custom 2 Step 20

i_start = 20

Visual Anagrams

We create optical illusions that reveal different images when viewed upright versus upside down. The technique combines noise estimates from two different prompts:

\[ \begin{aligned} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \\ \epsilon &= (\epsilon_1 + \epsilon_2)/2 \end{aligned} \]

By averaging the noise estimates from normal and flipped orientations, we create images that blend two different interpretations.
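
A sketch of the combined estimate for one denoising step; the prompt embeddings `p1_emb`/`p2_emb` and the flip along the image height axis are assumptions.

```python
import torch

def anagram_noise_estimate(unet, x_t, t, p1_emb, p2_emb):
    """Average the upright estimate for prompt 1 with the flipped estimate for prompt 2."""
    eps_1 = unet(x_t, t, p1_emb)
    eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, p2_emb), dims=[-2])
    return (eps_1 + eps_2) / 2
```
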

Anagram 1: Old Man ↔ Campfire

Upright Old Man

"an oil painting of an old man"

Flipped Campfire

"an oil painting of people around a campfire"

Anagram 2: Mountain Village ↔ College Students

Upright Mountain

"an oil painting of a snowy mountain village"

Flipped Students

"an oil painting of college students"

Anagram 3: Skull ↔ Rose

Upright Skull

"a lithograph of a skull"

Flipped Rose

"a lithograph of a rose"

Hybrid Images

We implement Factorized Diffusion to create hybrid images that appear different when viewed from different distances. The technique combines low-frequency components from one prompt with high-frequency components from another using the following algorithm:

\[ \begin{aligned} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{UNet}(x_t, t, p_2) \\ \epsilon &= f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2) \end{aligned} \]

Using Gaussian blur (kernel size 33, sigma 2) for frequency separation, we create images that reveal different content at different viewing distances.
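
A sketch of the frequency split, using the same Gaussian blur as a low-pass filter (kernel size 33, sigma 2); the UNet call and prompt embeddings are simplified stand-ins as before.

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, p1_emb, p2_emb, kernel_size=33, sigma=2.0):
    """Low frequencies from prompt 1 (visible from afar), high frequencies from prompt 2 (up close)."""
    eps_1 = unet(x_t, t, p1_emb)
    eps_2 = unet(x_t, t, p2_emb)
    low = TF.gaussian_blur(eps_1, kernel_size=kernel_size, sigma=sigma)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```
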

Skull/Waterfall Hybrid

Far: "a lithograph of a skull" / Close: "a lithograph of waterfalls"

Car/Campfire Hybrid

Far: "an oil painting of a car" / Close: "an oil painting of people around a campfire" (Think of the two white blocks in the middle right of the image as car lights.)

Mountain/Baby Hybrid

Far: "a photo of a mountain" / Close: "a photo of a baby crying"

Part B: Training Diffusion Models

Unconditional UNet

Implementation and training of a basic unconditional UNet for denoising MNIST digits. The training objective is to denoise images with σ = 0.5 noise level, optimizing:

\[ L = \mathbb{E}_{z,x}\|D_\theta(z) - x\|^2 \]

Training Parameters:

Images are noised on the fly when fetched from the dataloader, so each epoch sees fresh noise patterns, which improves generalization.
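
A sketch of one training pass under these choices; `denoiser`, `optimizer`, and `dataloader` are assumed to be defined elsewhere.

```python
import torch
import torch.nn.functional as F

sigma = 0.5
for x, _ in dataloader:                      # clean MNIST digits
    z = x + sigma * torch.randn_like(x)      # noise added on the fly, fresh each epoch
    loss = F.mse_loss(denoiser(z), x)        # L = E || D_theta(z) - x ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
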

UNet Architecture

UNet Architecture

Unconditional UNet architecture

UNet Operations

Standard UNet Operations

Noising Process Visualization

Noising Process

Visualization of noising process with σ=[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]

Training Progress

Training Loss

Training loss curve over iterations

Sample Results

Epoch 1 Results

Results after first epoch

Epoch 5 Results

Results after fifth epoch

Out-of-Distribution Testing

Out of Distribution Results

Results on varying noise levels σ=[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]

Time-Conditioned UNet

Extension of the UNet with time conditioning for iterative denoising. The model is conditioned on the timestep t through fully connected blocks (FCBlocks) that inject time information at multiple resolutions. By modulating the unflatten and upsampling operations with learned time embeddings, the network learns to handle varying noise levels and can perform iterative denoising.

Training Objective:

\[ L = \mathbb{E}_{\epsilon,x_0,t}\|\epsilon_\theta(x_t, t) - \epsilon\|^2 \]
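
A sketch of one training step for this objective; the names (`unet`, `optimizer`, `alphas_cumprod`), the number of diffusion steps T, and the normalized-time input are assumptions.

```python
import torch
import torch.nn.functional as F

T = 300                                              # number of diffusion steps (assumed)
for x0, _ in dataloader:
    t = torch.randint(0, T, (x0.shape[0],))          # a random timestep per image
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    loss = F.mse_loss(unet(x_t, t / T), eps)         # predict the added noise
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```
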

Training Algorithm

Time UNet Training

Sampling Algorithm

Time UNet Sampling

Training Loss

Epoch 5 Results

Epoch 20 Results

Class-Conditioned UNet

Extension of the time-conditioned UNet by adding class conditioning. The model receives both time t and class c information through parallel FCBlocks. The class input is converted to one-hot vectors, which are zeroed with 10% probability so the model also trains unconditionally. As with time conditioning, both the unflatten and upsampling operations are modulated with learned time and class embeddings, which enables generation of specific digits.

The model uses the same training parameters as the time-conditioned UNet.
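
A sketch of how the class input can be prepared, matching the description above (one-hot vectors, zeroed with 10% probability for unconditional training); the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot class vectors, randomly zeroed so the model also learns the unconditional case."""
    c = F.one_hot(labels, num_classes).float()
    drop = (torch.rand(labels.shape[0], 1) < p_uncond).float()
    return c * (1 - drop)                    # zeroed rows act as the null class
```

At sampling time, the zeroed class vector plays the same role as the empty prompt in classifier-free guidance.
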

Training Algorithm

Class UNet Training

Sampling Algorithm

Class UNet Sampling

Class-Conditioned Training Loss

Class-Conditioned Samples

Class-Conditioned Samples