Project 5: Diffusion Models

Part A: Using DeepFloyd

Sampling from the Model

Using random seed 42, we explore DeepFloyd IF's sampling capabilities with varying numbers of inference steps. Here we show results for the prompt "a rocket ship" with 20 vs. 50 inference steps. As the number of inference steps increases, the image becomes more refined and shows more detail, even in the background.

20 Steps

20 inference steps

50 Steps

50 inference steps

Forward Diffusion

This part includes the forward diffusion process that progressively adds noise to images. It follows the equation:

\[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon \quad \text{where} \quad \epsilon \sim \mathcal{N}(0,1) \]
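
A minimal sketch of this forward process is shown below; it assumes a precomputed tensor `alphas_cumprod` of the ᾱ values (as provided by the DeepFloyd scheduler), and the helper name is my own.

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image x_0 to timestep t (sketch; alphas_cumprod assumed given)."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)                          # epsilon ~ N(0, 1)
    return a_bar.sqrt() * im + (1 - a_bar).sqrt() * eps
```
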
t=250

Noise level t=250

t=500

Noise level t=500

t=750

Noise level t=750

Classical Denoising

In this part, I tried denoising with classical Gaussian blur filtering; the results show its limitations at high noise levels.
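
For reference, a minimal sketch of this classical baseline using torchvision's `gaussian_blur`; the kernel size and sigma here are illustrative choices, not the exact values used for the figures.

```python
import torchvision.transforms.functional as TF

def gaussian_denoise(noisy_im, kernel_size=5, sigma=2.0):
    """Classical 'denoising': simply blur away the high-frequency noise."""
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```
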

Original t=250 Denoised t=250

Original vs Denoised (t=250)

Original t=500 Denoised t=500

Original vs Denoised (t=500)

Original t=750 Denoised t=750

Original vs Denoised (t=750)

One Step Denoising

In this part, I used a pretrained UNet to perform one-step denoising with the prompt "a high quality photo". For each timestep, we visualize the original image, noisy version, and denoised result.
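
A minimal sketch of the one-step estimate: given the UNet's predicted noise ε̂ (here `eps_hat`, assumed to be already extracted from the model output), we solve the forward equation for x₀.

```python
def one_step_denoise(x_t, t, eps_hat, alphas_cumprod):
    """Recover an x_0 estimate from x_t and the UNet's predicted noise."""
    a_bar = alphas_cumprod[t]
    return (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
```
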

Original t=250 Noisy t=250 UNet Denoised t=250

t=250: Original → Noisy → Denoised

Original t=500 Noisy t=500 UNet Denoised t=500

t=500: Original → Noisy → Denoised

Original t=750 Noisy t=750 UNet Denoised t=750

t=750: Original → Noisy → Denoised

Iterative Denoising

While one-step denoising is an improvement over Gaussian blur, the quality still degrades at higher noise levels. Diffusion models are designed to work iteratively, in principle starting from pure noise x₁₀₀₀ and progressively denoising to x₀. However, running 1000 denoising steps is computationally expensive.

We implement a more efficient approach using strided timesteps (step size 30, from 990 to 0), where each step follows the interpolation formula:

\[ x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t'})}{1-\bar{\alpha}_t}x_t + v_\sigma \]

Here, x_t is the current noisy image, x_t' is the less noisy image at the next (smaller) strided timestep t', α_t = ᾱ_t/ᾱ_t', and β_t = 1 − α_t. The update effectively interpolates between the clean-image estimate and the current noisy image, with the variance term v_σ predicted by the model. Starting from the 10th strided timestep (t = 690), we show the iterative denoising results and compare them with the one-step and Gaussian-blur approaches.
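
Below is a minimal sketch of one such update, reusing the `forward`/`one_step_denoise` helpers sketched earlier; passing the variance term v_σ in as an argument rather than computing it from the model output is a simplifying assumption.

```python
def iterative_denoise_step(x_t, t, t_prime, eps_hat, alphas_cumprod, v_sigma):
    """One strided denoising step from timestep t down to t' < t."""
    a_bar_t, a_bar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha_t = a_bar_t / a_bar_tp                                 # alpha_t = abar_t / abar_t'
    beta_t = 1 - alpha_t
    x0_hat = one_step_denoise(x_t, t, eps_hat, alphas_cumprod)   # clean-image estimate
    return (a_bar_tp.sqrt() * beta_t / (1 - a_bar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t \
         + v_sigma
```
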

Iterative Denoising Progress

Noisy Campanile at t=690 Noisy Campanile at t=540 Noisy Campanile at t=390 Noisy Campanile at t=240 Noisy Campanile at t=90

Progressive denoising steps t = (690, 540, 390, 240, 90)

Method Comparison

Iterative Denoised One-step Denoised Gaussian Blurred

Iterative vs One-step vs Gaussian

Diffusion Model Sampling

While previous sections focused on denoising given images, diffusion models can also generate images from scratch. Starting with pure random noise (i_start = 0), we can apply our iterative denoising process to gradually form new images. Using the prompt "a high quality photo", we demonstrate the model's ability to generate new images from random initialization.
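
As a short sketch, generation reduces to the same loop run from pure noise; this assumes an `iterative_denoise(x, i_start, prompt)` helper built from the step above and the 64×64 stage-1 resolution.

```python
import torch

x = torch.randn(1, 3, 64, 64)     # pure noise at the stage-1 resolution (assumed 64x64)
sample = iterative_denoise(x, i_start=0, prompt="a high quality photo")
```
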

Sample 1

Sample 1

Sample 2

Sample 2

Sample 3

Sample 3

Sample 4

Sample 4

Sample 5

Sample 5

Classifier-Free Guidance

To improve generation quality, we implement Classifier-Free Guidance (CFG), which combines conditional and unconditional noise estimates. The noise estimate is computed as:

\[ \epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u) \]

Here, ϵᵤ is the unconditional noise estimate (using an empty prompt ""), ϵ_c is the conditional estimate (using our desired prompt), and γ controls the guidance strength. When γ > 1, the model produces higher quality but less diverse images. We demonstrate results using γ = 7 with the conditional prompt "a high quality photo".
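
A minimal sketch of the CFG combination; the UNet call and the prompt embeddings (`cond_emb` for the target prompt, `uncond_emb` for the empty prompt) are simplified stand-ins.

```python
def cfg_noise_estimate(unet, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    """Combine conditional and unconditional noise estimates with guidance scale gamma."""
    eps_c = unet(x_t, t, cond_emb)      # conditional estimate (desired prompt)
    eps_u = unet(x_t, t, uncond_emb)    # unconditional estimate (empty prompt "")
    return eps_u + gamma * (eps_c - eps_u)
```
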

CFG Sample 1

CFG Sample 1

CFG Sample 2

CFG Sample 2

CFG Sample 3

CFG Sample 3

CFG Sample 4

CFG Sample 4

CFG Sample 5

CFG Sample 5

Image-to-image Translation

In this part, we use the diffusion model to edit existing images through controlled noising and denoising. When we add noise to an image and then denoise it, the model must "hallucinate" details during reconstruction, effectively projecting the image back onto the natural image manifold. The amount of noise added controls how far the output can deviate from the input.

Following the SDEdit algorithm, we apply varying levels of noise and use the prompt "a high quality photo" to guide the reconstruction. A higher i_start corresponds to a smaller starting timestep (less added noise), so more of the original image is preserved; a lower i_start allows the output to deviate further from the input.
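
A sketch of the SDEdit loop under the same assumptions as the earlier snippets (`forward`, `iterative_denoise`, and the `strided_timesteps` list are assumed helpers):

```python
def sdedit(im, i_start, prompt, strided_timesteps, alphas_cumprod):
    """Noise the input to the chosen starting timestep, then denoise it back."""
    t_start = strided_timesteps[i_start]            # larger i_start -> smaller t -> less noise
    x_t = forward(im, t_start, alphas_cumprod)      # forward-noise the input image
    return iterative_denoise(x_t, i_start=i_start, prompt=prompt)
```
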

Test Image Edits

Step 1

i_start = 1

Step 3

i_start = 3

Step 5

i_start = 5

Step 7

i_start = 7

Step 10

i_start = 10

Step 20

i_start = 20

Apple Image Edits

Custom 1 Step 1

i_start = 1

Custom 1 Step 3

i_start = 3

Custom 1 Step 5

i_start = 5

Custom 1 Step 7

i_start = 7

Custom 1 Step 10

i_start = 10

Custom 1 Step 20

i_start = 20

Orange Image Edits

Custom 2 Step 1

i_start = 1

Custom 2 Step 3

i_start = 3

Custom 2 Step 5

i_start = 5

Custom 2 Step 7

i_start = 7

Custom 2 Step 10

i_start = 10

Custom 2 Step 20

i_start = 20

Editing Hand-Drawn and Web Images

The image-to-image translation technique is particularly effective when starting with non-realistic images like sketches or paintings. The diffusion model helps project these abstract representations onto the natural image manifold, creating photorealistic interpretations while maintaining key elements.

Web Image Edit

Web Original

Original

Web Step 1

i_start = 1

Web Step 3

i_start = 3

Web Step 5

i_start = 5

Web Step 7

i_start = 7

Web Step 10

i_start = 10

Web Step 20

i_start = 20

Hand-Drawn Image 1

Hand 1 Original

Original

Hand 1 Step 1

i_start = 1

Hand 1 Step 3

i_start = 3

Hand 1 Step 5

i_start = 5

Hand 1 Step 7

i_start = 7

Hand 1 Step 10

i_start = 10

Hand 1 Step 20

i_start = 20

Hand-Drawn Image 2

Hand 2 Original

Original

Hand 2 Step 1

i_start = 1

Hand 2 Step 3

i_start = 3

Hand 2 Step 5

i_start = 5

Hand 2 Step 7

i_start = 7

Hand 2 Step 10

i_start = 10

Hand 2 Step 20

i_start = 20

Inpainting

We implement inpainting following the RePaint paper, where we selectively regenerate masked regions while preserving the rest of the image. Given an image x_orig and a binary mask m, at each denoising step we combine:

\[ x_t \leftarrow \mathbf{m}x_t + (1-\mathbf{m})\text{forward}(x_{\text{orig}}, t) \]

This ensures that unmasked regions (m=0) maintain their original content with appropriate noise levels, while masked regions (m=1) are generated through the diffusion process.
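
A sketch of this per-step combination, reusing the `forward` helper from before; `mask` is the binary tensor m, with 1 over the region to regenerate.

```python
def repaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep generated content inside the mask; outside it, paste the re-noised original."""
    return mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```
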

Test Image Inpainting

Original

Original Image

Mask

Mask

Result

Inpainted Result

Custom Image 1

Custom 1 Original

Original Image

Custom 1 Mask

Mask

Custom 1 Result

Inpainted Result

Custom Image 2

Custom 2 Original

Original Image

Custom 2 Mask

Mask

Custom 2 Result

Inpainted Result

Text-Conditioned Image-to-image Translation

Building on our previous image-to-image translation, we now add text conditioning to guide the projection process. Instead of using a neutral prompt, we use specific text prompts (in this example "a rocket ship") to influence the reconstruction direction.

Test Image

Step 1

i_start = 1

Step 3

i_start = 3

Step 5

i_start = 5

Step 7

i_start = 7

Step 10

i_start = 10

Step 20

i_start = 20

Custom Image 1

Custom 1 Step 1

i_start = 1

Custom 1 Step 3

i_start = 3

Custom 1 Step 5

i_start = 5

Custom 1 Step 7

i_start = 7

Custom 1 Step 10

i_start = 10

Custom 1 Step 20

i_start = 20

Custom Image 2

Custom 2 Step 1

i_start = 1

Custom 2 Step 3

i_start = 3

Custom 2 Step 5

i_start = 5

Custom 2 Step 7

i_start = 7

Custom 2 Step 10

i_start = 10

Custom 2 Step 20

i_start = 20

Visual Anagrams

We create optical illusions that reveal different images when viewed upright versus upside down. The technique combines noise estimates from two different prompts:

\[ \begin{aligned} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \\ \epsilon &= (\epsilon_1 + \epsilon_2)/2 \end{aligned} \]

By averaging the noise estimates from normal and flipped orientations, we create images that blend two different interpretations.
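
A sketch of the combined estimate for one denoising step; the prompt embeddings `p1_emb`/`p2_emb` and the flip along the image height axis are assumptions.

```python
import torch

def anagram_noise_estimate(unet, x_t, t, p1_emb, p2_emb):
    """Average the upright estimate for prompt 1 with the flipped estimate for prompt 2."""
    eps_1 = unet(x_t, t, p1_emb)
    eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, p2_emb), dims=[-2])
    return (eps_1 + eps_2) / 2
```
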

Anagram 1: Old Man ↔ Campfire

Upright Old Man

"an oil painting of an old man"

Flipped Campfire

"an oil painting of people around a campfire"

Anagram 2: Mountain Village ↔ College Students

Upright Mountain

"an oil painting of a snowy mountain village"

Flipped Students

"an oil painting of college students"

Anagram 3: Skull ↔ Rose

Upright Skull

"a lithograph of a skull"

Flipped Rose

"a lithograph of a rose"

Hybrid Images

We implement Factorized Diffusion to create hybrid images that appear different when viewed from different distances. The technique combines low-frequency components from one prompt with high-frequency components from another using the following algorithm:

\[ \begin{aligned} \epsilon_1 &= \text{UNet}(x_t, t, p_1) \\ \epsilon_2 &= \text{UNet}(x_t, t, p_2) \\ \epsilon &= f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2) \end{aligned} \]

Using Gaussian blur (kernel size 33, sigma 2) for frequency separation, we create images that reveal different content at different viewing distances.
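
A sketch of the frequency split, using the same Gaussian blur as a low-pass filter (kernel size 33, sigma 2); the UNet call and prompt embeddings are simplified stand-ins as before.

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, p1_emb, p2_emb, kernel_size=33, sigma=2.0):
    """Low frequencies from prompt 1 (visible from afar), high frequencies from prompt 2 (up close)."""
    eps_1 = unet(x_t, t, p1_emb)
    eps_2 = unet(x_t, t, p2_emb)
    low = TF.gaussian_blur(eps_1, kernel_size=kernel_size, sigma=sigma)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```
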

Skull/Waterfall Hybrid

Far: "a lithograph of a skull" / Close: "a lithograph of waterfalls"

Car/Campfire Hybrid

Far: "an oil painting of a car" / Close: "an oil painting of people around a campfire" (Think of the two white blocks in the middle right of the image as car lights.)

Mountain/Baby Hybrid

Far: "a photo of a mountain" / Close: "a photo of a baby crying"

Part B: Training Diffusion Models

Unconditional UNet

Implementation and training of a basic unconditional UNet for denoising MNIST digits. The training objective is to denoise images with σ = 0.5 noise level, optimizing:

\[ L = \mathbb{E}_{z,x}\|D_\theta(z) - x\|^2 \]

Training Parameters:

Images are noised on the fly when fetched from the dataloader, so each epoch sees fresh noise patterns, which improves generalization.
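
A sketch of one training pass under these choices; `denoiser`, `optimizer`, and `dataloader` are assumed to be defined elsewhere.

```python
import torch
import torch.nn.functional as F

sigma = 0.5
for x, _ in dataloader:                      # clean MNIST digits
    z = x + sigma * torch.randn_like(x)      # noise added on the fly, fresh each epoch
    loss = F.mse_loss(denoiser(z), x)        # L = E || D_theta(z) - x ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
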

UNet Architecture

UNet Architecture

Unconditional UNet architecture

UNet Operations

Standard UNet Operations

Noising Process Visualization

Noising Process

Visualization of noising process with σ=[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]

Training Progress

Training Loss

Training loss curve over iterations

Sample Results

Epoch 1 Results

Results after first epoch

Epoch 5 Results

Results after fifth epoch

Out-of-Distribution Testing

Out of Distribution Results

Results on varying noise levels σ=[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]

Time-Conditioned UNet

Extension of the UNet with time conditioning for iterative denoising. The model is conditioned on the timestep t through fully connected blocks (FCBlocks) that inject time information at multiple resolutions. By modulating the unflatten and upsampling operations with learned time embeddings, the network learns to handle varying noise levels and can perform iterative denoising.

Training Objective:

\[ L = \mathbb{E}_{\epsilon,x_0,t}\|\epsilon_\theta(x_t, t) - \epsilon\|^2 \]
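
A sketch of one training step for this objective; the names (`unet`, `optimizer`, `alphas_cumprod`), the number of diffusion steps T, and the normalized-time input are assumptions.

```python
import torch
import torch.nn.functional as F

T = 300                                              # number of diffusion steps (assumed)
for x0, _ in dataloader:
    t = torch.randint(0, T, (x0.shape[0],))          # a random timestep per image
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    loss = F.mse_loss(unet(x_t, t / T), eps)         # predict the added noise
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```
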

Training Algorithm

Time UNet Training

Sampling Algorithm

Time UNet Sampling

Training Loss

Epoch 5 Results

Epoch 20 Results

Class-Conditioned UNet

Extension of the time-conditioned UNet by adding class conditioning. The model receives both time t and class c information through parallel FCBlocks. The class input is converted to one-hot vectors, which are zeroed with 10% probability so the model also trains unconditionally. As with time conditioning, both the unflatten and upsampling operations are modulated with learned time and class embeddings, which enables generation of specific digits.

The model uses the same training parameters as the time-conditioned UNet.
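
A sketch of how the class input can be prepared, matching the description above (one-hot vectors, zeroed with 10% probability for unconditional training); the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot class vectors, randomly zeroed so the model also learns the unconditional case."""
    c = F.one_hot(labels, num_classes).float()
    drop = (torch.rand(labels.shape[0], 1) < p_uncond).float()
    return c * (1 - drop)                    # zeroed rows act as the null class
```

At sampling time, the zeroed class vector plays the same role as the empty prompt in classifier-free guidance.
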

Training Algorithm

Class UNet Training

Sampling Algorithm

Class UNet Sampling

Class-Conditioned Training Loss

Class-Conditioned Samples

Class-Conditioned Samples