Using random seed 42, we explore DeepFloyd IF's sampling capabilities with varying numbers of inference steps. Here we show results for the prompt "a rocket ship" with 20 vs. 50 inference steps. As the number of inference steps increases, the image becomes more refined and shows more detail, even in the background.
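Below is a minimal sketch of how the stage-1 pipeline can be driven through Hugging Face diffusers; the checkpoint ID, precision flags, and argument names reflect my setup and may differ in others.

```python
import torch
from diffusers import DiffusionPipeline

# Stage-1 (64x64) pipeline of DeepFloyd IF; model ID and precision are illustrative.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

prompt_embeds, negative_embeds = stage_1.encode_prompt("a rocket ship")

# Trade speed for detail by varying the number of inference steps (20 vs. 50),
# reseeding so both runs start from the same noise.
for steps in (20, 50):
    generator = torch.Generator("cuda").manual_seed(42)
    image = stage_1(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        num_inference_steps=steps,
        generator=generator,
    ).images[0]
```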
20 inference steps
50 inference steps
This part implements the forward diffusion process, which progressively adds noise to a clean image x₀. It follows the equation x_t = √(ᾱ_t) · x₀ + √(1 − ᾱ_t) · ε, where ε ~ N(0, I) and ᾱ_t is the cumulative product of the noise schedule at timestep t.
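A minimal sketch of this forward step, assuming alphas_cumprod is the ᾱ schedule tensor taken from the pipeline's scheduler:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image to timestep t: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)  # eps ~ N(0, I)
    return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps
```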
Noise level t=250
Noise level t=500
Noise level t=750
In this part, I tried denoising with classical Gaussian blur filtering; the results show its limitations at high noise levels.
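The baseline itself is just a blur; the kernel size and sigma below are illustrative:

```python
import torchvision.transforms.functional as TF

# Classical baseline: a Gaussian blur removes some noise but also fine detail.
denoised_blur = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```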
Original vs Denoised (t=250)
Original vs Denoised (t=500)
Original vs Denoised (t=750)
In this part, I used a pretrained UNet to perform one-step denoising with the prompt "a high quality photo". For each timestep, we visualize the original image, noisy version, and denoised result.
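A sketch of the one-step estimate, reusing stage_1, prompt_embeds, and alphas_cumprod from above and assuming the IF UNet stacks its noise and variance predictions along the channel dimension:

```python
# Estimate the noise with the pretrained UNet, then solve the forward equation for x_0.
with torch.no_grad():
    noise_est = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
abar_t = alphas_cumprod[t]
x0_est = (x_t - (1 - abar_t).sqrt() * noise_est) / abar_t.sqrt()
```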
t=250: Original → Noisy → Denoised
t=500: Original → Noisy → Denoised
t=750: Original → Noisy → Denoised
While one-step denoising improves on Gaussian blur, the quality still degrades at higher noise levels. Diffusion models are designed to denoise iteratively, in theory starting from pure noise x₁₀₀₀ and progressively denoising down to x₀. However, running 1000 denoising steps is computationally expensive.
We implement a more efficient approach using strided timesteps (step size 30, from 990 down to 0), where each step follows the interpolation formula x_t' = (√(ᾱ_t') β_t / (1 − ᾱ_t)) · x₀ + (√(α_t) (1 − ᾱ_t') / (1 − ᾱ_t)) · x_t + v_σ.
Here, x_t is the current noisy image, x_t' is the less noisy target, x₀ is the current estimate of the clean image, α_t = ᾱ_t/ᾱ_t', and β_t = 1 − α_t. This step effectively interpolates between the clean-image estimate and the current noisy image, with the variance term v_σ predicted by the model. Starting from i_start = 10 (t = 690), we show the iterative denoising results compared to the one-step and Gaussian approaches.
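A sketch of the full loop (stage_1 is assumed to be in scope; the learned variance term v_σ is omitted for brevity):

```python
def iterative_denoise(x, i_start, strided_timesteps, alphas_cumprod, prompt_embeds):
    """Denoise x from strided_timesteps[i_start] down to t = 0 (sketch)."""
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prime = strided_timesteps[i], strided_timesteps[i + 1]
        abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha_t = abar_t / abar_tp
        beta_t = 1 - alpha_t

        with torch.no_grad():
            noise_est = stage_1.unet(x, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
        x0_est = (x - (1 - abar_t).sqrt() * noise_est) / abar_t.sqrt()  # clean-image estimate

        # Interpolate between the clean-image estimate and the current noisy image
        # (the learned variance term is omitted in this sketch).
        x = ((abar_tp.sqrt() * beta_t / (1 - abar_t)) * x0_est
             + (alpha_t.sqrt() * (1 - abar_tp) / (1 - abar_t)) * x)
    return x
```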
Progressive denoising steps t = (690, 540, 390, 240, 90)
Iterative vs One-step vs Gaussian
While previous sections focused on denoising given images, diffusion models can also generate images from scratch. Starting with pure random noise (i_start = 0), we can apply our iterative denoising process to gradually form new images. Using the prompt "a high quality photo", we demonstrate the model's ability to generate new images from random initialization.
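Conceptually this is just the previous loop started from pure noise (shapes assume the 64×64 stage-1 model):

```python
# Sampling from scratch: start from Gaussian noise and denoise from the first strided timestep.
x = torch.randn(1, 3, 64, 64, device="cuda", dtype=torch.float16)
sample = iterative_denoise(x, 0, strided_timesteps, alphas_cumprod, prompt_embeds)
```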
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
To improve generation quality, we implement Classifier-Free Guidance (CFG), which combines conditional and unconditional noise estimates. The noise estimate is computed as ϵ = ϵᵤ + γ(ϵ_c − ϵᵤ).
Here, ϵᵤ is the unconditional noise estimate (using an empty prompt ""), ϵ_c is the conditional estimate (using our desired prompt), and γ controls the guidance strength. When γ > 1, the model produces higher quality but less diverse images. We demonstrate results using γ = 7 with the conditional prompt "a high quality photo".
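A sketch of the CFG noise estimate, where null_prompt_embeds is assumed to be the encoding of the empty prompt "":

```python
# Classifier-free guidance: one conditional and one unconditional UNet pass,
# then extrapolate past the conditional estimate with gamma = 7.
with torch.no_grad():
    eps_c = stage_1.unet(x, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    eps_u = stage_1.unet(x, t, encoder_hidden_states=null_prompt_embeds).sample[:, :3]
gamma = 7
eps = eps_u + gamma * (eps_c - eps_u)
```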
CFG Sample 1
CFG Sample 2
CFG Sample 3
CFG Sample 4
CFG Sample 5
In this part, we use the diffusion model to edit existing images through controlled noising and denoising. When we add noise to an image and then denoise it, the model must "hallucinate" to reconstruct details, effectively projecting the image back onto the natural image manifold. The amount of noise added controls how far the output can deviate from the input.
Following the SDEdit algorithm, we apply varying levels of noise and use the prompt "a high quality photo" to guide the reconstruction. Higher values of i_start add less noise and preserve more of the original image; lower values allow larger deviations.
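A sketch of the SDEdit step, reusing the forward() and iterative_denoise() sketches above:

```python
def sdedit(im, i_start, strided_timesteps, alphas_cumprod, prompt_embeds):
    """Noise the input to strided_timesteps[i_start], then denoise it back (sketch)."""
    t = strided_timesteps[i_start]
    x_t = forward(im, t, alphas_cumprod)
    return iterative_denoise(x_t, i_start, strided_timesteps, alphas_cumprod, prompt_embeds)
```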
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
The image-to-image translation technique is particularly effective when starting with non-realistic images like sketches or paintings. The diffusion model helps project these abstract representations onto the natural image manifold, creating photorealistic interpretations while maintaining key elements.
Original
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
Original
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
Original
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
We implement inpainting following the RePaint paper, where we selectively regenerate masked regions while preserving the rest of the image. Given an image x_orig and a binary mask m, at each denoising step we combine x_t ← m ⊙ x_t + (1 − m) ⊙ forward(x_orig, t).
This ensures that unmasked regions (m=0) maintain their original content with appropriate noise levels, while masked regions (m=1) are generated through the diffusion process.
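In code this amounts to one extra line inside the denoising loop sketched earlier (m is broadcast over channels; t_prime is the timestep of the freshly computed x):

```python
# Force the region outside the mask back to the appropriately noised original image.
x = m * x + (1 - m) * forward(x_orig, t_prime, alphas_cumprod)
```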
Original Image
Mask
Inpainted Result
Original Image
Mask
Inpainted Result
Original Image
Mask
Inpainted Result
Building on our previous image-to-image translation, we now add text conditioning to guide the projection process. Instead of using a neutral prompt, we use specific text prompts (in this example "a rocket ship") to influence the reconstruction direction.
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
i_start = 1
i_start = 3
i_start = 5
i_start = 7
i_start = 10
i_start = 20
We create optical illusions that reveal different images when viewed upright versus upside down. The technique combines noise estimates from two different prompts: ϵ₁ = UNet(x_t, t, p₁), ϵ₂ = flip(UNet(flip(x_t), t, p₂)), and ϵ = (ϵ₁ + ϵ₂)/2.
By averaging the noise estimates from normal and flipped orientations, we create images that blend two different interpretations.
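A sketch of one denoising step, where embeds_upright and embeds_flipped are the encoded prompts for the two orientations:

```python
# Estimate noise for prompt 1 on the upright image and for prompt 2 on the
# vertically flipped image, then average both estimates in the upright frame.
with torch.no_grad():
    eps_1 = stage_1.unet(x, t, encoder_hidden_states=embeds_upright).sample[:, :3]
    eps_2 = stage_1.unet(torch.flip(x, dims=[-2]), t,
                         encoder_hidden_states=embeds_flipped).sample[:, :3]
eps = (eps_1 + torch.flip(eps_2, dims=[-2])) / 2
```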
"an oil painting of an old man"
"an oil painting of people around a campfire"
"an oil painting of a snowy mountain village"
"an oil painting of college students"
"a lithograph of a skull"
"a lithograph of a rose"
We implement Factorized Diffusion to create hybrid images that appear different when viewed from different distances. The technique combines low-frequency components from one prompt with high-frequency components from another: ϵ₁ = UNet(x_t, t, p₁), ϵ₂ = UNet(x_t, t, p₂), and ϵ = f_lowpass(ϵ₁) + f_highpass(ϵ₂).
Using Gaussian blur (kernel size 33, sigma 2) for frequency separation, we create images that reveal different content at different viewing distances.
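A sketch of the combined noise estimate, with embeds_far and embeds_close the encoded prompts for the two viewing distances:

```python
import torchvision.transforms.functional as TF

# Low frequencies follow the "far" prompt, high frequencies the "close" prompt.
with torch.no_grad():
    eps_far = stage_1.unet(x, t, encoder_hidden_states=embeds_far).sample[:, :3]
    eps_close = stage_1.unet(x, t, encoder_hidden_states=embeds_close).sample[:, :3]
low = TF.gaussian_blur(eps_far, kernel_size=33, sigma=2)
high = eps_close - TF.gaussian_blur(eps_close, kernel_size=33, sigma=2)
eps = low + high
```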
Far: "a lithograph of a skull" / Close: "a lithograph of waterfalls"
Far: "an oil painting of a car" / Close: "an oil painting of people around a campfire" (Think of the two white blocks in the middle right of the image as car lights.)
Far: "a photo of a mountain" / Close: "a photo of a baby crying"
Implementation and training of a basic unconditional UNet for denoising MNIST digits. The training objective is to denoise images corrupted with σ = 0.5 noise, optimizing L = E_{z,x} ‖D_θ(z) − x‖², where z = x + σε and ε ~ N(0, I).
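A sketch of the training loop, assuming dataloader, unet, and optimizer are already set up:

```python
import torch
import torch.nn.functional as F

sigma = 0.5
for x, _ in dataloader:                  # MNIST batches; labels unused here
    z = x + sigma * torch.randn_like(x)  # noisy input z = x + sigma * eps
    loss = F.mse_loss(unet(z), x)        # L = E || D_theta(z) - x ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```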
Training Parameters:
Unconditional UNet architecture
Standard UNet Operations
Visualization of noising process with σ=[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
Training loss curve over iterations
Results after first epoch
Results after fifth epoch
Results on varying noise levels σ=[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
Extension of the UNet with time conditioning for iterative denoising. The model is conditioned on the timestep t through fully connected blocks that inject time information at multiple resolutions. By conditioning the unflatten and upsampling operations with learned time embeddings, the network learns to handle varying noise levels and can perform iterative denoising.
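A sketch of the conditioning path; the additive modulation below is one reasonable way to inject the embeddings, and fc1_t, fc2_t, unflatten, up1, and the layer widths are names and values I use for illustration:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP mapping a conditioning value to a per-channel embedding (sketch)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return self.net(x)

fc1_t = FCBlock(1, 128)  # output width must match the channels of `unflatten`
fc2_t = FCBlock(1, 64)   # output width must match the channels of `up1`

# Inside the UNet forward pass (sketch): t is the timestep normalized to [0, 1];
# the embeddings broadcast over the spatial dimensions of the feature maps.
t1 = fc1_t(t.view(-1, 1))
t2 = fc2_t(t.view(-1, 1))
unflatten = unflatten + t1[..., None, None]
up1 = up1 + t2[..., None, None]
```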
Training Parameters:
Extension of the time-conditioned UNet by adding class conditioning. The model receives both time t and class c information through parallel FCBlocks. The class input is converted to one-hot vectors and dropped with 10% probability so the model is also trained unconditionally. As with time conditioning, both the unflatten and upsampling operations are modulated with learned time and class embeddings, which enables generation of specific digits.
The model uses the same training parameters as the time-conditioned UNet.
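A sketch of the class-conditioning path; gating the features multiplicatively with the class embedding while shifting them with the time embedding is one reasonable choice consistent with the description above, and the helper names (fc1_c, fc2_c, fc1_t, fc2_t) are illustrative:

```python
import torch
import torch.nn.functional as F

def class_conditioning(labels, num_classes=10, p_uncond=0.1):
    """One-hot class vectors, zeroed with probability p_uncond for unconditional training."""
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], 1, device=labels.device) > p_uncond).float()
    return c * keep

# Inside the UNet forward pass (sketch): class embeddings gate the feature maps,
# time embeddings shift them; fc*_c and fc*_t are FCBlocks with widths matching
# the corresponding feature maps.
c1, c2 = fc1_c(c), fc2_c(c)
t1, t2 = fc1_t(t.view(-1, 1)), fc2_t(t.view(-1, 1))
unflatten = c1[..., None, None] * unflatten + t1[..., None, None]
up1 = c2[..., None, None] * up1 + t2[..., None, None]
```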