Mason Wang

Diffusion Models - UDL Notes

Introduction

To Do: Write Introduction

Diffusion models can be interpreted as hierarchical VAEs. The encoder in this case has no learnable parameters and gradually adds noise to the data; the decoder attempts to reverse this process.

DDPM - Noise Schedule

Let \(\mathbf{X}\) be the random variable representing a data sample from the intended distribution, and \(\mathbf{x}\) be a realization of it. Also, define

\[\mathbf{z}_0 = \mathbf{x}\]

We have latent variables \(\mathbf{z}_1, \ldots, \mathbf{z}_T\) corresponding to noise levels \(1, \ldots, T\). These latent variables are given by:

\[\mathbf{z_t} = \sqrt{1 - \beta_t }\mathbf{z_{t-1}} + \sqrt{\beta_t}\mathbf{\epsilon}_t\]

where \(\mathbf{\epsilon}_t\sim \mathcal{N}(0,I)\).
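As a quick illustration, here is a minimal NumPy sketch of this forward (noising) step; the linear `betas` schedule and the toy two-dimensional sample are assumptions, not something fixed by these notes.

```python
import numpy as np

def forward_step(z_prev, beta_t, rng):
    """One forward noising step: z_t = sqrt(1 - beta_t) * z_{t-1} + sqrt(beta_t) * eps_t."""
    eps_t = rng.standard_normal(z_prev.shape)  # eps_t ~ N(0, I)
    return np.sqrt(1.0 - beta_t) * z_prev + np.sqrt(beta_t) * eps_t

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule (as in the DDPM paper)
x = rng.standard_normal(2)          # stand-in data sample
z = x.copy()                        # z_0 = x
for t in range(T):
    z = forward_step(z, betas[t], rng)
# After many steps, z is approximately distributed as N(0, I), independent of x.
```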

Marginal Distributions

Computing \(q(\mathbf{z}_t \mid \mathbf{x})\) is like adding noise at a level corresponding to \(t\) to our data. The formulas above allow us to generate these samples by iteratively adding noise. However, we can actually directly compute the marginal distributions in closed form:

\[q(\mathbf{z}_t \mid \mathbf{x})\]

(where we are marginalizing over \(\mathbf{z}_1,\ldots, \mathbf{z}_{t-1}\)).

We can do this by working inductively.

To Do: Insert Inductive Proof.
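As a placeholder, a sketch of the standard inductive argument: the base case \(t = 1\) is immediate from the update rule. For the inductive step, suppose \(\mathbf{z}_{t-1} = \sqrt{\prod_{i=1}^{t-1}(1-\beta_i)}\,\mathbf{x} + \sqrt{1 - \prod_{i=1}^{t-1}(1-\beta_i)}\,\mathbf{\epsilon}'\) with \(\mathbf{\epsilon}' \sim \mathcal{N}(0, \mathbf{I})\). Substituting into the update rule,

\[\mathbf{z}_t = \sqrt{1-\beta_t}\left(\sqrt{\prod_{i=1}^{t-1}(1-\beta_i)}\,\mathbf{x} + \sqrt{1 - \prod_{i=1}^{t-1}(1-\beta_i)}\,\mathbf{\epsilon}'\right) + \sqrt{\beta_t}\,\mathbf{\epsilon}_t\]

The two noise terms are independent zero-mean Gaussians, so their sum is Gaussian with variance \((1-\beta_t)\bigl(1 - \prod_{i=1}^{t-1}(1-\beta_i)\bigr) + \beta_t = 1 - \prod_{i=1}^{t}(1-\beta_i)\). Hence \(\mathbf{z}_t = \sqrt{\prod_{i=1}^{t}(1-\beta_i)}\,\mathbf{x} + \sqrt{1 - \prod_{i=1}^{t}(1-\beta_i)}\,\mathbf{\epsilon}\) for a fresh \(\mathbf{\epsilon} \sim \mathcal{N}(0,\mathbf{I})\), which is the same form with \(t-1\) replaced by \(t\).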

To summarize, we define

\[\alpha_t = \prod_{i=1}^t (1 - \beta_i)\]

And discover that

\[q(\mathbf{z}_t \mid \mathbf{x}) = \mathcal{N}(\sqrt{\alpha_t} \mathbf{x}, (1-\alpha_t) \mathbf{I})\]
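This means we can jump straight to any noise level in a single step. A minimal NumPy sketch (the schedule and array shapes are assumptions):

```python
import numpy as np

def sample_zt(x, betas, t, rng):
    """Sample z_t ~ q(z_t | x) = N(sqrt(alpha_t) * x, (1 - alpha_t) * I) in one shot,
    where alpha_t = prod_{i=1}^{t} (1 - beta_i)."""
    alpha_t = np.prod(1.0 - betas[:t])
    eps = rng.standard_normal(x.shape)
    return np.sqrt(alpha_t) * x + np.sqrt(1.0 - alpha_t) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
x = rng.standard_normal(2)             # stand-in data sample
z_500 = sample_zt(x, betas, t=500, rng=rng)  # noise level 500, without iterating
```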

DDPM - Derivation of Objective

We would like to maximize the log probability of our data under our generative model. We show that an evidence lower bound (ELBO) on this objective can be expressed as a weighted sum of MSE losses between a sample \(\mathbf{z}_{t-1} \sim q(\mathbf{z}_{t-1} \mid \mathbf{x})\) and a prediction of that sample obtained from a learned function of \(\mathbf{z}_{t}\). In other words, the objective is to predict \(\mathbf{z}_{t-1}\) from \(\mathbf{z}_{t}\), i.e., to denoise the data by one step.

The full derivation is in the DDPM Math note.
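For reference, the standard decomposition (as in the original DDPM paper, not reproduced from the linked note) is

\[\log p(\mathbf{x}) \geq \mathbb{E}_{q}\bigl[\log p_\theta(\mathbf{x} \mid \mathbf{z}_1)\bigr] - \sum_{t=2}^{T} \mathbb{E}_{q}\Bigl[D_{\mathrm{KL}}\bigl(q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x}) \,\|\, p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t)\bigr)\Bigr] - D_{\mathrm{KL}}\bigl(q(\mathbf{z}_T \mid \mathbf{x}) \,\|\, p(\mathbf{z}_T)\bigr)\]

Because every distribution in the KL terms is Gaussian with fixed covariance, each term reduces to a (weighted) squared error between means, which gives the weighted sum of MSE losses described above.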

In practice, we express the function \(\mathbf{f}\) as a linear combination of \(\mathbf{z}_{t}\) and a noise term \(\mathbf{\epsilon}_t \sim \mathcal{N}(0,\mathbf{I})\).

We use a neural network \(\mathbf{g_\theta}\) to predict the noise term \(\mathbf{\epsilon}_t \sim \mathcal{N}(0,\mathbf{I})\). This leads to a reparametrization, and a method of training and inference for DDPMs.
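Here is a minimal PyTorch-style sketch of the resulting training objective; the toy model, the data shapes, and the unweighted ("simple") loss are assumptions for illustration, not the exact recipe in these notes.

```python
import torch

def ddpm_training_loss(g_theta, x, alphas):
    """Reparametrized DDPM loss: sample t and eps, form z_t = sqrt(alpha_t) x + sqrt(1 - alpha_t) eps,
    and regress the network's noise prediction onto eps."""
    B, T = x.shape[0], alphas.shape[0]
    t = torch.randint(0, T, (B,))                        # random timestep per example
    alpha_t = alphas[t].view(B, *([1] * (x.dim() - 1)))  # broadcast over data dims
    eps = torch.randn_like(x)                            # eps ~ N(0, I)
    z_t = torch.sqrt(alpha_t) * x + torch.sqrt(1.0 - alpha_t) * eps
    eps_pred = g_theta(z_t, t)                           # network predicts the noise
    return torch.nn.functional.mse_loss(eps_pred, eps)

# Toy usage (hypothetical tiny model on 2-D data):
class TinyEpsModel(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Linear(dim + 1, dim)
    def forward(self, z, t):
        t_feat = t.float().unsqueeze(-1) / 1000.0        # crude timestep embedding
        return self.net(torch.cat([z, t_feat], dim=-1))

betas = torch.linspace(1e-4, 0.02, 1000)
alphas = torch.cumprod(1.0 - betas, dim=0)
model = TinyEpsModel(dim=2)
loss = ddpm_training_loss(model, torch.randn(16, 2), alphas)
loss.backward()
```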

Why Does Diffusion Work?

Modeling Multi-Modal Distributions

Generative modeling is about turning simple distributions (noise) into more complex ones (data).

Plinko

The game Plinko is a good analogy: each peg deflects the ball by a small random step, and the final distribution over slots emerges from the accumulation of many such simple random steps.

When we extend this analogy to diffusion with an infinite number of timesteps, we get something like Brownian motion.
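A tiny, purely illustrative NumPy simulation of the Plinko picture (all numbers here are made up): many small independent kicks sum to a Gaussian spread, the discrete precursor of Brownian motion.

```python
import numpy as np

rng = np.random.default_rng(0)
n_balls, n_pegs = 10_000, 1_000
# Each peg kicks the ball left or right by a small amount; scaling the kicks by
# 1/sqrt(n_pegs) keeps the total variance fixed as the number of steps grows.
kicks = rng.choice([-1.0, 1.0], size=(n_balls, n_pegs)) / np.sqrt(n_pegs)
final_positions = kicks.sum(axis=1)
print(final_positions.mean(), final_positions.std())  # approximately 0 and 1
```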

Notes on Optimization

More Observations

See math in “DiffusionMath”

Last Reviewed: 1/23/25

More Resources I should look at: Sander Sander2