A Brief Exploration of Diffusion Probabilistic Models with Code Implementation

This is the first post in the Paper Research series, in which I will keep posting personal study notes on the papers I read. This post introduces the foundational work on diffusion probabilistic models (DPMs), covering the derivation of the key formulas and a simple code verification. The content is drawn mainly from Sohl-Dickstein et al. (2015) and Ho et al. (2020). If you have any suggestions about this post or would like to discuss it, please leave a comment below.
Diffusion Models
What are Diffusion Models? Quoting Weng (2021):
Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise.

My personal understanding is that diffusion models are a framework (Fig. 1) in which the forward process has no trainable parameters while the reverse process does, and there is no restriction on the type of neural network used to implicitly express the distribution required in the reverse process.

Forward Process
The forward process is a Markov process. According to Wikipedia, a Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, “What happens next depends only on the state of affairs now.”
The main goal of the forward process is to gradually convert the data distribution $q(\mathbf{x}_0)$ into an analytically tractable distribution $\pi(\mathbf{y})$ by repeated application of a Markov diffusion kernel $T_\pi(\mathbf{y}|\mathbf{y}'; \beta)$ whose stationary distribution is $\pi(\mathbf{y})$, where $\beta$ is the diffusion rate, $$\pi(\mathbf{y}) = \int \mathrm{d}\mathbf{y}' \ T_\pi(\mathbf{y}|\mathbf{y}'; \beta) \pi(\mathbf{y}') \tag{1}$$ $$q(\mathbf{x}_t|\mathbf{x}_{t-1}) = T_\pi(\mathbf{x}_t|\mathbf{x}_{t-1}; \beta_t) \tag{2}$$
The forward trajectory (joint distribution), corresponding to starting at the data distribution and performing T steps of diffusion, is thus $$q(\mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_T) = q(\mathbf{x}_{(0\cdots T)}) = q(\mathbf{x}_0)\prod_{t=1}^T q(\mathbf{x}_t | \mathbf{x}_{t-1}) \tag{3}$$
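To make Eqs. (2)-(3) concrete, here is a minimal NumPy sketch of a forward trajectory. It assumes the Gaussian diffusion kernel that will be specified later in Eq. (26); the toy data distribution and variable names are my own, not from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # diffusion rates beta_1, ..., beta_T

# A toy data distribution q(x_0): a tight 2D blob far from the origin.
x = rng.normal(loc=5.0, scale=0.1, size=(10_000, 2))

# Repeatedly apply the Markov diffusion kernel q(x_t | x_{t-1}); after T steps
# the samples are approximately distributed as the tractable prior pi = N(0, I).
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

print(x.mean(axis=0))   # ~ [0, 0]
print(x.std(axis=0))    # ~ [1, 1]
```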

Reverse Process
The reverse process is also a Markov process. If we could reverse the above process and sample from $q(\mathbf{x}_{t-1} | \mathbf{x}_t)$, we would be able to recreate a true sample from a Gaussian noise input, $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Unfortunately, we cannot easily estimate $q(\mathbf{x}_{t-1} | \mathbf{x}_t)$, and therefore we need to learn a model $p_\theta$ to approximate these conditional probabilities in order to run the reverse process. We want $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$ to approximate $q(\mathbf{x}_{t-1} | \mathbf{x}_t)$ as closely as possible for all $t$.
The generative distribution $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$ will be trained to describe the same trajectory (also a joint distribution), but in reverse,
$$p_\theta (\mathbf{x}_T) = \pi(\mathbf{x}_T) \tag{4}$$ $$ p_\theta(\mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_T) = p_\theta(\mathbf{x}_{(0\cdots T)}) = p_\theta(\mathbf{x}_T)\prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \tag{5}$$
Generative Model
- The forward trajectory (joint distribution): image to noise.
- The reverse trajectory (joint distribution): noise to image.
- The Model Probability (marginal distribution): the probability the generative model assigns to the data.
The probability the generative model assigns to the data is $$p_\theta(\mathbf{x}_0) = \int \int \cdots \int p_\theta(\mathbf{x}_0, \mathbf{x}_1, \cdots, \mathbf{x}_T) \mathrm{d}\mathbf{x}_1 \mathrm{d}\mathbf{x}_2 \cdots \mathrm{d}\mathbf{x}_T \tag{6}$$ For convenience, we simply denote it as: $$p_\theta(\mathbf{x}_0) = \int \mathrm{d}\mathbf{x}_{(1\cdots T)} \ p_\theta(\mathbf{x}_{(0\cdots T)}) \tag{7}$$
But the integral in Eq. (7) is intractable! We can handle it in much the same spirit as the variational treatment used in VAEs. Taking a cue from annealed importance sampling and the Jarzynski equality, we instead evaluate the relative probability of the forward and reverse trajectories, averaged over forward trajectories, $$ \begin{equation*} \begin{split} p_\theta(\mathbf{x}_0) &= \int \mathrm{d}\mathbf{x}_{(1\cdots T)} \ p_\theta(\mathbf{x}_{(0\cdots T)}) \frac{q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0)}{q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0)} \\ &= \int \mathrm{d}\mathbf{x}_{(1\cdots T)} \ q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0) \frac{\color{red} p_\theta(\mathbf{x}_{(0\cdots T)})}{\color{blue} q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0)} \\ &= \int \mathrm{d}\mathbf{x}_{(1\cdots T)} \ q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0) \frac{\color{red} p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{\color{blue} \frac{q(\mathbf{x}_0, \mathbf{x}_{(1\cdots T)})}{q(\mathbf{x}_0)}} \\ &= \int \mathrm{d}\mathbf{x}_{(1\cdots T)} \ q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0) \frac{p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{\color{blue} \frac{q(\mathbf{x}_0) \prod_{t=1}^T q(\mathbf{x}_t | \mathbf{x}_{t-1})}{q(\mathbf{x}_0)}} \\ &= \int \mathrm{d}\mathbf{x}_{(1\cdots T)} \ q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0) \frac{p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{\prod_{t=1}^T q(\mathbf{x}_t | \mathbf{x}_{t-1})} \\ &= \int \mathrm{d}\mathbf{x}_{(1\cdots T)} \ q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0) p_\theta(\mathbf{x}_T) \prod_{t=1}^T \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \\ \end{split} \end{equation*} \tag{8} $$
Model Log Likelihood
We want the estimated data distribution ($p_\theta(\mathbf{x}_0)$) to be as close as possible to the actual data distribution ($q(\mathbf{x}_0)$). So training amounts to maximizing the model log likelihood, $$ \begin{equation*} \begin{split} \mathcal{L} &= \int \mathrm{d}\mathbf{x}_0 \ q(\mathbf{x}_0) \log p_\theta (\mathbf{x}_0) \\ &= \int \mathrm{d}\mathbf{x}_0 \ q(\mathbf{x}_0) { \log \left [ \int \mathrm{d}\mathbf{x}_{(1\cdots T)} q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0) p_\theta(\mathbf{x}_T) \prod_{t=1}^T \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ]} \\ &= \int \mathrm{d}\mathbf{x}_0 \ q(\mathbf{x}_0) {\color{blue} \log \left \{\mathbb{E}_{\mathbf{x}_{(1\cdots T)} \sim q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0)} \left [ p_\theta(\mathbf{x}_T) \prod_{t=1}^T \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] \right \}} \\ &\geq \int \mathrm{d}\mathbf{x}_0 \ q(\mathbf{x}_0) {\color{blue} \mathbb{E}_{\mathbf{x}_{(1\cdots T)} \sim q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0)} \left \{ \log \left [ p_\theta(\mathbf{x}_T) \prod_{t=1}^T \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] \right \}} \\ &= \int \mathrm{d}\mathbf{x}_0 \ q(\mathbf{x}_0) \int \mathrm{d}\mathbf{x}_{(1\cdots T)} \ q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0) \log \left [ p_\theta(\mathbf{x}_T) \prod_{t=1}^T \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_0) q(\mathbf{x}_{(1\cdots T)} | \mathbf{x}_0) \log \left [ p_\theta(\mathbf{x}_T) \prod_{t=1}^T \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \log \left [ p_\theta(\mathbf{x}_T) \prod_{t=1}^T \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] \\ \end{split} \end{equation*} \tag{9} $$
The blue step in Eq. (9) follows from Jensen's inequality, as illustrated in Fig. 5.
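Since the whole bound rests on this step, a quick numeric sanity check of Jensen's inequality for the concave $\log$ (i.e. $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$) may be reassuring; the toy distribution below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # any positive random variable

lhs = np.log(x.mean())   # log E[X]
rhs = np.log(x).mean()   # E[log X]
print(lhs, ">=", rhs)    # Jensen's inequality for the concave log: lhs >= rhs
```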

So we have a lower bound on $\mathcal{L}$; let us write it as $$\mathcal{K} = \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \log \left [ p_\theta(\mathbf{x}_T) \prod_{t=1}^T \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] \tag{10}$$
1) Peel off $p_\theta(\mathbf{x}_T)$ in $\mathcal{K}$ as an entropy
We can peel off the contribution from $p_\theta(\mathbf{x}_T)$, and rewrite it as an entropy, $$ \begin{equation*} \begin{split} \mathcal{K} &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) {\color{blue} \log \left [ p_\theta(\mathbf{x}_T) \prod_{t=1}^T \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ]} \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) {\color{blue} \left \{ \log p_\theta(\mathbf{x}_T) + \sum_{t=1}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] \right \} } \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) {\color{blue} \sum_{t=1}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ]} + \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) {\color{blue} \log p_\theta(\mathbf{x}_T)} \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) { \sum_{t=1}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ]} + {\color{red} \int \mathrm{d}\mathbf{x}_T \ q(\mathbf{x}_T) \log \underbrace{p_\theta(\mathbf{x}_T)}_{{\normalsize \pi}(\mathbf{x}_T)}} \\ \end{split} \end{equation*} \tag{11} $$
By design, the cross entropy to $\pi(\mathbf{x}_T)$ is constant under our diffusion kernels, and equal to the entropy of $p_\theta(\mathbf{x}_T)$. Therefore, $$ \mathcal{K} = \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=1}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] {\color{red} - \ \mathcal{H}_p(\mathbf{x}_T)} \tag{12} $$
2) Remove the edge effect at $t=0$
In order to avoid edge effects, we set the final step of the reverse trajectory to be identical to the corresponding forward diffusion step, $$p_\theta(\mathbf{x}_0 | \mathbf{x}_1) = q(\mathbf{x}_1 | \mathbf{x}_0) \frac{\pi(\mathbf{x}_{0})}{\pi(\mathbf{x}_{1})} = T_\pi(\mathbf{x}_0 | \mathbf{x}_1 ; \beta_1) \tag{13}$$
We then use this equivalence to remove the contribution of the first time-step in the sum, $$ \begin{equation*} \begin{split} \mathcal{K} &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) {\color{blue} \sum_{t=1}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ]} - \mathcal{H}_p(\mathbf{x}_T) \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) {\color{blue} \sum_{t=2}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ]} + \underbrace{\int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) {\color{blue} \log \frac{p_\theta(\mathbf{x}_0 | \mathbf{x}_1)}{q(\mathbf{x}_1 | \mathbf{x}_0)}}}_{ {\large \mathbb{E}}_{\mathbf{x}_{(0\cdots T)} \sim q(\mathbf{x}_{(0\cdots T)})} {\normalsize \log \frac{p_\theta(\mathbf{x}_0 | \mathbf{x}_1)}{q(\mathbf{x}_1 | \mathbf{x}_0)}}} - \mathcal{H}_p(\mathbf{x}_T) \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] + \int \mathrm{d}\mathbf{x}_{(0, 1)} \ q(\mathbf{x}_0, \mathbf{x}_1) \log \frac{\color{red} p_\theta(\mathbf{x}_0 | \mathbf{x}_1)}{q(\mathbf{x}_1 | \mathbf{x}_0)} - \mathcal{H}_p(\mathbf{x}_T) \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] + \int \mathrm{d}\mathbf{x}_{(0, 1)} \ q(\mathbf{x}_0, \mathbf{x}_1) \log \frac{\color{red} q(\mathbf{x}_1 | \mathbf{x}_0) \pi(\mathbf{x}_0)}{q(\mathbf{x}_1 | \mathbf{x}_0) {\color{red} \pi(\mathbf{x}_1)}} - \mathcal{H}_p(\mathbf{x}_T) \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] + {\color{green} \int \mathrm{d}\mathbf{x}_{(0, 1)} \ q(\mathbf{x}_0, \mathbf{x}_1) \log \frac{\pi(\mathbf{x}_0)}{\pi(\mathbf{x}_1)}} - \mathcal{H}_p(\mathbf{x}_T) \\ \end{split} \end{equation*} \tag{14} $$
For ease of presentation, the green part of Eq. (14) is derived separately, $$ \begin{equation*} \begin{split} {\color{green} \int \mathrm{d}\mathbf{x}_{(0, 1)} \ q(\mathbf{x}_0, \mathbf{x}_1) \log \frac{\pi(\mathbf{x}_0)}{\pi(\mathbf{x}_1)}} &= \int \mathrm{d}\mathbf{x}_{(0, 1)} \ q(\mathbf{x}_0, \mathbf{x}_1) \left [ \log \pi(\mathbf{x}_0) - \log \pi(\mathbf{x}_1) \right ] \\ &= \int \mathrm{d}\mathbf{x}_{(0, 1)} \ q(\mathbf{x}_0, \mathbf{x}_1) \log \pi(\mathbf{x}_0) - \int \mathrm{d}\mathbf{x}_{(0, 1)} \ q(\mathbf{x}_0, \mathbf{x}_1) \log \pi(\mathbf{x}_1) \\ &= {\color{red} \int \mathrm{d}\mathbf{x}_{0} \ q(\mathbf{x}_{0}) \log \pi(\mathbf{x}_0)} - {\color{red} \int \mathrm{d}\mathbf{x}_{1} \ q(\mathbf{x}_{1}) \log \pi(\mathbf{x}_1)} \\ &= {\color{red} \left [ -\mathcal{H}_p(\mathbf{x}_T) \right ]} - {\color{red} \left [ -\mathcal{H}_p(\mathbf{x}_T) \right ]} \\ &= 0 \\ \end{split} \end{equation*} \tag{15} $$
where we again used the fact that by design $\color{red} -\int \mathrm{d}\mathbf{x}_t \ q(\mathbf{x}_t) \log \pi(\mathbf{x}_t) = \mathcal{H}_p(\mathbf{x}_T)$ is a constant for all $t$.
Therefore, the lower bound in Eq. (14) becomes $$\mathcal{K} = \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} \right ] - \mathcal{H}_p(\mathbf{x}_T) \tag{16}$$
3) Rewrite in terms of $q(\mathbf{x}_{t-1} | \mathbf{x}_t)$
Because the forward trajectory is a Markov process, $$ \begin{equation*} q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \left \{ \begin{matrix} q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) & , t > 1 \\ q(\mathbf{x}_1 | \mathbf{x}_{0}, \mathbf{x}_0) = q(\mathbf{x}_1 | \mathbf{x}_{0}) & , t = 1 \end{matrix} \right . \end{equation*} \tag{17} $$
$$ \mathcal{K} = \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{\color{blue} q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0)} \right ] - \mathcal{H}_p(\mathbf{x}_T) \tag{18} $$
Using Bayes’ rule we can rewrite this in terms of a posterior and marginals from the forward trajectory, $$ \mathcal{K} = \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{\color{blue} q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)} \frac{\color{blue} q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{\color{blue} q(\mathbf{x}_{t} | \mathbf{x}_0)} \right ] - \mathcal{H}_p(\mathbf{x}_T) \tag{19} $$
4) Rewrite in terms of KL divergences and entropies
$$ \begin{equation*} \begin{split} \mathcal{K} &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \left [ \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)} \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_{t} | \mathbf{x}_0)} \right ] - \mathcal{H}_p(\mathbf{x}_T) \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \left [ \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)} + \log \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_{t} | \mathbf{x}_0)} \right ] - \mathcal{H}_p(\mathbf{x}_T) \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)} \\ &\quad + {\color{green} \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_{t} | \mathbf{x}_0)}} - \mathcal{H}_p(\mathbf{x}_T) \\ \end{split} \end{equation*} \tag{20} $$
For ease of presentation, the green part of Eq. (20) is derived separately, $$ \begin{equation*} \begin{split} {\color{green} \int \mathrm{d}\mathbf{x}_{(0\cdots T)}} &\ {\color{green} q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_{t} | \mathbf{x}_0)}} \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \log {\color{blue} \prod_{t=2}^T \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_{t} | \mathbf{x}_0)}} \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \log {\color{blue} \frac{q(\mathbf{x}_{1} | \mathbf{x}_0)}{q(\mathbf{x}_{T} | \mathbf{x}_0)}} \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \left [ \log q(\mathbf{x}_{1} | \mathbf{x}_0) - \log q(\mathbf{x}_{T} | \mathbf{x}_0) \right ] \\ &= \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \log q(\mathbf{x}_{1} | \mathbf{x}_0) - \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \log q(\mathbf{x}_{T} | \mathbf{x}_0) \\ &= {\color{red} \int \mathrm{d}\mathbf{x}_{(0,1)} \ q(\mathbf{x}_0, \mathbf{x}_1) \log q(\mathbf{x}_{1} | \mathbf{x}_0)} - {\color{red} \int \mathrm{d}\mathbf{x}_{(0,T)} \ q(\mathbf{x}_0, \mathbf{x}_T) \log q(\mathbf{x}_{T} | \mathbf{x}_0)} \\ &= {\color{red} \mathcal{H}_q(\mathbf{x}_T | \mathbf{x}_0)} - {\color{red} \mathcal{H}_q(\mathbf{x}_1 | \mathbf{x}_0)} \quad ; \text{(conditional entropy)} \end{split} \end{equation*} \tag{21} $$
Therefore, the lower bound in Eq. (20) becomes $$ \mathcal{K} = {\color{brown} \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)}} + \mathcal{H}_q(\mathbf{x}_T | \mathbf{x}_0) - \mathcal{H}_q(\mathbf{x}_1 | \mathbf{x}_0) - \mathcal{H}_p(\mathbf{x}_T) \tag{22} $$
For ease of presentation, the brown part of Eq. (22) is derived separately, $$ \begin{equation*} \begin{split} {\color{brown} \int \mathrm{d}\mathbf{x}_{(0\cdots T)}} & \ {\color{brown} q(\mathbf{x}_{(0\cdots T)}) \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)}} \\ &= \sum_{t=2}^T \int \mathrm{d}\mathbf{x}_{(0\cdots T)} \ q(\mathbf{x}_{(0\cdots T)}) \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)} \\ &= \sum_{t=2}^T \int \mathrm{d}\mathbf{x}_{0}\mathrm{d}\mathbf{x}_{t-1}\mathrm{d}\mathbf{x}_{t} \ {\color{blue} q(\mathbf{x}_0, \mathbf{x}_{t-1}, \mathbf{x}_t)} \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)} \\ &= \sum_{t=2}^T \int \mathrm{d}\mathbf{x}_{0}\mathrm{d}\mathbf{x}_{t-1}\mathrm{d}\mathbf{x}_{t} \ {\color{blue} q(\mathbf{x}_0, \mathbf{x}_t) q(\mathbf{x}_{t-1}| \mathbf{x}_t, \mathbf{x}_0)} \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)} \\ &= \sum_{t=2}^T \int \mathrm{d}\mathbf{x}_{0}\mathrm{d}\mathbf{x}_{t} \ q(\mathbf{x}_0, \mathbf{x}_t) \underbrace{\color{red} \left \{ \int \mathrm{d}\mathbf{x}_{t-1} \ q(\mathbf{x}_{t-1}| \mathbf{x}_t, \mathbf{x}_0) \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)} \right \} }_{ \begin{array}{c} \small \text{KL Divergence (also called relative entropy)} \\ {\color{violet} \mathcal{D}_{KL}(P \| Q) = \int_{-\infty}^{+\infty} p(x) \log \frac{p(x)}{q(x)} \mathrm{d}x} \end{array} } \\ &= {\color{red} -} \sum_{t=2}^T \int \mathrm{d}\mathbf{x}_{0}\mathrm{d}\mathbf{x}_{t} \ q(\mathbf{x}_0, \mathbf{x}_t) {\color{red} \mathcal{D}_{KL}(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t))} \end{split} \end{equation*} \tag{23} $$
Therefore, the lower bound in Eq. (22) becomes $$ \begin{equation*} \begin{split} \mathcal{K} = &- \sum_{t=2}^T \int \mathrm{d}\mathbf{x}_{0}\mathrm{d}\mathbf{x}_{t} \ q(\mathbf{x}_0, \mathbf{x}_t) \mathcal{D}_{KL}(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)) \\ &+ \mathcal{H}_q(\mathbf{x}_T | \mathbf{x}_0) - \mathcal{H}_q(\mathbf{x}_1 | \mathbf{x}_0) - \mathcal{H}_p(\mathbf{x}_T) \end{split} \end{equation*} \tag{24} $$
Note that the entropies can be analytically computed, and the KL divergence can be analytically computed given $\mathbf{x}_0$ and $\mathbf{x}_t$.
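For example, the KL divergence between two Gaussians with diagonal covariances has a closed form, which is what makes every term in Eq. (24) computable once $\mathbf{x}_0$ and $\mathbf{x}_t$ are given. A minimal sketch (the helper and the made-up parameters below are my own, not from the papers):

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """D_KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# e.g. D_KL( q(x_{t-1} | x_t, x_0) || p_theta(x_{t-1} | x_t) ) with made-up parameters
mu_q, var_q = np.array([0.3, -0.1]), np.array([0.04, 0.04])
mu_p, var_p = np.array([0.25, 0.0]), np.array([0.05, 0.05])
print(gaussian_kl(mu_q, var_q, mu_p, var_p))   # a small non-negative number
```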
Training consists of finding the reverse Markov transitions that maximize this lower bound on the log likelihood, $$ \hat{p}_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \underset{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)}{\arg\max} \ \mathcal{K} \tag{25} $$
Specific Diffusion Kernel
Forward Process
We now specify the Markov diffusion kernel in Eq. (2) to be a Gaussian, $$ q(\mathbf{x}_t | \mathbf{x}_{t-1}) = T_\pi(\mathbf{x}_t | \mathbf{x}_{t-1}; \beta_t) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}) \tag{26} $$
A nice property of the above process is that we can sample $\mathbf{x}_t$ at any arbitrary time step $t$ in closed form using the reparameterization trick. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$: $$ \begin{equation*} \begin{split} \mathbf{x}_t &= \sqrt{\alpha_t} {\color{blue} \mathbf{x}_{t-1}} + \sqrt{1 - \alpha_t} \bm{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t} {\color{blue} (\sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}} \bm{\epsilon}_{t-2})} + \sqrt{1 - \alpha_t} \bm{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + {\color{red} \sqrt{\alpha_t - \alpha_t \alpha_{t-1}} \bm{\epsilon}_{t-2} + \sqrt{1 - \alpha_t} \bm{\epsilon}_{t-1}} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + {\color{red} \sqrt{\sqrt{\alpha_t - \alpha_t \alpha_{t-1}}^2 + \sqrt{1 - \alpha_t}^2} \bar{\bm{\epsilon}}_{t-2}} \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\bm{\epsilon}}_{t-2} \\ &= \cdots \\ &= \sqrt{\bar{\alpha}_t} \mathbf{x}_{0} + \sqrt{1 - \bar{\alpha}_t} \bar{\bm{\epsilon}}_0 \\ &= \sqrt{\bar{\alpha}_t} \mathbf{x}_{0} + \sqrt{1 - \bar{\alpha}_t} {\color{green} \bm{\epsilon}_{t}} \quad \text{; to correspond to the subscript of } \mathbf{x}_t \\ \end{split} \end{equation*} \tag{27} $$ where ${\color{green} \bm{\epsilon}_{t}}, \bm{\epsilon}_{t-1}, \bm{\epsilon}_{t-2}, \cdots \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
Recall, for the red part of Eq. (27), that when we merge two independent zero-mean Gaussians with different variances, $\mathcal{N}(\mathbf{0}, \sigma_1^2\mathbf{I})$ and $\mathcal{N}(\mathbf{0}, \sigma_2^2\mathbf{I})$, the resulting distribution is $\mathcal{N}(\mathbf{0}, (\sigma_1^2 + \sigma_2^2)\mathbf{I})$.
Thus, we have $$ q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I}) \tag{28} $$
Usually, we can afford a larger update step when the sample gets noisier, so $\beta_1 < \beta_2 < \cdots < \beta_T$ and therefore $\bar{\alpha}_1 > \bar{\alpha}_2 > \cdots > \bar{\alpha}_T$.
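A quick NumPy check of this "nice property": noising $\mathbf{x}_0$ step by step with Eq. (26) and sampling $\mathbf{x}_t$ in one shot from Eq. (28) yield the same distribution. The linear $\beta$ schedule below is the one used by Ho et al. (2020); everything else is an illustrative toy.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # beta_1 < beta_2 < ... < beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # \bar{alpha}_t

x0 = rng.normal(loc=2.0, scale=0.5, size=100_000)
t = 500                                # an arbitrary time step (1-indexed)

# (a) iterate the kernel q(x_s | x_{s-1}) for s = 1, ..., t
x = x0.copy()
for s in range(t):
    x = np.sqrt(alphas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)

# (b) sample x_t directly from q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)
x_direct = (np.sqrt(alpha_bars[t - 1]) * x0
            + np.sqrt(1.0 - alpha_bars[t - 1]) * rng.standard_normal(x0.shape))

print(x.mean(), x_direct.mean())   # means agree
print(x.var(), x_direct.var())     # variances agree
```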
Reverse Process
Since the forward kernel in Eq. (26) is Gaussian and the $\beta_t$ are small, we model each reverse transition as a Gaussian as well, $$ p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \bm{\mu}_\theta(\mathbf{x}_t, t), \bm{\sigma}_\theta(\mathbf{x}_t, t)) \tag{29} $$
It is noteworthy that the reverse conditional probability is tractable when conditioned on $\mathbf{x}_0$: $$ q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; { \bm{\tilde{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)}, { \tilde{\beta}_t \mathbf{I}}) \tag{30} $$
Using Bayes’ rule, we have: $$ \begin{equation*} \begin{split} q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) &= q(\mathbf{x}_{t} | \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_{t} | \mathbf{x}_0)} \quad ; \text{bringing in Eq. (26) and Eq. (28)} \\ &\propto \exp \left ( -\frac{1}{2}(\frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{\beta_t}) \right ) \frac{\displaystyle \exp \left ( -\frac{1}{2}( \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0)^2}{1 - \bar{\alpha}_{t-1}} ) \right ) }{\displaystyle \exp \left ( -\frac{1}{2}( \frac{(\mathbf{x}_{t} - \sqrt{\bar{\alpha}_{t}} \mathbf{x}_0)^2}{1 - \bar{\alpha}_{t}} ) \right ) } \\ &= \exp \left ( -\frac{1}{2} ( \frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0)^2}{1 - \bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_{t} - \sqrt{\bar{\alpha}_{t}} \mathbf{x}_0)^2}{1 - \bar{\alpha}_{t}} ) \right ) \\ &= \exp \left ( -\frac{1}{2}( \frac{ \mathbf{x}_t^2 - 2\sqrt{\alpha_t} \mathbf{x}_t {\color{blue} \mathbf{x}_{t-1}} + \alpha_t {\color{red} \mathbf{x}_{t-1}^2} }{\beta_t} + \frac{ {\color{red} \mathbf{x}_{t-1}^2} -2\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 {\color{blue} \mathbf{x}_{t-1}} + \bar{\alpha}_{t-1} \mathbf{x}_0^2 }{1 - \bar{\alpha}_{t-1}} \right. \\ &\qquad \qquad \qquad \qquad \left. - \ \frac{(\mathbf{x}_{t} - \sqrt{\bar{\alpha}_{t}} \mathbf{x}_0)^2}{1 - \bar{\alpha}_{t}} ) \right ) \\ &= \exp \left ( -\frac{1}{2} {\Large (} (\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) {\color{red} \mathbf{x}_{t-1}^2} - (\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) {\color{blue} \mathbf{x}_{t-1}} + C(\mathbf{x}_t, \mathbf{x}_0) {\Large )} \right ) \end{split} \end{equation*} \tag{31} $$ where $C(\mathbf{x}_t, \mathbf{x}_0)$ is some function not involving $\mathbf{x}_{t-1}$ and details are omitted.
Following the general form of the $\mathcal{N}(\mu, \sigma^2)$ probability density function $f(x) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left ( -\frac{1}{2} (\frac{x - \mu}{\sigma})^2 \right )$, we match coefficients with $$ (\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) {\color{red} \mathbf{x}_{t-1}^2} - (\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) {\color{blue} \mathbf{x}_{t-1}} + C(\mathbf{x}_t, \mathbf{x}_0) = (\frac{x-\mu}{\sigma})^2 = \frac{{\color{red} x^2} - 2\mu {\color{blue} x} + \mu^2}{\sigma^2} \tag{32} $$
The variance $(\tilde{\beta}_t \mathbf{I})$ and mean $(\bm{\tilde{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0))$ in Eq. (30) can be parameterized as follows: $$ \begin{equation*} \begin{split} \tilde{\beta}_t = 1 / (\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t}} \beta_t \end{split} \end{equation*} \tag{33} $$
$$ \begin{equation*} \begin{split} \bm{\tilde{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) &= \frac{(\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) \tilde{\beta}_t }{2} = \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 \end{split} \end{equation*} \tag{34} $$
Thanks to the nice property, we can rearrange Eq. (27) as $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \bm{\epsilon}_t) / \sqrt{\bar{\alpha}_t}$ and substitute it into Eq. (34), $$ \begin{equation*} \begin{split} \bm{\mu}_t(\mathbf{x}_t) &= \bm{\tilde{\mu}}_t\left (\mathbf{x}_t, (\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \bm{\epsilon}_t) / \sqrt{\bar{\alpha}_t} \right ) \\ &= \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} (\frac{(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \bm{\epsilon}_t)}{\sqrt{\bar{\alpha}_t}}) \\ &= {\color{red} \frac{1}{\sqrt{\alpha_t}} \left ( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \bm{\epsilon}_t \right ) } \end{split} \end{equation*} \tag{35} $$
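Eqs. (33)-(35) are easy to check numerically. The helper below (my own naming, not from the papers) computes $\bm{\tilde{\mu}}_t$ and $\tilde{\beta}_t$ from the $\beta$ schedule and verifies that eliminating $\mathbf{x}_0$ via Eq. (27) indeed reproduces the mean of Eq. (35).

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_params(x_t, x0, t):
    """tilde_mu_t(x_t, x_0) and tilde_beta_t from Eqs. (33)-(34); t is 1-indexed, t >= 2."""
    a_t, ab_t, ab_prev = alphas[t - 1], alpha_bars[t - 1], alpha_bars[t - 2]
    beta_t = betas[t - 1]
    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    mean = (np.sqrt(a_t) * (1.0 - ab_prev) / (1.0 - ab_t) * x_t
            + np.sqrt(ab_prev) * beta_t / (1.0 - ab_t) * x0)
    return mean, var

t = 500
x0 = rng.normal(size=5)
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps  # Eq. (27)

mean_34, _ = posterior_params(x_t, x0, t)
mean_35 = (x_t - (1.0 - alphas[t - 1]) / np.sqrt(1.0 - alpha_bars[t - 1]) * eps) / np.sqrt(alphas[t - 1])
print(np.allclose(mean_34, mean_35))   # True: Eq. (34) with x_0 eliminated equals Eq. (35)
```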
Loss Function
We define the lower bound of the negative log likelihood as the variational lower bound loss function, $$ \begin{equation*} \begin{split} \mathcal{L}_{VLB} &= - \mathcal{K} \\ &= \sum_{t=2}^T \int \mathrm{d}\mathbf{x}_{0}\mathrm{d}\mathbf{x}_{t} \ q(\mathbf{x}_0, \mathbf{x}_t) {\color{blue} \mathcal{D}_{KL}(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t))} \\ &\quad - \mathcal{H}_q(\mathbf{x}_T | \mathbf{x}_0) + \mathcal{H}_q(\mathbf{x}_1 | \mathbf{x}_0) + \mathcal{H}_p(\mathbf{x}_T) \\ &= \sum_{t=2}^T {\Large \mathbb{E}}_{\mathbf{x}_0, \mathbf{x}_t \sim q(\mathbf{x}_0, \mathbf{x}_t)} {\Large [} \underbrace{\color{blue} \mathcal{D}_{KL}(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t))}_{\color{blue} \mathcal{L}_{t-1}} {\Large ]} - \mathcal{H}_q(\mathbf{x}_T | \mathbf{x}_0) + \mathcal{H}_q(\mathbf{x}_1 | \mathbf{x}_0) + \mathcal{H}_p(\mathbf{x}_T) \end{split} \end{equation*} \tag{36} $$
Recall that we need to learn a model to approximate the conditional probability distributions in the reverse diffusion process, $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \bm{\mu}_\theta(\mathbf{x}_t, t), \bm{\sigma}_\theta(\mathbf{x}_t, t))$. We would like to train $\bm{\mu}_\theta(\mathbf{x}_t, t)$ to predict $\bm{\mu}_t(\mathbf{x}_t)$ in Eq. (35), and set $\bm{\sigma}_\theta(\mathbf{x}_t, t)$ to $\sigma_t^2\mathbf{I}$, where $\sigma_t^2$ equals $\tilde{\beta}_t$ in Eq. (33), or simply $\beta_t$. The loss term $\mathcal{L}_{t-1}$ in Eq. (36) is parameterized to minimize the difference from $\bm{\mu}_t(\mathbf{x}_t)$: $$ \begin{equation*} \begin{split} \mathcal{L}_{t-1} &= \mathcal{D}_{KL}(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)) \\ &= \int \mathrm{d}\mathbf{x}_{t-1} \ q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \log \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})} \\ &= {\Large \mathbb{E}}_{\small \mathbf{x}_{t-1} \sim q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)} \left [ \log \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})} \right ] \\ &\propto {\Large \mathbb{E}}_{\small \mathbf{x}_0 \sim q(\mathbf{x}_0), \bm{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left [ \frac{1}{2\sigma_t^2} \left \| {\color{blue} \bm{\mu}_t(\mathbf{x}_t)} - {\color{red} \bm{\mu}_\theta(\mathbf{x}_t, t)} \right \| ^2 \right ] \\ &= {\Large \mathbb{E}}_{\small \mathbf{x}_0 \sim q(\mathbf{x}_0), \bm{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left [ \frac{1}{2\sigma_t^2} \left \| {\color{blue} \frac{1}{\sqrt{\alpha_t}} \left ( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \bm{\epsilon}_t \right )} - {\color{red} \bm{\mu}_\theta(\mathbf{x}_t, t)} \right \| ^2 \right ] \end{split} \end{equation*} \tag{37} $$
Because $\mathbf{x}_t$ is available as input at training time, we can reparameterize the Gaussian noise term instead to make it predict $\bm{\epsilon}$ from the input $\mathbf{x}_t$ at time step $t$: $$ \bm{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left ( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \bm{\epsilon}_\theta(\mathbf{x}_t, t) \right ) \tag{38} $$ where $\bm{\epsilon}_\theta$ is a function approximator (the model) intended to predict $\bm{\epsilon}$ from $\mathbf{x}_t$.
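For concreteness in the sketches that follow, $\bm{\epsilon}_\theta(\mathbf{x}_t, t)$ can be any network that takes the noisy sample and the time step. Ho et al. (2020) use a U-Net; the tiny PyTorch MLP below is purely a hypothetical stand-in of my own so that the sampling and training sketches later on are runnable.

```python
import torch
import torch.nn as nn

class ToyEpsModel(nn.Module):
    """A hypothetical stand-in for eps_theta(x_t, t): a small MLP that takes the
    noisy sample concatenated with a normalized time step."""
    def __init__(self, data_dim: int, hidden: int = 128, T: int = 1000):
        super().__init__()
        self.T = T
        self.net = nn.Sequential(
            nn.Linear(data_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, data_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_feat = (t.float() / self.T).unsqueeze(-1)   # normalize t in [1, T] to (0, 1]
        return self.net(torch.cat([x_t, t_feat], dim=-1))
```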
Thus, Eq. (29) can be written as $$ p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t}) = \mathcal{N} \left ( \mathbf{x}_{t-1} ; \frac{1}{\sqrt{\alpha_t}} \left ( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \bm{\epsilon}_\theta(\mathbf{x}_t, t) \right ) , \tilde{\beta}_t \mathbf{I} \right ) \tag{39} $$
According to Eq. (39), sampling $\mathbf{x}_{t-1} \sim p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_{t})$ is: $$ \mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left ( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \bm{\epsilon}_\theta(\mathbf{x}_t, t) \right ) + \sqrt{\tilde{\beta}_t} \bm{\epsilon}^* \tag{40} $$ where $\bm{\epsilon}^* \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
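Iterating Eq. (40) from $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ down to $t = 1$ gives the full reverse trajectory. The PyTorch sketch below mirrors the spirit of Algorithm 2 in Ho et al. (2020) but is my own minimal rendering, reusing the hypothetical `ToyEpsModel` defined above; following common practice it adds no noise at the final step (where $\tilde{\beta}_1 = 0$ anyway).

```python
import torch

@torch.no_grad()
def sample(model, shape, T=1000):
    """Ancestral sampling of x_0 by iterating Eq. (40) from t = T down to t = 1."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                                       # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        a_t, ab_t = alphas[t - 1], alpha_bars[t - 1]
        ab_prev = alpha_bars[t - 2] if t > 1 else torch.tensor(1.0)
        tilde_beta = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t - 1]   # Eq. (33)

        eps_hat = model(x, torch.full((shape[0],), t))
        mean = (x - (1.0 - a_t) / torch.sqrt(1.0 - ab_t) * eps_hat) / torch.sqrt(a_t)
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + torch.sqrt(tilde_beta) * noise                 # Eq. (40)
    return x

# Usage (an untrained ToyEpsModel just produces noise; a trained one yields data-like samples):
# samples = sample(ToyEpsModel(data_dim=2), shape=(16, 2))
```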
Furthermore, with the parameterization in Eq. (38), the $\mathcal{L}_{t-1}$ in Eq. (37) simplifies to: $$ \begin{equation*} \begin{split} \mathcal{L}_{t-1} &= {\Large \mathbb{E}}_{\small \mathbf{x}_0 \sim q(\mathbf{x}_0), \bm{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left [ \frac{1}{2\sigma_t^2} \left \| {\color{blue} \frac{1}{\sqrt{\alpha_t}} \left ( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \bm{\epsilon}_t \right )} - {\color{red} \bm{\mu}_\theta(\mathbf{x}_t, t)} \right \| ^2 \right ] \\ &= {\Large \mathbb{E}}_{\small \mathbf{x}_0 \sim q(\mathbf{x}_0), \bm{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left [ \frac{1}{2\sigma_t^2} \left \| {\color{blue} \frac{1}{\sqrt{\alpha_t}} \left ( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \bm{\epsilon}_t \right )} - {\color{red} \frac{1}{\sqrt{\alpha_t}} \left ( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \bm{\epsilon}_\theta(\mathbf{x}_t, t) \right )} \right \| ^2 \right ] \\ &= {\Large \mathbb{E}}_{\small \mathbf{x}_0 \sim q(\mathbf{x}_0), \bm{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left [ \frac{1}{2\sigma_t^2} \left \| \frac{1}{\sqrt{\alpha_t}} \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \left (\bm{\epsilon}_t - \bm{\epsilon}_\theta(\mathbf{x}_t, t) \right ) \right \| ^2 \right ] \\ &= {\Large \mathbb{E}}_{\small \mathbf{x}_0 \sim q(\mathbf{x}_0), \bm{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left [ \frac{\beta_t^2}{2 \alpha_t (1 - \bar{\alpha}_t) \sigma_t^2} \left \| \bm{\epsilon}_t - \bm{\epsilon}_\theta({\color{orange} \mathbf{x}_t}, t) \right \| ^2 \right ] \\ &= {\Large \mathbb{E}}_{\small \mathbf{x}_0 \sim q(\mathbf{x}_0), \bm{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left [ {\color{green} \frac{\beta_t^2}{2 \alpha_t (1 - \bar{\alpha}_t) \sigma_t^2}} \left \| \bm{\epsilon}_t - \bm{\epsilon}_\theta({\color{orange} \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \bm{\epsilon}_t}, t) \right \| ^2 \right ] \quad ; \text{bringing in Eq. (27)} \\ \end{split} \end{equation*} \tag{41} $$
Simplification
Empirically, Ho et al. (2020) found that training the diffusion model works better with a simplified objective that ignores the weighting term (the green part in Eq. (41)): $$ \mathcal{L}_t^{\text{simple}} = {\Large \mathbb{E}}_{\small \mathbf{x}_0 \sim q(\mathbf{x}_0), \bm{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left [ \left \| \bm{\epsilon}_t - \bm{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \bm{\epsilon}_t, t) \right \| ^2 \right ] \tag{42} $$
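Combining Eq. (42) with the one-shot sampling of Eq. (27) gives a very short training loop. The PyTorch sketch below mirrors the spirit of Algorithm 1 in Ho et al. (2020) but is my own minimal rendering; it reuses the hypothetical `ToyEpsModel` defined earlier and a toy 2D data distribution standing in for $q(\mathbf{x}_0)$.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

model = ToyEpsModel(data_dim=2, T=T)            # hypothetical eps_theta from above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(x0):
    """One gradient step on the simplified objective of Eq. (42)."""
    t = torch.randint(1, T + 1, (x0.shape[0],))             # t ~ Uniform{1, ..., T}
    ab = alpha_bars[t - 1].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps  # Eq. (27)
    loss = ((eps - model(x_t, t)) ** 2).mean()              # Eq. (42)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy data standing in for q(x_0): a 2D Gaussian blob centred at (2, 2).
for step in range(200):
    loss = training_step(torch.randn(256, 2) * 0.5 + 2.0)
print(loss)
```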
Simple Code Implementation
The Jupyter notebook is available as a GitHub Gist.
You can click the button at the top of the notebook to open it in Colab and run the code for free.
Citation
Cited as:
- Gavin, Sun. (May 2023). A Brief Exploration to Diffusion Probabilistic Models [Blog post]. Retrieved from https://gavinsun0921.github.io/posts/paper-reading-01/.
Or
@online{gavin2023diffusion,
title = {A Brief Exploration to Diffusion Probabilistic Models},
author = {Gavin, Sun},
year = {2023},
month = {May},
url = {\url{https://gavinsun0921.github.io/posts/paper-reading-01/}}
}
References
[1] Jascha Sohl-Dickstein et al. “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.” ICML 2015.
[2] Jonathan Ho et al. “Denoising Diffusion Probabilistic Models.” NeurIPS 2020.
[3] Jiaming Song et al. “Denoising Diffusion Implicit Models.” ICLR 2021.
[4] Alex Nichol & Prafulla Dhariwal. “Improved Denoising Diffusion Probabilistic Models.” ICML 2021.
[5] Lilian Weng. “What are Diffusion Models?” [Blog post] Lil’Log 2021.
[6] Ayan Das. “An Introduction to Diffusion Probabilistic Models.” [Blog post] Ayan’s Blog 2021.