A Brief Exploration of the Variational Autoencoder (VAE) with Code Implementation

This is the second post in the Paper Research series, in which I continue to share personal study notes from reading papers. This post introduces the basics of the variational autoencoder (VAE), including the derivation of its loss function and a simple code verification.
Autoencoder
An autoencoder is a neural network designed to learn an identity function in an unsupervised way: it reconstructs the original input while compressing the data in the process, so as to discover a more efficient, compressed representation. The autoencoder was first proposed as a nonlinear generalization of principal component analysis (PCA) by Kramer (1991) and was later popularized by the seminal paper of Hinton & Salakhutdinov (2006).
It consists of two networks:
- Encoder network $g_\phi$: It translates the original high-dimensional input into a low-dimensional latent code. The input size is larger than the output size.
- Decoder network $f_\theta$: It recovers the high-dimensional data from the low-dimensional latent code. The input size is smaller than the output size.

The encoder network essentially performs dimensionality reduction, just as we would use Principal Component Analysis (PCA) or Matrix Factorization (MF). In addition, the autoencoder is explicitly optimized to reconstruct the data from the code. A good intermediate representation not only captures the latent variables but also supports a full decompression process.
The model contains an encoder function $g_\phi(\cdot)$ parameterized by $\phi$ and a decoder function $f_\theta(\cdot)$ parameterized by $\theta$. The low-dimensional latent code learned for input $\mathbf{x}$ in the bottleneck layer is $\mathbf{z} = g_\phi(\mathbf{x})$, and the reconstructed input is $\mathbf{x}' = f_\theta(\mathbf{z}) = f_\theta(g_\phi(\mathbf{x}))$. The parameters $(\phi, \theta)$ are learned together so that the reconstructed sample matches the original input, $\mathbf{x} \approx f_\theta(g_\phi(\mathbf{x}))$.
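To make this concrete, here is a minimal sketch of a fully-connected autoencoder (assuming PyTorch; the class name, layer sizes, and dimensions are illustrative and not taken from any referenced implementation):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal fully-connected autoencoder: x -> z -> x'."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder g_phi: high-dimensional input -> low-dimensional code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),
        )
        # Decoder f_theta: low-dimensional code -> reconstruction
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)      # z = g_phi(x)
        x_hat = self.decoder(z)  # x' = f_theta(z)
        return x_hat

# Training would minimize a reconstruction error such as MSE(x_hat, x).
```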
VAE
The idea of the Variational Autoencoder (Kingma & Welling, 2014), VAE for short, is actually less similar to the autoencoder model above than the name suggests; it is deeply rooted in the methods of variational Bayesian inference and graphical models.

Instead of mapping the input into a fixed vector, we want to map it into a distribution. In Fig. 2, solid lines denote the generative model $p_\theta(\mathbf{x} \mid \mathbf{z})$ with the analytically tractable prior distribution $p_\theta(\mathbf{z})$, while dashed lines denote the variational approximation $q_\phi(\mathbf{z} \mid \mathbf{x})$ to the intractable posterior distribution $p_\text{data}(\mathbf{z} \mid \mathbf{x})$. The variational parameters $\phi$ are learned jointly with the generative model parameters $\theta$.
Problem Scenario
Suppose our dataset consists of i.i.d. samples $\{ \mathbf{x}_i \in \mathbb{R}^D \} _ {i=1} ^N$ from an unknown data distribution $p_\text{data}(\mathbf{x})$. We wish to represent the distribution $p_\text{data}(\mathbf{x})$ of $\mathbf{x}$ with the help of the latent variable $\mathbf{z}$.
$$ p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z}) p_\theta(\mathbf{z}) \mathrm{d} \mathbf{z} \tag{1} $$
We want $p_\theta(\mathbf{x})$ to approximate $p_\text{data}(\mathbf{x})$, so that (theoretically) we both represent $p_\text{data}(\mathbf{x})$ in terms of latent variable $\mathbf{z}$ and get the generative model $p_\theta(\mathbf{x} \mid \mathbf{z})$, killing two birds with one stone.
Loss Function: ELBO
Assume that we know the real parameter $\theta^*$ for this distribution. To generate a sample that looks like a real data point $\mathbf{x}^{(i)}$, we follow these steps:
- Sample a $\mathbf{z}^{(i)}$ from the prior distribution $p_{\theta^*}(\mathbf{z})$.
- Generate $\mathbf{x}^{(i)}$ from the conditional distribution (generative model) $p_{\theta^*}(\mathbf{x} \mid \mathbf{z} = \mathbf{z}^{(i)})$.
The optimal parameter $\theta^{*}$ is the one that maximizes the probability of generating real data samples:
$$ \theta^* = \argmax_\theta \prod_{i=1}^n p_\theta(\mathbf{x}^{(i)}) \tag{2} $$
Commonly, we use the log probability to convert the product on the RHS into a sum:
$$ \theta^* = \argmax_\theta \sum_{i=1}^n \log p_\theta(\mathbf{x}^{(i)}) \tag{3} $$
Unfortunately, the integral of $p_\theta(\mathbf{x})$ in Eq. (1) is intractable. Kingma & Welling (2014) chose to use $q_\phi(\mathbf{z} \mid \mathbf{x})$ to approximate $p_\text{data}(\mathbf{z} \mid \mathbf{x})$. They focused primarily on describing the posterior $p_\text{data}(\mathbf{z} \mid \mathbf{x})$, which is difficult to compute, so the EM algorithm cannot be applied to this problem. But Su (2018) gives another way in: work directly with joint distributions. First we write out the joint distribution built from the prior $p_\theta(\mathbf{z})$ and the conditional distribution (generative model) $p_\theta(\mathbf{x} \mid \mathbf{z})$:
$$ p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z}) p_\theta(\mathbf{z}) \tag{4} $$
We define the joint distribution $q_\phi(\mathbf{x}, \mathbf{z})$ based on the data distribution $p_\text{data}(\mathbf{x})$ and the variational approximation $q_\phi(\mathbf{z} \mid \mathbf{x})$ of the posterior distribution:
$$ q_\phi(\mathbf{x}, \mathbf{z}) = q_\phi(\mathbf{z} \mid \mathbf{x}) p_\text{data}(\mathbf{x}) \tag{5} $$
We want these two joint distributions to be as close to each other as possible, so we use the KL divergence to measure the distance between them and make it as small as possible.
$$ \begin{align} D_\text{KL}(q_\phi(\mathbf{x},\mathbf{z}) \| p_\theta(\mathbf{x}, \mathbf{z})) & = \int\int q_\phi(\mathbf{x}, \mathbf{z}) \ln \frac{q_\phi(\mathbf{x}, \mathbf{z})}{p_\theta(\mathbf{x}, \mathbf{z})} \mathrm{d}\mathbf{z} \mathrm{d}\mathbf{x} \\ &= \int\int p_\text{data}(\mathbf{x})q_\phi(\mathbf{z} \mid \mathbf{x}) \ln \frac{p_\text{data}(\mathbf{x})q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{x},\mathbf{z})} \mathrm{d} \mathbf{z} \mathrm{d} \mathbf{x} \\ &= \int p_\text{data}(\mathbf{x}) \left [ \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln \frac{p_\text{data}(\mathbf{x})q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{x},\mathbf{z})} \mathrm{d}\mathbf{z} \right ] \mathrm{d} \mathbf{x} \\ &= \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln \frac{p_\text{data}(\mathbf{x})q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{x},\mathbf{z})} \mathrm{d}\mathbf{z} \right ] \\ &= \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln p_\text{data}(\mathbf{x}) \mathrm{d} \mathbf{z} + \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln \frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{x},\mathbf{z})} \mathrm{d}\mathbf{z} \right ] \\ &= \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ \ln p_\text{data}(\mathbf{x}) {\color{blue} \int q_\phi(\mathbf{z} \mid \mathbf{x}) \mathrm{d} \mathbf{z}} \right ] + \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln \frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{x},\mathbf{z})} \mathrm{d} \mathbf{z} \right ] \\ &= {\color{red} \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ \ln p_\text{data}(\mathbf{x}) \right ]} + \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln \frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{x},\mathbf{z})} \mathrm{d} \mathbf{z} \right ] \end{align} \tag{6} $$ where the integral of the blue part equals 1.
$p_\text{data}(\mathbf{x})$ is the data distribution underlying the samples $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n$. Although we cannot write its expression explicitly, it does exist, so for any particular dataset the red part in Eq. (6) is a constant $C_\text{data}$.
So the loss function can be written as:
$$ \mathcal{L} = D_\text{KL}(q_\phi(\mathbf{x},\mathbf{z}) \| p_\theta(\mathbf{x}, \mathbf{z})) - {\color{red} C_\text{data}} = \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln \frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{\color{blue} p_\theta(\mathbf{x},\mathbf{z})} \mathrm{d} \mathbf{z} \right ] \tag{7} $$
Because of the nonnegativity of the KL divergence, our loss function possesses a lower bound $-\mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ \ln p_\text{data}(\mathbf{x}) \right ]$.
To bring out the generative model $p_\theta(\mathbf{x} \mid \mathbf{z})$, we expand the joint distribution $p_\theta(\mathbf{x},\mathbf{z})$ in the blue part of Eq. (7) as $p_\theta(\mathbf{x} \mid \mathbf{z}) p_\theta(\mathbf{z})$:
$$ \begin{align} \mathcal{L} &= \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln \frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{x} \mid \mathbf{z}) p_\theta(\mathbf{z})} \mathrm{d} \mathbf{z} \right ] \\ &= \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ \int q_\phi(\mathbf{z} \mid \mathbf{x}) \left ( \ln \frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z})} - \ln p_\theta(\mathbf{x} \mid \mathbf{z}) \right ) \mathrm{d} \mathbf{z} \right ] \\ &= \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln \frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z})} \mathrm{d} \mathbf{z} - \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln p_\theta(\mathbf{x} \mid \mathbf{z}) \mathrm{d} \mathbf{z} \right ] \\ &= \mathbb{E}_{\mathbf{x} \sim p_\text{data}(\mathbf{x})} \left [ D_\text{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z})) - \mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})} [\ln p_\theta(\mathbf{x} \mid \mathbf{z})] \right ] \end{align} \tag{8} $$
The expression inside the outer brackets of Eq. (8) is the loss function of the VAE. Note that although the loss in Eq. (8) is composed of two parts, they cannot be treated as two separate optimization problems and minimized independently.
- When $D_\text{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z}))$ is 0, there is no difference between the two distributions $q_\phi(\mathbf{z} \mid \mathbf{x})$ and $p_\theta(\mathbf{z})$, i.e., $\mathbf{z}$ carries no information about $\mathbf{x}$ (the two are independent). Predicting $\mathbf{x}$ from such a $\mathbf{z}$ must then be inaccurate, i.e., $-\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})} [\ln p_\theta(\mathbf{x} \mid \mathbf{z})]$ cannot be small.
- When $-\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})} [\ln p_\theta(\mathbf{x} \mid \mathbf{z})]$ is small, $\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})} \left [ p_\theta(\mathbf{x} \mid \mathbf{z}) \right ]$ is large, i.e., predicting $\mathbf{x}$ from $\mathbf{z}$ is very accurate. The relationship between $\mathbf{x}$ and $\mathbf{z}$ must then be strong, i.e., $q_\phi(\mathbf{z} \mid \mathbf{x})$ cannot be too random, so $D_\text{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z}))$ will not be small.
So the two parts of the loss are antagonistic to each other, and the loss function must be viewed as a whole rather than term by term. In fact, this is exactly what GANs dream of: a single overall metric that indicates how the training of the generative model is progressing. This capability comes naturally with VAE models, whereas GANs did not obtain it until WGAN.
Build the Network
So far, there are three distributions in the loss function of Eq. (8) that we do not know: $q_\phi(\mathbf{z} \mid \mathbf{x})$, $p_\theta(\mathbf{x} \mid \mathbf{z})$, and $p_\theta(\mathbf{z})$. As for $p_\text{data}(\mathbf{x})$, while we cannot write its expression explicitly, sampling from it is easy (just draw samples from the dataset). To solve the problem in practice, we need to specify the three unknown distributions above, or at least determine their form.

(Image source: Weng, 2018)
1) latent variable distribution
To make it easy for the generative model to sample the latent variable $\mathbf{z}$ when generating data, we assume that $\mathbf{z} \sim p_\theta(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$.
2) posterior distribution approximation
We assume that $q_\phi(\mathbf{z} \mid \mathbf{x})$ is also a multivariate normal distribution (with independent components), with its mean and variance determined by $\mathbf{x}$. The “determination” process is in fact a neural network with parameters $\phi$. $$ \begin{align} q_\phi(\mathbf{z} \mid \mathbf{x}) = \frac{1}{\prod\limits_{j} \sqrt{2\pi[\pmb{\sigma}_\phi(\mathbf{x})]_j^2}} \exp \left( - \frac{1}{2} \left\| \frac{\mathbf{z} - \pmb{\mu}_\phi(\mathbf{x})}{\pmb{\sigma}_\phi(\mathbf{x})} \right\|^2 \right) \end{align} \tag{9} $$ where the product runs over the components of $\mathbf{z}$ and the division inside the norm is element-wise.
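In code, this “determination” network is usually a small model that outputs the mean and log-variance of $q_\phi(\mathbf{z} \mid \mathbf{x})$ for each input; predicting $\ln \sigma^2$ rather than $\sigma^2$ keeps the variance positive without extra constraints. A minimal sketch (PyTorch assumed; names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps x to the parameters (mu, log sigma^2) of q_phi(z | x)."""
    def __init__(self, input_dim=784, latent_dim=2):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)      # mu_phi(x)
        self.fc_logvar = nn.Linear(256, latent_dim)  # ln sigma_phi^2(x)

    def forward(self, x):
        h = self.hidden(x)
        return self.fc_mu(h), self.fc_logvar(h)
```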
Therefore, the KL divergence part of the loss function in Eq. (8) can be written out in closed form, following Appendix B of Kingma & Welling (2014):
$$ \begin{align} D_\text{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z})) &= \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln \frac{q_\phi(\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z})} \mathrm{d} \mathbf{z} \\ &= \int q_\phi(\mathbf{z} \mid \mathbf{x}) (\ln q_\phi(\mathbf{z} \mid \mathbf{x}) - \ln p_\theta(\mathbf{z})) \mathrm{d} \mathbf{z} \\ &= {\color{blue} \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln q_\phi(\mathbf{z} \mid \mathbf{x}) \mathrm{d} \mathbf{z}} - {\color{red} \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln p_\theta(\mathbf{z}) \mathrm{d} \mathbf{z}} \end{align} \tag{10} $$
Let $\mathcal{J}$ be the dimensionality of $\mathbf{z}$, and let $\mu_j$ and $\sigma_j$ denote the $j\text{-th}$ elements of $\pmb{\mu}_\phi(\mathbf{x})$ and $\pmb{\sigma}_\phi(\mathbf{x})$. The blue part of Eq. (10): $$ \begin{align} {\color{blue} \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln q_\phi(\mathbf{z} \mid \mathbf{x}) \mathrm{d} \mathbf{z}} &= \int \mathcal{N}(\mathbf{z}; \pmb{\mu}_\phi(\mathbf{x}), \pmb{\sigma}_\phi^2(\mathbf{x})) \ln \mathcal{N}(\mathbf{z}; \pmb{\mu}_\phi(\mathbf{x}), \pmb{\sigma}_\phi^2(\mathbf{x})) \mathrm{d} \mathbf{z} \\ &= - \frac{\mathcal{J}}{2} \ln(2\pi) - \frac{1}{2} \sum_{j=1}^{\mathcal{J}} (1 + \ln \sigma_j^2) \end{align} \tag{11} $$
The red part of Eq. (10): $$ \begin{align} {\color{red} \int q_\phi(\mathbf{z} \mid \mathbf{x}) \ln p_\theta(\mathbf{z}) \mathrm{d} \mathbf{z}} &= \int \mathcal{N}(\mathbf{z}; \pmb{\mu}_\phi(\mathbf{x}), \pmb{\sigma}_\phi^2(\mathbf{x})) \ln \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I}) \mathrm{d} \mathbf{z} \\ &= - \frac{\mathcal{J}}{2} \ln(2\pi) - \frac{1}{2} \sum_{j=1}^{\mathcal{J}} (\mu_j^2 + \sigma_j^2) \end{align} \tag{12} $$
Thus, Eq. (10) can be written as
$$ D_\text{KL}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z})) = \frac{1}{2} \sum_{j=1}^{\mathcal{J}} (\mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1) \tag{13} $$
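Eq. (13) is the form that is implemented directly in code. As a sanity check, the closed form can be compared against a library computation of the same KL divergence; this small sketch (PyTorch assumed, random illustrative inputs) should print True:

```python
import torch
from torch.distributions import Normal, kl_divergence

mu = torch.randn(4, 2)      # mu_phi(x): batch of 4, J = 2
logvar = torch.randn(4, 2)  # ln sigma_phi^2(x)
sigma = torch.exp(0.5 * logvar)

# Closed form from Eq. (13), summed over the J latent dimensions
kl_closed = 0.5 * torch.sum(mu**2 + sigma**2 - logvar - 1, dim=1)

# Same KL computed by torch.distributions (per-dimension KLs summed)
q = Normal(mu, sigma)                                       # q_phi(z | x)
p = Normal(torch.zeros_like(mu), torch.ones_like(sigma))    # p_theta(z) = N(0, I)
kl_lib = kl_divergence(q, p).sum(dim=1)

print(torch.allclose(kl_closed, kl_lib, atol=1e-5))
```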
3) generative model approximation
For the distributional assumption on the generative model $p_\theta(\mathbf{x} \mid \mathbf{z})$, Kingma & Welling (2014) give two options: the Bernoulli distribution or the Gaussian distribution.
3.1) Bernoulli distribution
The Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1 with probability $\rho$ and the value 0 with probability $1−\rho$. The probability mass function $\operatorname{f}$ of this distribution, over possible outcomes $\xi$, is
$$ \operatorname{f}(\xi) = \begin{cases} \rho & \text{if } \xi = 1 \\ 1 - \rho & \text{if } \xi = 0 \end{cases} \tag{14} $$
So when the generative model $p_\theta(\mathbf{x} \mid \mathbf{z})$ is a Bernoulli distribution, it is only appropriate for the case where $\mathbf{x}$ is a multivariate binary vector, since a Bernoulli distribution can only produce 0s and 1s. The MNIST dataset used later for the simple code demonstration can be viewed as such a case. Here we use a neural network $\rho(\mathbf{z})$ to compute the parameters $\rho$, and thus obtain
$$ p_\theta(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^D \left ( \rho_{(k)}(\mathbf{z}) \right )^{\mathbf{x}_{(k)}} \left ( 1 - \rho_{(k)}(\mathbf{z}) \right )^{1 - \mathbf{x}_{(k)}} \tag{15} $$
where $D$ is the dimensionality of $\mathbf{x}$. Thus, from the preceding equation, we deduce that
$$ -\ln p_\theta(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^D \left [ -\mathbf{x}_{(k)} \ln \rho_{(k)}(\mathbf{z}) - (1 - \mathbf{x}_{(k)}) \ln \left (1 - \rho_{(k)}(\mathbf{z}) \right ) \right] \tag{16} $$
This means that $\rho(\mathbf{z})$ has to be squashed into the range 0 to 1 (e.g., with a sigmoid activation), and the reconstruction loss becomes a binary cross entropy; $\rho(\mathbf{z})$ then plays a role similar to that of a decoder.
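In code, this corresponds to a decoder whose last layer is a sigmoid, with the reconstruction term of Eq. (16) computed as a binary cross entropy summed over the $D$ dimensions. A minimal sketch (PyTorch assumed; names and sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BernoulliDecoder(nn.Module):
    """Maps z to rho(z) in (0, 1)^D, the Bernoulli parameters of p_theta(x | z)."""
    def __init__(self, latent_dim=2, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, output_dim), nn.Sigmoid(),  # squashes rho(z) into (0, 1)
        )

    def forward(self, z):
        return self.net(z)

def bernoulli_nll(x, rho):
    # Eq. (16): -ln p_theta(x | z) is exactly the binary cross entropy,
    # summed over the D components of each sample.
    return F.binary_cross_entropy(rho, x, reduction="none").sum(dim=1)
```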
3.2) Normal distribution / Gaussian distribution
In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is
$$ \operatorname{f}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp \left \{ -\frac{1}{2}(\frac{x-\mu}{\sigma})^2 \right \} \tag{17} $$
When the generative model $p_\theta(\mathbf{x} \mid \mathbf{z})$ is normally distributed, it is suitable for general real-valued data. In this case, we use the neural networks $\tilde{\mu}(\mathbf{z})$ and $\tilde{\sigma}(\mathbf{z})$ to compute the mean and variance, which yields $$ p_\theta(\mathbf{x} \mid \mathbf{z}) = \frac{1}{\prod\limits_{k=1}^D \sqrt{2 \pi \tilde{\sigma}_{(k)}^2(\mathbf{z})}} \exp \left ( -\frac{1}{2} \left \| \frac{\mathbf{x} - \tilde{\mu}(\mathbf{z})}{\tilde{\sigma}(\mathbf{z})} \right \|^2 \right ) \tag{18} $$
Thus, from the preceding equation, we deduce that $$ -\ln p_\theta(\mathbf{x} \mid \mathbf{z}) = \frac{1}{2} \left \| \frac{\mathbf{x} - \tilde{\mu}(\mathbf{z})}{\tilde{\sigma}(\mathbf{z})} \right \|^2 + \frac{D}{2} \ln 2\pi + \frac{1}{2} \sum_{k=1}^D \ln \tilde{\sigma}_{(k)}^2(\mathbf{z}) \tag{19} $$
For ease of computation, we usually fix the variance to a constant $\sigma^*$. Then the above equation can be written as
$$ -\ln p_\theta(\mathbf{x} \mid \mathbf{z}) = \frac{1}{2 \sigma^*} \left \| \mathbf{x} - \tilde{\mu}(\mathbf{z}) \right \|^2 + C \tag{20} $$
So this part becomes the MSE loss function.
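In code, Eq. (20) with a fixed variance is simply a scaled sum-of-squares reconstruction error. A small sketch (PyTorch assumed; the function name and the default value for $\sigma^*$ are illustrative):

```python
import torch
import torch.nn.functional as F

def gaussian_nll(x, mu_tilde, sigma2=1.0):
    """-ln p_theta(x | z) up to an additive constant, as in Eq. (20),
    with the variance fixed to the constant sigma2 (playing the role of sigma*)."""
    sq_err = F.mse_loss(mu_tilde, x, reduction="none").sum(dim=1)  # ||x - mu_tilde||^2
    return sq_err / (2.0 * sigma2)
```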
Sampling Calculation
After the work above, we can finally write the loss function in Eq. (8) concretely, having made assumptions about each of the three as-yet-undetermined distributions $q_\phi(\mathbf{z} \mid \mathbf{x})$, $p_\theta(\mathbf{x} \mid \mathbf{z})$, and $p_\theta(\mathbf{z})$. When $q_\phi(\mathbf{z} \mid \mathbf{x})$ and $p_\theta(\mathbf{z})$ are both Gaussian, the KL divergence part of Eq. (8) reduces to Eq. (13). We have also written out the corresponding reconstruction part of the loss when $p_\theta(\mathbf{x} \mid \mathbf{z})$ is either Bernoulli or Gaussian. The only thing left is sampling from the model.
The expectation term in the loss function requires generating samples from $\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})$. Sampling is a stochastic process, so we cannot backpropagate gradients through it. To make it trainable, the reparameterization trick is introduced: it is often possible to express the random variable $\mathbf{z}$ as a deterministic variable $\mathbf{z} = \mathcal{F}_\phi(\mathbf{x}, \epsilon)$, where $\epsilon$ is an auxiliary independent random variable and the transformation function $\mathcal{F}_\phi$, parameterized by $\phi$, converts $\epsilon$ to $\mathbf{z}$.
Reparameterization Trick in Lil’Log
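For the diagonal Gaussian $q_\phi(\mathbf{z} \mid \mathbf{x})$ assumed above, the trick takes the familiar form $\mathbf{z} = \pmb{\mu}_\phi(\mathbf{x}) + \pmb{\sigma}_\phi(\mathbf{x}) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, which in code is only a few lines (a sketch, PyTorch assumed):

```python
import torch

def reparameterize(mu, logvar):
    """Draw z ~ q_phi(z | x) = N(mu, sigma^2) in a differentiable way:
    z = mu + sigma * eps, where eps ~ N(0, I) carries all the randomness."""
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)  # auxiliary noise, independent of phi
    return mu + sigma * eps        # gradients flow through mu and sigma
```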
Simple Code Implementation
You can open it in Colab and run the code for free.
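The notebook itself is not reproduced here, but the sketch below shows what a minimal end-to-end implementation might look like, tying together the encoder, the reparameterization trick, the Bernoulli (binary cross entropy) reconstruction term of Eq. (16), and the closed-form KL of Eq. (13). It assumes PyTorch and MNIST; the layer sizes, names, and hyperparameters are illustrative and not necessarily those of the Colab notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

def loss_fn(x, x_hat, mu, logvar):
    # Reconstruction term, Eq. (16): binary cross entropy summed over pixels
    recon = F.binary_cross_entropy(x_hat, x, reduction="none").sum(dim=1)
    # KL term, Eq. (13), in closed form
    kl = 0.5 * torch.sum(mu**2 + logvar.exp() - logvar - 1, dim=1)
    return (recon + kl).mean()

if __name__ == "__main__":
    data = datasets.MNIST("data", train=True, download=True,
                          transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=128, shuffle=True)
    model = VAE()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(5):
        for x, _ in loader:
            x = x.view(x.size(0), -1)  # flatten 28x28 images to 784
            x_hat, mu, logvar = model(x)
            loss = loss_fn(x, x_hat, mu, logvar)
            optim.zero_grad()
            loss.backward()
            optim.step()
        print(f"epoch {epoch}: loss {loss.item():.2f}")
```

After training, new samples can be generated by drawing $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and passing it through `model.dec`, which is exactly the generative process described in the Loss Function section.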
References
[1] Diederik P. Kingma & Max Welling. “Auto-Encoding Variational Bayes.” ICLR 2014.
[2] Mark A. Kramer. “Nonlinear Principal Component Analysis Using Autoassociative Neural Networks.” AIChE Journal 1991.
[3] Geoffrey E. Hinton & Ruslan R. Salakhutdinov. “Reducing the Dimensionality of Data with Neural Networks.” Science 2006.
[4] Lilian Weng. “From Autoencoder to Beta-VAE.” [Blog post] Lil’Log 2018.
[5] Jianlin Su. “变分自编码器(二):从贝叶斯观点出发” (Variational Autoencoder, Part 2: From a Bayesian Perspective). [Blog post] Scientific Spaces 2018.