In this post, I will review a popular class of generative models called variational autoencoders (VAEs). Since the literature on this topic is vast, I will only focus on presenting a few points that I think are important. In the future, I will write a few more blog posts to explain some of the representative works in detail.
Basics
First, given some data \(X\), we would like to learn a useful representation \(Z\), which is typically assumed to be more compact (i.e., to have a lower dimension) than \(X\). In an unsupervised learning setting, a popular choice of objective is that the representation \(Z\) should be able to reconstruct the original data, which results in an autoencoder. In other words, we want to learn deterministic functions so that \(Z = f_{\text{encoder}}(X)\) and \(X = f_{\text{decoder}}(Z)\).
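To make this concrete, here is a minimal sketch of such a deterministic autoencoder in PyTorch (the framework and the layer sizes are my own illustrative choices, not something this post prescribes):

```python
import torch.nn as nn

# A minimal deterministic autoencoder: Z = f_encoder(X), X_hat = f_decoder(Z).
# The dimensions (784 -> 32) are arbitrary illustrative choices.
class AutoEncoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.f_encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
        self.f_decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def forward(self, x):
        z = self.f_encoder(x)        # compact representation Z
        x_hat = self.f_decoder(z)    # reconstruction of the original data X
        return x_hat, z
```

Training then amounts to minimizing a reconstruction loss such as the mean squared error between `x_hat` and `x`.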
VAE and its ancestor, the Helmholtz machine, turn the above idea into a probabilistic setting. In particular, we treat the representation \(Z\) as a latent variable, introduce its prior \(p_{\theta}(Z)\), and build a probabilistic encoder \(q_{\phi}(Z \vert X)\) and a probabilistic decoder \(p_{\theta}(X \vert Z)\). Here we concatenate the parameters of the prior and the decoder into a single vector \(\theta\).
Maximum Likelihood
To learn the model parameters \(\phi, \theta\), the first thing that comes to mind is to maximize the log likelihood of our observed data \(X\),
\[\begin{align}\label{eq:max_likelihood} \max\limits_{\theta} \log p_{\theta}(X) = \log \int p_{\theta}(X \vert Z) p_{\theta}(Z) \mathrm{d} Z. \end{align}\]As you can see, this objective has nothing to do with \(\phi\). Indeed, if we could deal with this integral tractably, there would be no need to introduce an encoder. Many Bayesian statistical models follow this flavor and try to come up with tractable approximations to the integral. Variational inference is one important class of such approximations, and it is also the reason why the encoder comes into play.
Evidence Lower Bound
In particular, we formulate a lower bound of the log likelihood as follows (there are other ways to arrive at this bound),
\[\begin{align}\label{eq:elbo} \log p_{\theta}(X) & = \log \int p_{\theta}(X \vert Z) p_{\theta}(Z) \mathrm{d} Z \nonumber \\ & = \log \int \frac{p_{\theta}(X \vert Z)p_{\theta}(Z)}{q_{\phi}(Z \vert X)} q_{\phi}(Z \vert X) \mathrm{d} Z \nonumber \\ & \ge \int q_{\phi}(Z \vert X) \log\left( \frac{p_{\theta}(X \vert Z)p_{\theta}(Z)}{q_{\phi}(Z \vert X)} \right) \mathrm{d} Z \nonumber \\ & = \mathbb{E}_{q_{\phi}(Z \vert X)} \left[ \log p_{\theta}(X \vert Z) \right] - \text{KL} \left( q_{\phi}(Z \vert X) \Vert p_{\theta}(Z) \right) \nonumber \\ & = - \left( \text{ReconstructionLoss}\left( q_{\phi}(Z \vert X), p_{\theta}(X \vert Z) \right) + \text{KL} \left( q_{\phi}(Z \vert X) \Vert p_{\theta}(Z) \right) \right), \end{align}\]where we let the encoder kick in in the 2nd line and invoke Jensen's inequality in the 3rd line. The objective in Eq. (2) is often called the evidence lower bound (ELBO). Now the question is: why is the ELBO more tractable than the likelihood? To see this, we only need to consider a non-conjugate family where the integral in Eq. (1) is intractable but we can still evaluate the ELBO (or at least its Monte Carlo estimate). Now that we have a lower bound, we only need to maximize it to learn the model parameters. You might have noticed that the ELBO holds for any single data point \(X\). Since we typically assume the data are i.i.d. samples from \(p_{\text{data}}(X)\), we can sum both sides over all data points to get a bound for the whole dataset.
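To make the last line concrete, here is a minimal sketch (in PyTorch, which this post does not prescribe) of a single-sample Monte Carlo estimate of the negative ELBO. It assumes a diagonal Gaussian encoder returning \((\mu, \log\sigma^{2})\), a standard normal prior, and a Bernoulli decoder; the `encoder`/`decoder` networks are hypothetical placeholders, and the sample from \(q_{\phi}(Z \vert X)\) is reparameterized so that gradients can flow through it:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, encoder, decoder):
    """Single-sample Monte Carlo estimate of -ELBO = ReconstructionLoss + KL."""
    mu, log_var = encoder(x)                      # parameters of q_phi(Z | X)
    eps = torch.randn_like(mu)
    z = mu + (0.5 * log_var).exp() * eps          # reparameterized sample z ~ q_phi(Z | x)
    logits = decoder(z)                           # parameters of p_theta(X | Z)
    # -E_q[log p_theta(X | Z)], estimated with the single sample (Bernoulli likelihood here)
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(q_phi(Z | X) || N(0, I)), available in closed form for a diagonal Gaussian posterior
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum()
    return recon + kl
```

Minimizing this quantity over \(\phi\) and \(\theta\) is exactly maximizing the ELBO.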
In VAEs, we construct the encoder (the approximate posterior) via a neural network which is shared across all data points in the dataset (which is why the inference is called amortized inference). Inference is then merely an evaluation (a forward pass) of the encoder. Alternatively, one can have one \(q_{\phi} (Z \vert X)\) per data point and directly optimize the ELBO w.r.t. each such \(q_{\phi} (Z \vert X)\) in the mini-batch (note that in this case we have a double-loop optimization, where the outer loop is for learning and the inner loop is for inference). The latter is arguably more flexible but comes with a higher computational cost (this approach is called stochastic variational inference (SVI)).
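The sketch below illustrates the non-amortized (SVI-style) inner loop for a single data point; the variational parameters belong to this data point only, the Bernoulli decoder likelihood is an illustrative assumption, and the outer loop updating \(\theta\) is not shown:

```python
import torch
import torch.nn.functional as F

def infer_single_point(x, decoder, z_dim, n_steps=100, lr=1e-2):
    # Per-data-point variational parameters of a diagonal Gaussian q(Z | x).
    mu = torch.zeros(z_dim, requires_grad=True)
    log_var = torch.zeros(z_dim, requires_grad=True)
    opt = torch.optim.Adam([mu, log_var], lr=lr)
    for _ in range(n_steps):                                  # inner loop: inference for this x
        z = mu + (0.5 * log_var).exp() * torch.randn(z_dim)   # reparameterized sample
        recon = F.binary_cross_entropy_with_logits(decoder(z), x, reduction="sum")
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum()
        loss = recon + kl                                     # -ELBO for this data point
        opt.zero_grad()
        loss.backward()  # (a careful version would freeze the decoder parameters here)
        opt.step()
    return mu.detach(), log_var.detach()
```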
Practices
In practice, people often set \(p(Z) = \mathcal{N}(0, I)\) (i.e., there are no parameters to learn for the prior) and \(p(X \vert Z) = \mathcal{N}(\mu, \sigma^{2}I)\). For each sample \(z \sim p(Z)\), the decoder distribution parameters are \(\mu = f_{\theta}(z)\) and \(\sigma = g_{\theta}(z)\), where \(f\) and \(g\) are neural networks. The rationale behind this is that the marginal data distribution of our generative model, \(p_{\text{model}}(X) = \int p(X \vert Z) p(Z) \mathrm{d}Z\), now behaves like an infinite mixture model, since we can sample infinitely many \(z\) and generate a corresponding output distribution \(p(X \vert Z = z)\) for each of them.
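Under these choices, generating new data is just ancestral sampling, and the short sketch below (with the networks \(f_{\theta}\) and \(g_{\theta}\) from above passed in as callables) makes the infinite-mixture view explicit: each draw of \(z\) indexes its own Gaussian component \(p(X \vert Z = z)\):

```python
import torch

def sample_from_model(f_theta, g_theta, z_dim, n_samples=16):
    z = torch.randn(n_samples, z_dim)       # sample from the fixed prior p(Z) = N(0, I)
    mu, sigma = f_theta(z), g_theta(z)      # each z picks out one Gaussian "mixture component"
    x = mu + sigma * torch.randn_like(mu)   # sample from p(X | Z = z) = N(mu, sigma^2 I)
    return x
```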
Also, people often choose a factorized form of the posterior \(q_{\phi}(Z \vert X)\), so that it is easy to draw samples from (and therefore easier to compute the reconstruction loss) and the KL divergence has an analytical form.
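For example, with a factorized Gaussian posterior \(q_{\phi}(Z \vert X) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^{2}))\) and the standard normal prior above, the KL term has the well-known closed form

\[\text{KL}\left( \mathcal{N}(\mu, \operatorname{diag}(\sigma^{2})) \,\Vert\, \mathcal{N}(0, I) \right) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_{j}^{2} + \sigma_{j}^{2} - 1 - \log \sigma_{j}^{2} \right),\]

where \(d\) is the dimension of \(Z\), so only the reconstruction term needs Monte Carlo estimation.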
Advancements
Everything looks good so far. VAEs have indeed achieved great success in modeling various kinds of data. However, they are also known to have some issues, as discussed below.
What’s Wrong with ELBO?
The first place where a VAE could go wrong is the objective. Does maximizing the ELBO really lead to what we want? Let us first revisit our objective. Given observed data from \(p_{\text{data}}(X)\), we would like to maximize the likelihood as below,
\[\begin{align}\label{eq:objective} \max_{\theta} \mathbb{E}_{p_{\text{data}}(X)} \left[ \log p_{\theta}(X) \right] \iff \min_{\theta} \text{KL} \left( p_{\text{data}}(X) \Vert p_{\theta}(X) \right). \end{align}\]Here we introduce \(p_{\text{data}}(X)\) to highlight the fact that, when we empirically maximize the log likelihood of the observed dataset, we are minimizing the KL divergence between the data distribution \(p_{\text{data}}(X)\) and our model distribution \(p_{\theta}(X)\).
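To see the equivalence, note that the expected log likelihood differs from the negative KL divergence only by the entropy of the data distribution, which does not depend on \(\theta\):

\[\mathbb{E}_{p_{\text{data}}(X)} \left[ \log p_{\theta}(X) \right] = -\text{KL}\left( p_{\text{data}}(X) \Vert p_{\theta}(X) \right) - \mathcal{H}\left( p_{\text{data}}(X) \right).\]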
Then we massage the ELBO a bit as below, \[\begin{align}\label{eq:elbo_2} \mathbb{E}_{p_{\text{data}}(X)} & \left[ \log p_{\theta}(X) \right] \ge \mathbb{E}_{p_{\text{data}}(X)} \left[ \text{ELBO} \right] \nonumber \\ & = \mathbb{E}_{p_{\text{data}}(X)} \left[ \mathbb{E}_{q_{\phi}(Z \vert X)} \left[ \log p_{\theta}(X \vert Z) \right] \right] - \mathbb{E}_{p_{\text{data}}(X)} \left[ \text{KL} \left( q_{\phi}(Z \vert X) \Vert p_{\theta}(Z) \right) \right]. \end{align}\]
We can further decompose the 2nd term as below, \[\begin{align}\label{eq:KL_decompose} \mathbb{E}_{p_{\text{data}}(X)} & \left[ \text{KL} \left( q_{\phi}(Z \vert X) \Vert p_{\theta}(Z) \right) \right] = \iint p_{\text{data}}(X) q_{\phi}(Z \vert X) \log \frac{q_{\phi}(Z \vert X)}{p_{\theta}(Z)} \mathrm{d}Z \mathrm{d}X \nonumber \\ & = \iint p_{\text{data}}(X) q_{\phi}(Z \vert X) \log \frac{q_{\phi}(Z \vert X) p_{\text{data}}(X) q(Z)}{p_{\theta}(Z) p_{\text{data}}(X) q(Z)} \mathrm{d}Z \mathrm{d}X \nonumber \\ & = \iint p_{\text{data}}(X) q_{\phi}(Z \vert X) \log \frac{q_{\phi}(Z \vert X) p_{\text{data}}(X) }{p_{\text{data}}(X) q(Z)} \mathrm{d}Z \mathrm{d}X \nonumber \\ & + \iint p_{\text{data}}(X) q_{\phi}(Z \vert X) \log \frac{q(Z)}{p_{\theta}(Z)} \mathrm{d}Z \mathrm{d}X \nonumber \\ & = I(X ; Z) + \text{KL} \left( q(Z) \Vert p_{\theta}(Z) \right), \end{align}\]
where \(q(Z) = \int p_{\text{data}}(X) q_{\phi}(Z \vert X) \mathrm{d}X\) (sometimes called the aggregated posterior) and \(I(X ; Z)\) is the mutual information between \(X\) and \(Z\) defined under the joint distribution \(p_{\text{data}}(X) q_{\phi}(Z \vert X)\).
Now you can see that when we maximize the ELBO we also minimize the mutual information \(I(X ; Z)\), which pushes the latent variables to contain less information about \(X\). Therefore, our encoder tends to ignore the useful information in \(X\) and generate latent codes \(Z\) that are independent of \(X\). Moreover, with such an independent encoder, one can easily make \(\text{KL} \left( q(Z) \Vert p_{\theta}(Z) \right) = 0\).
On the other hand, if we happen to have a powerful decoder such as an autoregressive model, then maximizing the ELBO w.r.t. the decoder (i.e., minimizing the reconstruction loss) could push the decoder to ignore the latent variables sampled from the encoder and learn to generate good samples on its own. As you can see, in this case maximizing the ELBO could lead to possibly perfect sample generation but poor representations. Therefore, the ELBO does not inherently favor a balance between good representation learning and good sample quality.
This phenomenon, i.e., the generative model learning to ignore a subset of the latent variables, is called posterior collapse (or sometimes the information preference property). As we have discussed, both the high-capacity decoder and the deficiency of the ELBO are blamed for causing this phenomenon. But at its heart, the essential problem is:
Latent variable modeling is generally ill-conditioned (i.e., there are many possible latent variable models that could explain the observed data equally well) and is therefore hard to optimize.
Therefore, some sort of regularization is generally needed to deal with such ill-conditioned problems.
One might ask: can we just discard the mutual information term and optimize the remaining part of the ELBO? There are a few issues with this proposal. First, the resulting objective is no longer a valid lower bound, due to the fact that the mutual information is nonnegative. Second, it is not easy to compute \(q(Z)\). But this proposal does inspire a few works in the literature; e.g., InfoVAE discards the mutual information term and uses a sample-based divergence in place of the intractable KL divergence.
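To give a flavor of such a sample-based divergence, here is a small sketch of a (biased) estimator of the squared maximum mean discrepancy (MMD) between samples from the aggregated posterior \(q(Z)\) and samples from the prior, using an RBF kernel; the kernel choice and its bandwidth are illustrative assumptions, not what any specific paper prescribes:

```python
import torch

def rbf_kernel(a, b, bandwidth=1.0):
    sq_dist = torch.cdist(a, b) ** 2                # pairwise squared distances
    return torch.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(z_q, z_p, bandwidth=1.0):
    # z_q: samples z ~ q_phi(Z | x) over a mini-batch (approximating q(Z)); z_p: samples from p(Z)
    k_qq = rbf_kernel(z_q, z_q, bandwidth).mean()
    k_pp = rbf_kernel(z_p, z_p, bandwidth).mean()
    k_qp = rbf_kernel(z_q, z_p, bandwidth).mean()
    return k_qq + k_pp - 2.0 * k_qp                 # biased estimate of MMD^2(q(Z), p(Z))
```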
There are other works which try to add various regularizations. For example, \(\beta\)-VAE re-weights the KL term in the ELBO by a coefficient \(\beta\) (which can also be annealed during training). Variational Lossy Autoencoder tries to weaken the decoder in a particular way (e.g., reducing the receptive field size so that the decoder is incapable of capturing global image properties) so that the encoder is forced to learn a (lossy) representation with the desired property (e.g., modeling the global image structure).
WAE drops the mutual information term and proposes to use adversarial training or maximum mean discrepancy in place of the KL divergence term in Eq. (5). FactorVAE proposes to add a total correlation penalty, i.e., \(\text{KL} \left( q(Z) \Vert \prod_{i=1}^{d} q(Z_i) \right)\), to encourage dimension-wise disentanglement of the latent variables. Controlled Capacity Beta VAE provides an information-theoretic perspective on VAEs by viewing the encoder as a set of independent additive white Gaussian noise channels. Under this view, the KL term in the ELBO can be seen as an upper bound on the amount of information that can be transmitted through the latent channels per data sample. Furthermore, they propose to control the capacity of the channels (i.e., adding an equality constraint on the KL and turning it into a regularization term) so that the posterior does not collapse.
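To make one of these regularized objectives concrete, here is a sketch of the modified loss in the spirit of \(\beta\)-VAE and its controlled-capacity variant (the coefficient values are arbitrary illustrative defaults, and `recon_loss`/`kl` are assumed to be tensors computed as in the earlier ELBO sketch):

```python
def regularized_loss(recon_loss, kl, beta=4.0, capacity=None):
    # recon_loss: -E_q[log p_theta(X | Z)];  kl: KL(q_phi(Z | X) || p(Z))
    if capacity is None:
        return recon_loss + beta * kl          # beta-VAE style: re-weight the KL term
    # Controlled-capacity variant: pull the KL toward a target capacity C (increased during training)
    return recon_loss + beta * abs(kl - capacity)
```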
Amortization vs. Approximation
Let us revisit the ELBO (another way of deriving it),
\[\begin{align}\label{eq:likelihood_decompose} \log p_{\theta}(X) & = \int q_{\phi}(Z \vert X) \log p_{\theta}(X) \mathrm{d} Z \nonumber \\ & = \int q_{\phi}(Z \vert X) \log \frac{p_{\theta}(X, Z)}{p_{\theta}(Z \vert X)} \mathrm{d} Z \nonumber \\ & = \int q_{\phi}(Z \vert X) \log \frac{p_{\theta}(X, Z)}{q_{\phi}(Z \vert X)} \frac{q_{\phi}(Z \vert X)}{p_{\theta}(Z \vert X)} \mathrm{d} Z \nonumber \\ & = \underbrace{ \int q_{\phi}(Z \vert X) \log \frac{p_{\theta}(X, Z)}{q_{\phi}(Z \vert X)} \mathrm{d} Z }_{\text{ELBO}} + \underbrace{ \int q_{\phi}(Z \vert X) \log \frac{q_{\phi}(Z \vert X)}{p_{\theta}(Z \vert X)} \mathrm{d} Z }_{\text{KL}\left( q_{\phi}(Z \vert X) \Vert p_{\theta}(Z \vert X) \right)}. \end{align}\]Since \(\text{KL} \left( q_{\phi}(Z \vert X) \Vert p_{\theta}(Z \vert X) \right)\) is nonnegative, we know that the ELBO is a lower bound. More importantly, we know that the ELBO achieves its maximum when \(q_{\phi}(Z \vert X) = p_{\theta}(Z \vert X)\), since the KL term then vanishes.
Given a specific family of approximate posteriors \(q_{\phi}(Z \vert X)\), the best achievable ELBO can be obtained via standard stochastic variational inference. Let us denote the ELBO by \(\mathcal{L}(q_{\phi})\) and the best achievable ELBO by \(\mathcal{L}(q_{\phi}^{\ast})\). \(\text{KL} \left( q_{\phi}^{\ast}(Z \vert X) \Vert p_{\theta}(Z \vert X) \right)\) is sometimes called the approximation gap, since it is the gap between the log likelihood and the best achievable ELBO. Since we perform amortized inference in VAEs, there is also a gap between the best achievable ELBO and the amortized one (sometimes called the amortization gap). Formally, we have
\[\begin{align}\label{eq:likelihood_gap} & \log p_{\theta}(X) - \mathcal{L}(q_{\phi}) = \underbrace{ \log p_{\theta}(X) - \mathcal{L}(q_{\phi}^{\ast}) }_{\text{Approximation Gap}} + \underbrace{ \mathcal{L}(q_{\phi}^{\ast}) - \mathcal{L}(q_{\phi}) }_{\text{Amortization Gap}} \nonumber \\ & = \underbrace{ \text{KL} \left( q_{\phi}^{\ast}(Z \vert X) \Vert p_{\theta}(Z \vert X) \right) }_{\text{Approximation Gap}} + \underbrace{ \text{KL} \left( q_{\phi}(Z \vert X) \Vert p_{\theta}(Z \vert X) \right) - \text{KL} \left( q_{\phi}^{\ast}(Z \vert X) \Vert p_{\theta}(Z \vert X) \right) }_{\text{Amortization Gap}}. \end{align}\]Some empirical studies show that:
- The amortization gap is more significant than the approximation gap in practice.
- Increasing the expressiveness of the encoder, e.g., using normalizing flows instead of a Gaussian, not only reduces the approximation gap but also reduces the amortization gap.
- The decoder accommodates the choice of the encoder family to reduce the approximation gap (e.g., the true posterior \(p(Z \vert X)\) tends to be closer to a Gaussian when you use Gaussian encoders).
- Increasing the expressiveness of the decoder makes the decoder more prone to overfitting and makes the encoder fit better.
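One way to probe the amortization gap empirically is to reuse the per-data-point refinement from the SVI discussion above: initialize the variational parameters from the amortized encoder, keep optimizing them for that data point alone, and compare the two ELBOs. The sketch below assumes the same hypothetical diagonal Gaussian encoder and Bernoulli decoder as the earlier snippets; in practice one would average the single-sample estimates over many samples and data points:

```python
import torch
import torch.nn.functional as F

def estimate_amortization_gap(x, encoder, decoder, n_steps=200, lr=1e-2):
    def neg_elbo(mu, log_var):
        z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)   # single-sample estimate
        recon = F.binary_cross_entropy_with_logits(decoder(z), x, reduction="sum")
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum()
        return recon + kl

    mu0, log_var0 = encoder(x)                                  # amortized variational parameters
    mu = mu0.detach().clone().requires_grad_(True)
    log_var = log_var0.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([mu, log_var], lr=lr)
    for _ in range(n_steps):                                    # refine q for this data point alone
        loss = neg_elbo(mu, log_var)
        opt.zero_grad()
        loss.backward()  # (a careful version would freeze the decoder parameters here)
        opt.step()
    with torch.no_grad():
        gap = neg_elbo(mu0, log_var0) - neg_elbo(mu, log_var)   # approx. L(q*) - L(q_phi) >= 0
    return gap.item()
```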
Others
Besides the above issues, people have tried various other ways to improve VAEs.
Stronger Encoder What if our encoder (or approximate posterior) is just not expressive enough? As you can see in Eq. (7), the approximation gap decreases if we have a more expressive encoder. People have proposed various alternatives to the simple Gaussian, e.g., Inverse Autoregressive Flow and the Auxiliary Deep Generative Model.
Stronger Prior In practice, the priors of VAEs are typically set to a fixed Gaussian. Learning the prior would increase the expressiveness of the model. In particular, from Eq. (5), we can see that if we optimize the ELBO w.r.t. the prior, then the optimal prior is just the aggregated posterior. However, as we have discussed, the aggregated posterior is hard to compute and even harder to optimize directly. Therefore, VampPrior advocates using a small mixture of the encoder's posteriors defined on learnable pseudo-inputs to approximate the optimal prior (i.e., the aggregated posterior). An autoregressive flow has also been used as a prior in the Variational Lossy Autoencoder.
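As a rough illustration of the VampPrior idea (a sketch, not the authors' implementation), the prior below is a mixture of the amortized posteriors evaluated at \(K\) learnable pseudo-inputs, \(p(Z) = \frac{1}{K}\sum_{k} q_{\phi}(Z \vert u_{k})\); the `encoder` returning \((\mu, \log\sigma^{2})\) is the same hypothetical interface as in the earlier sketches:

```python
import math
import torch
import torch.nn as nn

class VampPriorSketch(nn.Module):
    def __init__(self, encoder, n_pseudo, x_dim):
        super().__init__()
        self.encoder = encoder                                        # shared amortized encoder q_phi(Z | X)
        self.pseudo_inputs = nn.Parameter(0.01 * torch.randn(n_pseudo, x_dim))  # learnable u_k

    def log_prob(self, z):                                            # z: (batch, z_dim)
        mu, log_var = self.encoder(self.pseudo_inputs)                # (K, z_dim) each
        z = z.unsqueeze(1)                                            # (batch, 1, z_dim)
        # diagonal Gaussian log-density of z under each mixture component, summed over dimensions
        log_comp = -0.5 * (math.log(2 * math.pi) + log_var
                           + (z - mu) ** 2 / log_var.exp()).sum(-1)   # (batch, K)
        return torch.logsumexp(log_comp, dim=1) - math.log(log_comp.shape[1])
```

Since the KL term to such a prior no longer has a closed form, one would estimate \(\mathbb{E}_{q_{\phi}(Z \vert X)}[\log q_{\phi}(Z \vert X) - \log p(Z)]\) with Monte Carlo samples.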
Other tricks, like annealing the entropy of the encoder, seem to help as well.