In this post, I review only the model proposed in the following paper: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets.
Summary
First, let us recall the objective of GAN as follows,
\[\begin{align}\label{eq:gan_objective} \min\limits_{G}\max\limits_{D} V(D,G) = \mathop{\mathbb{E}}\limits_{x \sim P_{\text{data}}} \left[ \log D(x) \right ] + \mathop{\mathbb{E}}\limits_{z \sim \text{noise}} \left[ \log \left( 1 - D(G(z)) \right) \right ], \end{align}\]where \(G\) and \(D\) are the generator and the discriminator respectively.
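To make the minimax game concrete, here is a minimal sketch of the two losses in PyTorch. The generator and discriminator modules `G` and `D` are hypothetical placeholders (not from the paper), and `D` is assumed to output a probability in \((0, 1)\):

```python
import torch

def d_loss(D, G, x_real, z):
    # Ascend V(D, G): maximize log D(x) + log(1 - D(G(z))).
    x_fake = G(z).detach()  # block gradients from flowing into G
    return -(torch.log(D(x_real) + 1e-8).mean()
             + torch.log(1.0 - D(x_fake) + 1e-8).mean())

def g_loss(D, G, z):
    # Descend V(D, G) w.r.t. G: minimize log(1 - D(G(z))).
    return torch.log(1.0 - D(G(z)) + 1e-8).mean()
```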
The authors argue that without any constraints, the learned latent variables \(z\) would be highly entangled and would therefore fail to capture the underlying disentangled factors. The implicit assumption here is that many real-world generative processes are governed by disentangled latent factors that are semantically meaningful (i.e., explainable) and compact (i.e., involve few parameters).
Therefore, in addition to the usual latent variables \(z\) (which are often Gaussian noise), the authors introduce independent latent factors \(\bm{c} = \{c_1, \dots, c_L\}\). The key idea is to add a regularizer that maximizes the mutual information between the latent factors \(\bm{c}\) and the generated samples. Intuitively, this forces \(\bm{c}\) to carry information about the sample \(x\), so that the latent space becomes meaningful. The regularized objective is as follows,
\[\begin{align}\label{eq:mutual_info_reg} \min\limits_{G}\max\limits_{D} V(D,G) - \lambda I(\bm{c}; G(z, \bm{c})), \end{align}\]where the sample \(x = G(z, \bm{c})\) and \(\lambda\) is the weight of the regularization.
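For concreteness, here is a sketch of how the generator input \((z, \bm{c})\) could be sampled, roughly following the MNIST setup in the paper (a 62-dimensional noise vector, one 10-way categorical code, and two continuous codes uniform on \([-1, 1]\)); the exact sizes are the paper's choices for MNIST and would differ for other datasets:

```python
import torch
import torch.nn.functional as F

def sample_latent(batch_size, z_dim=62, n_cat=10, n_cont=2):
    z = torch.randn(batch_size, z_dim)                 # incompressible noise z
    cat = torch.randint(0, n_cat, (batch_size,))       # c1 ~ Cat(K = 10)
    c_cat = F.one_hot(cat, n_cat).float()
    c_cont = torch.rand(batch_size, n_cont) * 2 - 1    # c2, c3 ~ Unif(-1, 1)
    return torch.cat([z, c_cat, c_cont], dim=1), cat, c_cont
```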
To compute the mutual information, one needs the posterior \(P(\bm{c} \vert x)\), which is intractable. The authors therefore resort to a variational approximation, as shown below.
\[\begin{align}\label{eq:mutual_info_vi} I(\bm{c}; G(z, \bm{c})) & = \int \int P(\bm{c}, x) \log \frac{ P(\bm{c}, x) }{ P(\bm{c}) P(x) } \mathrm{d} x \mathrm{d} \bm{c} \nonumber \\ & = \int \int P(x, \bm{c}) \log \frac{ P(\bm{c} \vert x) P(x) }{ P(\bm{c}) P(x) } \mathrm{d} x \mathrm{d} \bm{c} \nonumber \\ & = \int \int P(x, \bm{c}) \log \frac{ P(\bm{c} \vert x) }{ P(\bm{c}) } \mathrm{d} x \mathrm{d} \bm{c} \nonumber \\ & = \int \int P(x, \bm{c}) \log P(\bm{c} \vert x) \mathrm{d} x \mathrm{d} \bm{c} - \int \int P(x, \bm{c}) \log P(\bm{c}) \mathrm{d} x \mathrm{d} \bm{c} \nonumber \\ & = \int \int P(\bm{c} \vert x) P(x) \log P(\bm{c} \vert x) \mathrm{d} x \mathrm{d} \bm{c} + H(\bm{c}) \nonumber \\ & = \int P(x) \int P(\bm{c} \vert x) \log \left( P(\bm{c} \vert x) \frac{ Q(\bm{c} \vert x) }{ Q(\bm{c} \vert x) } \right) \mathrm{d} \bm{c} \mathrm{d} x + H(\bm{c}) \nonumber \\ & = \mathop{\mathbb{E}}\limits_{x \sim P(x)} \left[ \int P(\bm{c} \vert x) \log \frac{ P(\bm{c} \vert x) }{ Q(\bm{c} \vert x) } \mathrm{d} \bm{c} \right] + \mathop{\mathbb{E}}\limits_{x \sim P(x)} \left[ \int P(\bm{c} \vert x) \log Q(\bm{c} \vert x) \mathrm{d} \bm{c} \right] + H(\bm{c}) \nonumber \\ & \ge \mathop{\mathbb{E}}\limits_{x \sim P(x)} \left[ \int P(\bm{c} \vert x) \log Q(\bm{c} \vert x) \mathrm{d} \bm{c} \right] + H(\bm{c}) \nonumber \\ & = \mathop{\mathbb{E}}\limits_{x \sim P(x)} \left[ \mathop{\mathbb{E}}\limits_{\bm{c} \sim P(\bm{c} \vert x)} \left[ \log Q(\bm{c} \vert x) \right] \right] + H(\bm{c}). \end{align}\]The inequality holds because the dropped term is the KL divergence \(D_{\text{KL}}\left(P(\bm{c} \vert x) \,\|\, Q(\bm{c} \vert x)\right)\), which is non-negative. Note that \(x = G(z, \bm{c})\) implicitly defines a distribution \(x \sim P(x \vert \bm{c})\) induced by the generator \(G\), and \(P(x)\) here is the generator's marginal, obtained by marginalizing the joint \(P(x \vert \bm{c}) P(\bm{c})\) over \(\bm{c}\). Do not confuse it with the data distribution \(P_{\text{data}}(x)\).
Up to this point, we have a lower bound on the mutual information. However, it still requires sampling from the posterior \(P(\bm{c} \vert x)\), which is not easy. The authors use the following trick to circumvent the problem,
\[\begin{align}\label{eq:mutual_info_vi_final} \mathop{\mathbb{E}}\limits_{x \sim P(x)} \left[ \mathop{\mathbb{E}}\limits_{\bm{c} \sim P(\bm{c} \vert x)} \left[ \log Q(\bm{c} \vert x) \right] \right] & = \int P(x) \int P(\bm{c} \vert x) \log Q(\bm{c} \vert x) \mathrm{d} \bm{c} \mathrm{d} x \nonumber \\ & = \int \int P(\bm{c}, x) \log Q(\bm{c} \vert x) \mathrm{d} \bm{c} \mathrm{d} x \nonumber \\ & = \int \int P(x \vert \bm{c}) P(\bm{c}) \log Q(\bm{c} \vert x) \mathrm{d} \bm{c} \mathrm{d} x \nonumber \\ & = \mathop{\mathbb{E}}\limits_{\bm{c} \sim P(\bm{c})} \left[ \mathop{\mathbb{E}}\limits_{x \sim P(x \vert \bm{c})} \left[ \log Q(\bm{c} \vert x) \right] \right] \nonumber \\ & = L_{I}(G, Q) - H(\bm{c}). \end{align}\]The final objective of InfoGAN is
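The last line suggests a simple Monte Carlo estimator of \(L_I(G, Q)\): sample \(\bm{c} \sim P(\bm{c})\), generate \(x = G(z, \bm{c})\), and score \(\log Q(\bm{c} \vert x)\). Below is a sketch, assuming a hypothetical `Q` network that returns categorical logits and continuous means; modeling the continuous codes as Gaussians follows the paper, though fixing unit variance is a simplification:

```python
import torch
import torch.nn.functional as F

def mutual_info_loss(G, Q, z, cat, c_cont, lam=1.0):
    x_fake = G(z, cat, c_cont)
    q_logits, q_mean = Q(x_fake)  # heads for categorical / continuous codes
    # log Q(c | x) for the categorical code (negative cross-entropy)
    ll_cat = -F.cross_entropy(q_logits, cat)
    # log Q(c | x) for continuous codes under a unit-variance Gaussian,
    # dropping the additive constant
    ll_cont = -0.5 * ((c_cont - q_mean) ** 2).sum(dim=1).mean()
    return -lam * (ll_cat + ll_cont)  # minimized jointly w.r.t. G and Q
```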
\[\begin{align}\label{eq:info_gan} \min\limits_{G, Q}\max\limits_{D} V_{\text{InfoGAN}}(D,G,Q) = V(D,G) - \lambda L_{I}(G, Q). \end{align}\]

Achievements
- The overall framework is well-motivated and technically sound. Learning disentangled latent representations is one of the core problems in unsupervised learning, since it can improve explainability and make models more compact.
- At the implementation level, the modification to GAN is minor: the approximate posterior \(Q(\bm{c} \vert x)\) can be constructed by adding one fully-connected layer on top of the discriminator \(D\), so the computational cost is almost the same as for a regular GAN (see the sketch after this list). Also, since the prior \(P(\bm{c})\) is fixed, \(H(\bm{c})\) is a constant and can be dropped from the objective.
- The connection with the wake-sleep algorithm is very interesting.
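To illustrate the shared-computation point above, here is a minimal sketch (not the paper's exact architecture; all layer sizes are illustrative) in which \(Q\) reuses the discriminator's trunk and only adds a small extra head:

```python
import torch.nn as nn

class DQ(nn.Module):
    """Discriminator and Q sharing one feature extractor."""
    def __init__(self, in_dim=784, hid=1024, n_cat=10, n_cont=2):
        super().__init__()
        self.n_cat = n_cat
        self.trunk = nn.Sequential(              # shared by D and Q
            nn.Flatten(), nn.Linear(in_dim, hid), nn.LeakyReLU(0.1))
        self.d_head = nn.Linear(hid, 1)          # real/fake logit for D
        self.q_head = nn.Linear(hid, n_cat + n_cont)  # parameters of Q(c | x)

    def forward(self, x):
        h = self.trunk(x)
        q = self.q_head(h)
        # discriminator logit, categorical logits, continuous means
        return self.d_head(h), q[:, :self.n_cat], q[:, self.n_cat:]
```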
Questions
- The notation in the original paper is confusing. \(G(z, \bm{c})\) should be a sample drawn from the distribution \(P(x \vert \bm{c})\) (or, more precisely, \(P(x \vert z, \bm{c})\)), i.e., obtained by passing the latent noise through the generator network. However, the paper treats \(G(z, \bm{c})\) as the distribution itself.
- Lemma 5.1 in the original paper seems unnecessary for reaching the final trick in Eq. (4).