In this post, I will explain noise contrastive estimation (NCE).
Suppose we have observed samples from an unknown data distribution $p_{0}(\mathbf{x})$ and want to learn an energy-based model (EBM) of the form
\[\begin{align}\label{eq:ebm} p_{\theta}(\mathbf{x}) = \frac{1}{Z(\theta)} \exp \left( - E(\mathbf{x}, \theta) \right). \end{align}\]$E$ is the energy function parameterized by $\theta$, and $Z(\theta)$ is the normalization constant (or partition function), which is hard, if not impossible, to estimate. If you want to learn the model directly (i.e., find an optimal $\theta$) via maximum likelihood, then you have to deal with $Z(\theta)$ properly, since it is a function of $\theta$. Previously, we talked about learning methods that do not require the exact $Z(\theta)$, including Score Matching, Contrastive Divergence (CD), and Pseudo-Likelihood.
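To make this concrete, here is a minimal sketch in PyTorch. The MLP energy function is my own illustrative assumption, not something prescribed by the EBM formulation; the key point is that the model only ever exposes the unnormalized log density $-E(\mathbf{x}, \theta)$, because $\log Z(\theta)$ is intractable.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """A toy energy function E(x, theta); the architecture is an assumption."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # E(x, theta), shape x.shape[:-1]
        return self.net(x).squeeze(-1)

def unnormalized_log_prob(energy_net, x):
    # log p_theta(x) = -E(x, theta) - log Z(theta); only the first term is
    # computable, since log Z(theta) is an intractable integral over x.
    return -energy_net(x)
```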
Noise Contrastive Estimation
We first define a noise distribution $p_{n}(\mathbf{x}) = \int p_{n}(\mathbf{x} \vert \mathbf{x}^{\prime}) p_{\theta}(\mathbf{x}^{\prime}) \mathrm{d}\mathbf{x}^{\prime}$; that is, noise samples are generated by drawing $\mathbf{x}^{\prime}$ from the model and perturbing it with a conditional kernel $p_{n}(\mathbf{x} \vert \mathbf{x}^{\prime})$.
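As a sketch, assume a Gaussian kernel $p_{n}(\mathbf{x} \vert \mathbf{x}^{\prime}) = \mathcal{N}(\mathbf{x}; \mathbf{x}^{\prime}, \sigma_{n}^{2} I)$ and approximate model samples from a short unadjusted Langevin chain; both are my own illustrative choices, not requirements of the derivation.

```python
def langevin_samples(energy_net, n, dim=2, steps=100, step_size=0.01):
    # Approximate draws x' ~ p_theta via unadjusted Langevin dynamics:
    #   x <- x - eps * grad E(x) + sqrt(2 * eps) * z,  z ~ N(0, I).
    # grad log p_theta(x) = -grad E(x), so Z(theta) never appears.
    x = torch.randn(n, dim)
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad_E = torch.autograd.grad(energy_net(x).sum(), x)[0]
        x = x - step_size * grad_E + (2 * step_size) ** 0.5 * torch.randn_like(x)
    return x.detach()

def sample_noise(model_samples, noise_std=0.1):
    # x ~ p_n: perturb model samples with the assumed Gaussian kernel
    # p_n(x | x') = N(x; x', noise_std^2 I).
    return model_samples + noise_std * torch.randn_like(model_samples)
```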
The log density ratio between the model and noise distributions is
\[\begin{align}\label{eq:log_density_ratio} g(\mathbf{x}, \theta) & = \log p_{\theta}(\mathbf{x}) - \log p_{n}(\mathbf{x}) \nonumber \\ & = -E(\mathbf{x}, \theta) -\log Z(\theta) - \log \left( \int p_{n}(\mathbf{x} \vert \mathbf{x}^{\prime}) \frac{\exp \left( - E(\mathbf{x}^{\prime}, \theta) \right)}{Z(\theta)} \mathrm{d}\mathbf{x}^{\prime} \right) \nonumber \\ & = -E(\mathbf{x}, \theta) - \log \left( \int p_{n}(\mathbf{x} \vert \mathbf{x}^{\prime}) \exp \left( - E(\mathbf{x}^{\prime}, \theta) \right) \mathrm{d}\mathbf{x}^{\prime} \right) \end{align}\]Note that the intractable $Z(\theta)$ cancels out in the last step; this is exactly what makes this choice of noise distribution convenient.
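The remaining integral is still intractable, but it is straightforward to estimate by Monte Carlo. The sketch below again assumes the symmetric Gaussian kernel, for which $\int p_{n}(\mathbf{x} \vert \mathbf{x}^{\prime}) \exp \left( - E(\mathbf{x}^{\prime}, \theta) \right) \mathrm{d}\mathbf{x}^{\prime} = \mathbb{E}_{\mathbf{x}^{\prime} \sim \mathcal{N}(\mathbf{x}, \sigma_{n}^{2} I)} \left[ \exp \left( - E(\mathbf{x}^{\prime}, \theta) \right) \right]$, so we can average over perturbed copies of $\mathbf{x}$ (using `logsumexp` for numerical stability).

```python
import math

def log_ratio_g(energy_net, x, noise_std=0.1, n_mc=128):
    # Monte Carlo estimate of
    #   g(x, theta) = -E(x, theta) - log \int p_n(x|x') exp(-E(x', theta)) dx'.
    # The symmetric Gaussian kernel turns the integral into an expectation
    # over x' ~ N(x, noise_std^2 I), estimated here with n_mc samples.
    xp = x.unsqueeze(0) + noise_std * torch.randn(n_mc, *x.shape)  # (K, batch, dim)
    log_integral = torch.logsumexp(-energy_net(xp), dim=0) - math.log(n_mc)
    return -energy_net(x) - log_integral
```

We then learn $\theta$ by maximizing the following objective,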
\[\begin{align}\label{eq:nce_objective} J(\mathbf{x}, \theta) = \log \sigma \left( g(\mathbf{x}, \theta) \right) - \log \left( 1 - \sigma \left( g(\mathbf{x}, \theta) \right) \right) \nonumber \end{align}\]where $\sigma(\cdot)$ is the sigmoid function. Since $\log \sigma(g) - \log \left( 1 - \sigma(g) \right) = g$ for the sigmoid, maximizing $J$ amounts to maximizing the log density ratio $g$ itself, which the gradient computation below confirms.
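Numerically the objective is a one-liner; a sanity-check implementation using PyTorch's `logsigmoid` (and the identity $\log(1 - \sigma(g)) = \log \sigma(-g)$):

```python
import torch.nn.functional as F

def objective_J(g):
    # J = log sigma(g) - log(1 - sigma(g)) = logsigmoid(g) - logsigmoid(-g),
    # which collapses to g exactly (up to floating-point rounding), e.g.
    # objective_J(torch.tensor(2.0)) == 2.0.
    return F.logsigmoid(g) - F.logsigmoid(-g)
```

The gradient of the objective is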
\[\begin{align}\label{eq:nce_grad} \nabla_{\theta} J(\mathbf{x}, \theta) & = \left( \frac{1}{\sigma \left( g(\mathbf{x}, \theta) \right)} + \frac{1}{1 - \sigma \left( g(\mathbf{x}, \theta) \right)} \right) \sigma \left( g(\mathbf{x}, \theta) \right) \left( 1 - \sigma \left( g(\mathbf{x}, \theta) \right) \right) \nabla_{\theta} g(\mathbf{x}, \theta) \nonumber \\ & = \nabla_{\theta} g(\mathbf{x}, \theta) \nonumber \\ & = - \nabla_{\theta} E(\mathbf{x}, \theta) + \frac{ \int p_{n}(\mathbf{x} \vert \mathbf{x}^{\prime}) \exp \left( - E(\mathbf{x}^{\prime}, \theta) \right) \nabla_{\theta} E(\mathbf{x}^{\prime}, \theta) \mathrm{d}\mathbf{x}^{\prime} }{ \int p_{n}(\mathbf{x} \vert \mathbf{x}^{\prime}) \exp \left( - E(\mathbf{x}^{\prime}, \theta) \right) \mathrm{d}\mathbf{x}^{\prime} } \nonumber \\ & = - \nabla_{\theta} E(\mathbf{x}, \theta) + \frac{ \int p_{n}(\mathbf{x}, \mathbf{x}^{\prime}) \nabla_{\theta} E(\mathbf{x}^{\prime}, \theta) \mathrm{d}\mathbf{x}^{\prime} }{ p_{n}(\mathbf{x}) } \nonumber \\ & = - \nabla_{\theta} E(\mathbf{x}, \theta) + \int p_{n}(\mathbf{x}^{\prime} \vert \mathbf{x}) \nabla_{\theta} E(\mathbf{x}^{\prime}, \theta) \mathrm{d}\mathbf{x}^{\prime} \nonumber \\ & = - \nabla_{\theta} E(\mathbf{x}, \theta) + \mathbb{E}_{p_{n}(\mathbf{x}^{\prime} \vert \mathbf{x})} \left[ \nabla_{\theta} E(\mathbf{x}^{\prime}, \theta) \right] \end{align}\]The fourth equality divides the numerator and denominator by $Z(\theta)$ and uses the joint density $p_{n}(\mathbf{x}, \mathbf{x}^{\prime}) = p_{n}(\mathbf{x} \vert \mathbf{x}^{\prime}) p_{\theta}(\mathbf{x}^{\prime})$; the fifth uses $p_{n}(\mathbf{x}^{\prime} \vert \mathbf{x}) = p_{n}(\mathbf{x}, \mathbf{x}^{\prime}) / p_{n}(\mathbf{x})$.
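Putting the pieces together, one ascent step on $J$ can rely on automatic differentiation: backpropagating through the Monte Carlo estimate of $g$ above reproduces exactly this gradient, with $\mathbb{E}_{p_{n}(\mathbf{x}^{\prime} \vert \mathbf{x})} \left[ \nabla_{\theta} E \right]$ showing up as a self-normalized importance-weighted average (weights $\propto \exp(-E(\mathbf{x}^{\prime}_{k}, \theta))$, i.e., a softmax) over the perturbed samples. A minimal training loop, under the same assumptions as the sketches above:

```python
def nce_step(energy_net, optimizer, data_batch, noise_std=0.1, n_mc=128):
    # One gradient-ascent step on J(x, theta) = g(x, theta), averaged over
    # the batch. Autograd through log_ratio_g yields -grad E(x) plus a
    # softmax(-E(x'_k))-weighted average of grad E(x'_k), i.e., a
    # self-normalized estimate of E_{p_n(x'|x)}[grad E(x', theta)].
    optimizer.zero_grad()
    loss = -log_ratio_g(energy_net, data_batch, noise_std, n_mc).mean()
    loss.backward()
    optimizer.step()
    return -loss.item()

# Hypothetical usage with stand-in data:
net = EnergyNet(dim=2)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
data = torch.randn(256, 2)  # placeholder for real observations
for _ in range(1000):
    nce_step(net, opt, data)
```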