In this post, I will explain score matching and its connection with denoising autoencoders. Suppose we have observed samples from an unknown data distribution $p_{0}(\mathbf{x})$ and want to learn an energy-based model (EBM) as below,
\[\begin{align}\label{eq:ebm} p_{\theta}(\mathbf{x}) = \frac{1}{Z(\theta)} \exp \left( - E(\mathbf{x}, \theta) \right). \end{align}\]$E$ is the energy function parameterized by \(\theta\). $Z(\theta)$ is called the normalization constant or partition function, which is hard, if not impossible, to estimate. If you want to learn the model directly (i.e., find an optimal $\theta$) via maximum likelihood, then you have to deal with $Z(\theta)$ properly since it is a function of $\theta$. Score matching provides an alternative method of learning the model without estimating $Z(\theta)$. Other alternatives include Contrastive Divergence (CD) and Pseudo-Likelihood, which I will talk about in the future.
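As a sanity check of the notation, consider a toy quadratic energy \(E(\mathbf{x}, \theta) = \frac{1}{2\sigma^2} \Vert \mathbf{x} - \boldsymbol{\mu} \Vert^2\) with \(\theta = (\boldsymbol{\mu}, \sigma)\). In this special case the partition function has a closed form,
\[ Z(\theta) = \int \exp\left( -\frac{1}{2\sigma^2} \Vert \mathbf{x} - \boldsymbol{\mu} \Vert^2 \right) \mathrm{d}\mathbf{x} = (2\pi\sigma^2)^{d/2}, \]
so \(p_{\theta}(\mathbf{x})\) is just the Gaussian \(\mathcal{N}(\boldsymbol{\mu}, \sigma^2 \mathbf{I})\). Also note that \(\frac{\partial \log p_{\theta}(\mathbf{x})}{\partial \mathbf{x}} = -\frac{\mathbf{x} - \boldsymbol{\mu}}{\sigma^2}\) does not involve \(Z(\theta)\) at all, since the partition function does not depend on \(\mathbf{x}\); this is exactly the property that score matching exploits. For a general energy function, of course, the integral defining \(Z(\theta)\) has no closed form.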
Score Matching
The score $\psi$ is the derivative of the log density w.r.t. the input example, i.e.,
\[\begin{align}\label{eq:score} \psi(\mathbf{x}, \theta) = \frac{\partial \log p_{\theta}(\mathbf{x}) }{\partial \mathbf{x}} = - \frac{\partial E(\mathbf{x}, \theta)}{\partial \mathbf{x}}. \end{align}\]Note that the “score” here is slightly different from the one used in statistics, i.e., the derivative of the log density w.r.t. the model parameter $\theta$. The objective proposed by Aapo Hyvärinen is,
\[\begin{align}\label{eq:score_match} \mathcal{L}_{\text{SM}}(\theta) = \mathbb{E}_{p_{0}(\mathbf{x})} \left[ \frac{1}{2} \left\Vert \psi(\mathbf{x}, \theta) - \frac{\partial \log p_{0}(\mathbf{x}) }{\partial \mathbf{x}} \right\Vert^2 \right]. \end{align}\]Basically, we want to match the model score $\frac{\partial \log p_{\theta}(\mathbf{x}) }{\partial \mathbf{x}}$ to the data score $\frac{\partial \log p_{0}(\mathbf{x}) }{\partial \mathbf{x}}$ in the sense of expected squared error. However, this objective is intractable since we do not know the form of $p_{0}(\mathbf{x})$. To circumvent this problem, Aapo proposed a tractable alternative objective (we call it implicit score matching (ISM), following the denoising autoencoder paper linked above) as follows,
\[\begin{align}\label{eq:score_match_implicit} \mathcal{L}_{\text{ISM}}(\theta) = \mathbb{E}_{p_{0}(\mathbf{x})} \left[ \frac{1}{2} \left\Vert \psi(\mathbf{x}, \theta) \right\Vert^2 + \sum_{i=1}^{d} \frac{\partial \psi_{i}(\mathbf{x}, \theta)}{\partial \mathbf{x}_i} \right]. \end{align}\]The surprising fact proved by Aapo is that Eq. (\ref{eq:score_match}) and Eq. (\ref{eq:score_match_implicit}) are equivalent (up to an additive constant that does not depend on $\theta$) under some mild conditions. We now go over the proof, which involves an integration-by-parts trick.
\[\begin{align} \mathcal{L}_{\text{SM}}(\theta) & = \frac{1}{2} \mathbb{E}_{p_{0}(\mathbf{x})} \left[ \psi(\mathbf{x}, \theta)^{\top} \psi(\mathbf{x}, \theta) + \frac{\partial \log p_{0}(\mathbf{x}) }{\partial \mathbf{x}}^{\top} \frac{\partial \log p_{0}(\mathbf{x}) }{\partial \mathbf{x}} - 2 \psi(\mathbf{x}, \theta)^{\top} \frac{\partial \log p_{0}(\mathbf{x}) }{\partial \mathbf{x}} \right] \nonumber \\ & \Leftrightarrow \frac{1}{2} \mathbb{E}_{p_{0}(\mathbf{x})} \left[ \psi(\mathbf{x}, \theta)^{\top} \psi(\mathbf{x}, \theta) - 2 \psi(\mathbf{x}, \theta)^{\top} \frac{\partial \log p_{0}(\mathbf{x}) }{\partial \mathbf{x}} \right] \label{eq:sm_ism_equivalence} \end{align}\]The 2nd line drops the middle term since it does not depend on $\theta$; this is what the “$\Leftrightarrow$” indicates. The only remaining troublesome term is the cross term, which can be further rewritten as
\[\begin{align} \mathbb{E}_{p_{0}(\mathbf{x})} \left[ \psi(\mathbf{x}, \theta)^{\top} \frac{\partial \log p_{0}(\mathbf{x}) }{\partial \mathbf{x}} \right] & = \sum_{i=1}^{d} \int_{-\infty}^{\infty} p_{0}(\mathbf{x}) \psi_{i}(\mathbf{x}, \theta) \frac{\partial \log p_{0}(\mathbf{x}) }{\partial \mathbf{x}_i} \mathrm{d} \mathbf{x} \nonumber \\ & = \sum_{i=1}^{d} \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} p_{0}(\mathbf{x}) \psi_{i}(\mathbf{x}, \theta) \frac{\partial \log p_{0}(\mathbf{x}) }{\partial \mathbf{x}_i} \mathrm{d}\mathbf{x}_i \right) \mathrm{d}\mathbf{x}_{-i} \nonumber \\ & = \sum_{i=1}^{d} \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} \psi_{i}(\mathbf{x}, \theta) \frac{\partial p_{0}(\mathbf{x}) }{\partial \mathbf{x}_i} \mathrm{d}\mathbf{x}_i \right) \mathrm{d}\mathbf{x}_{-i} \nonumber \\ & = \sum_{i=1}^{d} \int_{-\infty}^{\infty} \left( \left. \psi_{i}(\mathbf{x}, \theta) p_{0}(\mathbf{x}) \right\vert_{\mathbf{x}_i = -\infty}^{\infty} - \int_{-\infty}^{\infty} \frac{\partial \psi_{i}(\mathbf{x}, \theta)}{\partial \mathbf{x}_i} p_{0}(\mathbf{x}) \mathrm{d}\mathbf{x}_i \right) \mathrm{d}\mathbf{x}_{-i} \nonumber \\ & = \sum_{i=1}^{d} \int_{-\infty}^{\infty} \left( - \int_{-\infty}^{\infty} \frac{\partial \psi_{i}(\mathbf{x}, \theta)}{\partial \mathbf{x}_i} p_{0}(\mathbf{x}) \mathrm{d}\mathbf{x}_i \right) \mathrm{d}\mathbf{x}_{-i} \nonumber \\ & = - \int_{-\infty}^{\infty} \left( \sum_{i=1}^{d} \frac{\partial \psi_{i}(\mathbf{x}, \theta)}{\partial \mathbf{x}_i} \right) p_{0}(\mathbf{x}) \mathrm{d}\mathbf{x} \label{eq:cross_term} \end{align}\]where we use the notation \(\psi_{i}(\mathbf{x}, \theta) = \frac{\partial \log p_{\theta}(\mathbf{x}) }{\partial \mathbf{x}_i}\) and \(\mathbf{x}_{-i} = \{\mathbf{x}_j \vert j = 1, \cdots, d \& j \neq i\}\). From the 3rd line to the 4th line, we use the 1-dimensional integration by parts rule, i.e., \(uv = \int u \mathrm{d} v + \int v \mathrm{d} u\). From the 4th line to the 5th line, we use the assumption imposed by Aapo, i.e., \(\lim_{\mathbf{x} \rightarrow \pm\infty} \psi(\mathbf{x}, \theta) p_{0}(\mathbf{x}) = \mathbf{0}\). Plugging Eq. (\ref{eq:cross_term}) into Eq. (\ref{eq:sm_ism_equivalence}), we prove that Eq. (\ref{eq:score_match}) and Eq. (\ref{eq:score_match_implicit}) are equivalent.
Now let us reexamine the conditions (restated from the paper) under which this equivalence holds:
- \(p_{0}(\mathbf{x})\) is differentiable;
- \(\mathbb{E}_{p_{0}(\mathbf{x})} \Vert \psi(\mathbf{x}, \theta) \Vert^2\) and \(\mathbb{E}_{p_{0}(\mathbf{x})} \Vert \psi_{\mathbf{x}}(\mathbf{x}, \theta) \Vert^2\) are finite for any \(\theta\);
- \(\lim_{\mathbf{x} \rightarrow \pm\infty} \psi(\mathbf{x}, \theta) p_{0}(\mathbf{x}) = \mathbf{0}\) for any \(\theta\).
The 1st condition is needed so that the data score is well defined. The 2nd one is required since otherwise the improper integral (i.e., the expectation) may diverge. The 3rd one is introduced more or less for technical convenience. But it also makes intuitive sense: we typically do not observe infinitely large values in our data (i.e., \(\lim_{\mathbf{x} \rightarrow \pm\infty} p_{0}(\mathbf{x}) = 0\)) and would not like the model to put density at places infinitely far away in the sample space (i.e., \(\lim_{\mathbf{x} \rightarrow \pm\infty} \psi(\mathbf{x}, \theta) = \mathbf{0}\)).
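To make Eq. (\ref{eq:score_match_implicit}) concrete, here is a minimal PyTorch-style sketch of the ISM objective, assuming a hypothetical energy network `energy_net` that maps a batch of inputs to scalar energies. The exact Jacobian trace below costs one extra backward pass per input dimension, so it is only practical for small \(d\).

```python
# A minimal sketch of the implicit score matching (ISM) objective:
# 0.5 * ||psi(x, theta)||^2 + sum_i d psi_i / d x_i, averaged over samples.
import torch

def ism_loss(energy_net, x):
    x = x.requires_grad_(True)
    energy = energy_net(x).sum()                                # sum over the batch
    # Model score psi(x, theta) = -dE/dx, shape (B, d).
    score = -torch.autograd.grad(energy, x, create_graph=True)[0]
    # Trace of the score Jacobian: sum_i d psi_i / d x_i.
    trace = 0.0
    for i in range(x.shape[1]):
        grad_i = torch.autograd.grad(score[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]
    return (0.5 * (score ** 2).sum(dim=1) + trace).mean()

# Usage sketch: a tiny MLP energy on 2-D data (placeholders, not from the post).
energy_net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Softplus(),  # smooth activation, so the trace term is informative
    torch.nn.Linear(64, 1),
)
x = torch.randn(128, 2)                           # placeholder data samples
loss = ism_loss(lambda z: energy_net(z).squeeze(-1), x)
loss.backward()
```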
Consistency of Score Matching
We now show that the score matching estimator is statistically (locally) consistent. In other words, the global minimum of Eq. (\ref{eq:score_match_implicit}) gives us the correct estimate of the data distribution, just as maximum likelihood does. We first need to make some assumptions.
- Realizability: there exists \(\theta^{\ast}\) s.t. \(p_{\theta^{\ast}}(\mathbf{x}) = p_{0}(\mathbf{x})\)
- Non-degeneracy: for any \(\theta_1 \neq \theta_2\), there exists some \(\mathbf{x}\) such that \(p_{\theta_1}(\mathbf{x}) \neq p_{\theta_2}(\mathbf{x})\)
To arrive at the conclusion, we need to prove:
- \[\mathcal{L}_{\text{SM}}(\theta) = 0 \Leftrightarrow \theta = \theta^{\ast}\]
- As the number of samples approaches infinity, the global optimum of \(\mathcal{L}_{\text{ISM}}(\theta)\) converges to \(\theta^{\ast}\).
For the 1st result, let us first prove the forward direction (\(\Rightarrow\)). By realizability, \(p_{0}(\mathbf{x}) = p_{\theta^{\ast}}(\mathbf{x})\), and since the exponential in Eq. (\ref{eq:ebm}) is strictly positive for any finite energy, we have \(p_{0}(\mathbf{x}) > 0\) for any \(\mathbf{x}\). Then \(\mathcal{L}_{\text{SM}}(\theta) = 0\) implies \(\frac{\partial \log p_{\theta}(\mathbf{x}) }{\partial \mathbf{x}} = \frac{\partial \log p_{0}(\mathbf{x}) }{\partial \mathbf{x}}\) for all \(\mathbf{x}\), which in turn implies \(\log p_{\theta}(\mathbf{x}) = \log p_{0}(\mathbf{x}) + c\) for some constant \(c\); since both are pdfs that integrate to one, \(c = 0\) and hence \(p_{\theta}(\mathbf{x}) = p_{0}(\mathbf{x})\). We then have \(\theta = \theta^{\ast}\) by the non-degeneracy assumption. The reverse direction (\(\Leftarrow\)) is trivial.
For the 2nd result, we just need to apply the law of large numbers: the sample average version of \(\mathcal{L}_{\text{ISM}}(\theta)\) converges in probability to Eq. (\ref{eq:score_match_implicit}), so its global optimum converges to \(\theta^{\ast}\).
One thing to note is that the energy function could be non-convex, which allows multiple local minima to exist. The above consistency result only applies when the optimization is started close to the global minimum. This also raises a question in practice: if we use a neural network to construct our energy function, we typically no longer have any consistency guarantees due to its non-convex nature. Even worse, some neural networks violate the non-degeneracy assumption: e.g., since ReLU networks are (positively) homogeneous, one can rescale the weights/parameters without changing the output of the network, so distinct parameter settings yield exactly the same pdf, as illustrated below.
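For instance, consider a (hypothetical) two-layer ReLU energy \(E(\mathbf{x}, \theta) = \mathbf{w}_2^{\top} \mathrm{ReLU}(\mathbf{W}_1 \mathbf{x})\). Since \(\mathrm{ReLU}(c\mathbf{z}) = c\, \mathrm{ReLU}(\mathbf{z})\) for any \(c > 0\),
\[ \left( \tfrac{1}{c} \mathbf{w}_2 \right)^{\top} \mathrm{ReLU}\left( c\, \mathbf{W}_1 \mathbf{x} \right) = \mathbf{w}_2^{\top} \mathrm{ReLU}(\mathbf{W}_1 \mathbf{x}) \quad \text{for all } \mathbf{x}, \]
so the parameter settings \((\mathbf{W}_1, \mathbf{w}_2)\) and \((c \mathbf{W}_1, \tfrac{1}{c} \mathbf{w}_2)\) define the same energy, hence the same pdf and the same score, violating the non-degeneracy assumption above.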
Connections with Denoising Autoencoder
Let’s first review what a denoising autoencoder does. Given clean data \(\mathbf{x}\), we corrupt it with some probabilistic mechanism \(q(\tilde{\mathbf{x}} \vert \mathbf{x})\) (e.g., adding Gaussian noise) to get the corrupted data \(\tilde{\mathbf{x}}\). A denoising autoencoder takes the corrupted data \(\tilde{\mathbf{x}}\) as input and predicts the clean one \(\mathbf{x}\). The loss is typically the reconstruction error between the prediction and the clean data. According to the above data generation process, we have the joint distribution \(q(\tilde{\mathbf{x}}, \mathbf{x}) = q(\tilde{\mathbf{x}} \vert \mathbf{x}) p_{0}(\mathbf{x})\). The key observation made by Pascal Vincent is that adding random noise to the observed data is equivalent to sampling from a Parzen window (a.k.a. kernel density) estimate centered on the observed data.
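For concreteness, a single training step with Gaussian corruption might look like the minimal sketch below; `autoencoder` is a hypothetical network mapping corrupted inputs back to the data space.

```python
# A minimal sketch of a denoising autoencoder objective with Gaussian
# corruption q(x_tilde | x) = N(x, sigma^2 I).
import torch

def dae_loss(autoencoder, x, sigma=0.1):
    x_tilde = x + sigma * torch.randn_like(x)    # sample from q(x_tilde | x)
    x_hat = autoencoder(x_tilde)                 # predict the clean data from the corrupted one
    return ((x_hat - x) ** 2).sum(dim=1).mean()  # reconstruction error
```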
From this insightful perspective, we first construct a score matching objective with a non-parametric estimate \(q(\tilde{\mathbf{x}})\) of the data distribution \(p_{0}(\mathbf{x})\) (e.g., you can perform KDE on samples from \(p_{0}(\mathbf{x})\) to get \(q(\tilde{\mathbf{x}})\)),
\[\begin{align}\label{eq:nonparametric_score_match} \mathcal{L}_{\text{NP-SM}}(\theta) = \mathbb{E}_{q(\tilde{\mathbf{x}})} \left[ \frac{1}{2} \left\Vert \psi(\tilde{\mathbf{x}}, \theta) - \frac{\partial \log q(\tilde{\mathbf{x}}) }{\partial \tilde{\mathbf{x}}} \right\Vert^2 \right]. \end{align}\]Here we want to match our model score to the new data score, i.e., the score of the estimated kernel density \(q(\tilde{\mathbf{x}})\). The benefit of having this estimated kernel density is that we can now directly optimize the score matching objective rather than resort to its implicit form as in Eq. (\ref{eq:score_match_implicit}). More interestingly, we can show that the above objective is equivalent to the following one,
\[\begin{align}\label{eq:denoise_score_match} \mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}_{q(\mathbf{x}, \tilde{\mathbf{x}})} \left[ \frac{1}{2} \left\Vert \psi(\tilde{\mathbf{x}}, \theta) - \frac{\partial \log q(\tilde{\mathbf{x}} \vert \mathbf{x}) }{\partial \tilde{\mathbf{x}}} \right\Vert^2 \right]. \end{align}\]This objective, called denoising score matching, matches the model score evaluated at the corrupted data to the score of the corruption density \(q(\tilde{\mathbf{x}} \vert \mathbf{x})\), in expectation under the joint distribution. The practical significance of this objective is that we can use any differentiable noise density \(q(\tilde{\mathbf{x}} \vert \mathbf{x})\) (e.g., a Gaussian for simplicity) to compute it directly, without performing an extra KDE step.
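For the common Gaussian choice \(q(\tilde{\mathbf{x}} \vert \mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}}; \mathbf{x}, \sigma^2 \mathbf{I})\), the conditional score is \(\frac{\partial \log q(\tilde{\mathbf{x}} \vert \mathbf{x})}{\partial \tilde{\mathbf{x}}} = \frac{\mathbf{x} - \tilde{\mathbf{x}}}{\sigma^2}\), so Eq. (\ref{eq:denoise_score_match}) becomes a simple regression. A minimal sketch, assuming a hypothetical network `score_net` that directly outputs \(\psi(\tilde{\mathbf{x}}, \theta)\):

```python
# A minimal sketch of the denoising score matching (DSM) objective with
# Gaussian corruption, where d log q(x_tilde | x) / d x_tilde = (x - x_tilde) / sigma^2.
import torch

def dsm_loss(score_net, x, sigma=0.1):
    x_tilde = x + sigma * torch.randn_like(x)             # sample from q(x_tilde | x)
    target = (x - x_tilde) / sigma ** 2                    # score of the Gaussian corruption kernel
    psi = score_net(x_tilde)                               # model score at the corrupted data
    return 0.5 * ((psi - target) ** 2).sum(dim=1).mean()
```

Unlike the ISM sketch earlier, no second-order derivatives are required, which is a large part of the practical appeal of denoising score matching.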
We now prove the equivalence between Eq. (\ref{eq:nonparametric_score_match}) and Eq. (\ref{eq:denoise_score_match}) as follows,
\[\begin{align} \mathcal{L}_{\text{DSM}}(\theta) & = \frac{1}{2} \mathbb{E}_{q(\mathbf{x}, \tilde{\mathbf{x}})} \left[ \psi(\tilde{\mathbf{x}}, \theta)^{\top} \psi(\tilde{\mathbf{x}}, \theta) + \frac{\partial \log q(\tilde{\mathbf{x}} \vert \mathbf{x}) }{\partial \tilde{\mathbf{x}}}^{\top} \frac{\partial \log q(\tilde{\mathbf{x}} \vert \mathbf{x}) }{\partial \tilde{\mathbf{x}}} - 2 \psi(\tilde{\mathbf{x}}, \theta)^{\top} \frac{\partial \log q(\tilde{\mathbf{x}} \vert \mathbf{x}) }{\partial \tilde{\mathbf{x}}} \right] \nonumber \\ & \Leftrightarrow \frac{1}{2} \mathbb{E}_{q(\mathbf{x}, \tilde{\mathbf{x}})} \left[ \psi(\tilde{\mathbf{x}}, \theta)^{\top} \psi(\tilde{\mathbf{x}}, \theta) - 2 \psi(\tilde{\mathbf{x}}, \theta)^{\top} \frac{\partial \log q(\tilde{\mathbf{x}} \vert \mathbf{x}) }{\partial \tilde{\mathbf{x}}} \right] \nonumber \\ & = \frac{1}{2} \mathbb{E}_{q(\tilde{\mathbf{x}})} \left[ \psi(\tilde{\mathbf{x}}, \theta)^{\top} \psi(\tilde{\mathbf{x}}, \theta) \right] - \mathbb{E}_{q(\mathbf{x}, \tilde{\mathbf{x}})} \left[ \psi(\tilde{\mathbf{x}}, \theta)^{\top} \frac{\partial \log q(\tilde{\mathbf{x}} \vert \mathbf{x}) }{\partial \tilde{\mathbf{x}}} \right] \nonumber \\ & = \frac{1}{2} \mathbb{E}_{q(\tilde{\mathbf{x}})} \left[ \psi(\tilde{\mathbf{x}}, \theta)^{\top} \psi(\tilde{\mathbf{x}}, \theta) \right] - \int\int \frac{q(\mathbf{x}, \tilde{\mathbf{x}})}{q(\tilde{\mathbf{x}} \vert \mathbf{x})} \psi(\tilde{\mathbf{x}}, \theta)^{\top} \frac{\partial q(\tilde{\mathbf{x}} \vert \mathbf{x}) }{\partial \tilde{\mathbf{x}}} \mathrm{d}\mathbf{x} \mathrm{d}\tilde{\mathbf{x}} \nonumber \\ & = \frac{1}{2} \mathbb{E}_{q(\tilde{\mathbf{x}})} \left[ \psi(\tilde{\mathbf{x}}, \theta)^{\top} \psi(\tilde{\mathbf{x}}, \theta) \right] - \int \psi(\tilde{\mathbf{x}}, \theta)^{\top} \frac{\partial}{\partial \tilde{\mathbf{x}}} \int q(\tilde{\mathbf{x}} \vert \mathbf{x}) q(\mathbf{x}) \mathrm{d}\mathbf{x} \mathrm{d}\tilde{\mathbf{x}} \nonumber \\ & = \frac{1}{2} \mathbb{E}_{q(\tilde{\mathbf{x}})} \left[ \psi(\tilde{\mathbf{x}}, \theta)^{\top} \psi(\tilde{\mathbf{x}}, \theta) \right] - \int \psi(\tilde{\mathbf{x}}, \theta)^{\top} \frac{\partial q(\tilde{\mathbf{x}})}{\partial \tilde{\mathbf{x}}} \mathrm{d}\tilde{\mathbf{x}} \nonumber \\ & = \frac{1}{2} \mathbb{E}_{q(\tilde{\mathbf{x}})} \left[ \psi(\tilde{\mathbf{x}}, \theta)^{\top} \psi(\tilde{\mathbf{x}}, \theta) \right] - \mathbb{E}_{q(\tilde{\mathbf{x}})} \left[ \psi(\tilde{\mathbf{x}}, \theta)^{\top} \frac{\partial \log q(\tilde{\mathbf{x}})}{\partial \tilde{\mathbf{x}}} \right] \nonumber \\ & \Leftrightarrow \mathcal{L}_{\text{NP-SM}}(\theta). \label{eq:sm_dsm_equivalence} \end{align}\]
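To see both sides of this equivalence numerically, note that for a Gaussian kernel the score of the KDE \(q(\tilde{\mathbf{x}})\) is also available in closed form, so Eq. (\ref{eq:nonparametric_score_match}) can be computed directly as well. A sketch under the same assumptions as above (hypothetical `score_net`, Gaussian kernel with bandwidth `sigma`):

```python
# A minimal sketch of the explicit NP-SM objective, where q(x_tilde) is a
# Gaussian KDE built from the training samples `data`.
import torch

def kde_score(x_tilde, data, sigma):
    # d log q(x_tilde) / d x_tilde for q(x_tilde) = (1/N) sum_j N(x_tilde; x_j, sigma^2 I)
    diffs = data.unsqueeze(0) - x_tilde.unsqueeze(1)        # (B, N, d): x_j - x_tilde
    logw = -(diffs ** 2).sum(dim=2) / (2 * sigma ** 2)      # log kernel weights, (B, N)
    w = torch.softmax(logw, dim=1).unsqueeze(2)             # normalized responsibilities
    return (w * diffs).sum(dim=1) / sigma ** 2              # (B, d)

def npsm_loss(score_net, data, sigma=0.1):
    x_tilde = data + sigma * torch.randn_like(data)         # corrupting the data = sampling from q(x_tilde)
    target = kde_score(x_tilde, data, sigma)
    return 0.5 * ((score_net(x_tilde) - target) ** 2).sum(dim=1).mean()
```

The two losses agree up to a \(\theta\)-independent constant and hence share the same minimizer, but the DSM version avoids the \(O(N)\) sweep over the whole dataset inside `kde_score`, which is what makes it practical.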