Density Ratio Trick

Renjie Liao · June 28, 2020

Machine Learning   Statistics

There are many scenarios in machine learning and statistics where we would like to compute the ratio of two density functions, e.g., in estimating KL divergence or mutual information, in importance sampling, and in hypothesis testing. Specifically, given two densities \(p(X)\) and \(q(X)\) on the same sample space, we’d like to compute

\[\begin{align} r(X) = \frac{p(X)}{q(X)}. \end{align}\]

Computing this ratio directly is often intractable in practice, e.g., when the densities \(p\) and \(q\) are known only up to a normalization constant, as in energy-based models. The nice thing about the density ratio trick is that it lets us estimate the ratio even under such circumstances. The core idea is simple and elegant: rephrase the original problem as a classification problem. We first build a generative model as follows,

\[\begin{align}\label{eq:generation} \mathbb{P}(Z = 0) & = \frac{1}{2} \nonumber \\ \mathbb{P}(Z = 1) & = \frac{1}{2} \nonumber \\ \mathbb{P}(X \vert Z = 0) & = p(X) \nonumber \\ \mathbb{P}(X \vert Z = 1) & = q(X) \end{align}\]
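To make this concrete, here is a minimal sketch of sampling from this generative model in Python; the particular choices \(p = \mathcal{N}(0, 1)\) and \(q = \mathcal{N}(1, 1.5^2)\) are purely illustrative assumptions, and any two distributions we can sample from would do.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint(n, rng=rng):
    """Draw n samples (X, Z) from the generative model above."""
    # Z ~ Bernoulli(1/2): the binary class label.
    z = rng.integers(0, 2, size=n)
    # X | Z = 0 ~ p(X) = N(0, 1),  X | Z = 1 ~ q(X) = N(1, 1.5^2)  (illustrative choices).
    x = np.where(z == 0,
                 rng.normal(0.0, 1.0, size=n),
                 rng.normal(1.0, 1.5, size=n))
    return x, z
```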

Here you could treat \(Z\) as the binary class label. Given a sample drawn from this generative model, we classify it by computing the posterior (Bayes' theorem),

\[\begin{align}\label{eq:classification} \mathbb{P}(Z \vert X) = \frac{\mathbb{P}(X \vert Z) \mathbb{P}(Z)}{\mathbb{P}(X)} \end{align}\]
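Since the class priors are equal, the marginal is \(\mathbb{P}(X) = \frac{1}{2}\left(p(X) + q(X)\right)\), and the posterior of class 0 simplifies to

\[\begin{align} \mathbb{P}(Z = 0 \vert X) = \frac{\frac{1}{2}\, p(X)}{\frac{1}{2}\left(p(X) + q(X)\right)} = \frac{p(X)}{p(X) + q(X)}. \end{align}\]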

Therefore, the density ratio is computed as,

\[\begin{align}\label{eq:density_ratio_trick} r(X) = \frac{p(X)}{q(X)} = \frac{\mathbb{P}(X \vert Z = 0)}{\mathbb{P}(X \vert Z = 1)} = \frac{\mathbb{P}(Z = 0 \vert X)}{\mathbb{P}(Z = 1 \vert X)} = \frac{\mathbb{P}(Z = 0 \vert X)}{1 - \mathbb{P}(Z = 0 \vert X)}. \end{align}\]

This trick implies that, in order to compute the ratio, we only need to (1) sample data from the generative model, which only requires the ability to sample from \(p(X)\) and \(q(X)\), not knowledge of their exact densities, and (2) train a classifier that outputs \(\mathbb{P}(Z = 0 \vert X)\). In other words, we transform the original ratio-computation problem into an equivalent classification problem.
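Putting the two steps together, here is a sketch in Python under the same illustrative assumptions (\(p = \mathcal{N}(0, 1)\), \(q = \mathcal{N}(1, 1.5^2)\)); it uses scikit-learn's LogisticRegression as the probabilistic classifier, with quadratic features so that the model can represent the exact log-ratio of two Gaussians. This is just one possible instantiation of step (2), not the only way to fit the classifier.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000

# Step (1): equal-sized samples from p (label Z = 0) and q (label Z = 1).
x_p = rng.normal(0.0, 1.0, size=n)
x_q = rng.normal(1.0, 1.5, size=n)
x = np.concatenate([x_p, x_q]).reshape(-1, 1)
z = np.concatenate([np.zeros(n), np.ones(n)])

# Step (2): a classifier that outputs P(Z = 0 | X). With quadratic features,
# logistic regression can represent the exact log-ratio of two Gaussians.
features = np.hstack([x, x ** 2])
clf = LogisticRegression().fit(features, z)

def estimated_ratio(x_query):
    """Estimate r(X) = p(X) / q(X) via P(Z = 0 | X) / (1 - P(Z = 0 | X))."""
    x_query = np.asarray(x_query, dtype=float).reshape(-1, 1)
    feats = np.hstack([x_query, x_query ** 2])
    p_z0 = clf.predict_proba(feats)[:, 0]   # column 0 corresponds to Z = 0
    return p_z0 / (1.0 - p_z0)

# Sanity check against the true ratio, which is available in this toy example.
xs = np.array([-1.0, 0.0, 1.0, 2.0])
print("estimated:", estimated_ratio(xs))
print("true:     ", norm.pdf(xs, 0.0, 1.0) / norm.pdf(xs, 1.0, 1.5))
```

On this toy problem the printed estimates should be close to the true ratio; in general, the quality of the ratio estimate depends entirely on how well the classifier approximates the true posterior, which is exactly the catch discussed next.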

However, there is a catch: we are free to use an arbitrary classifier to approximate the true posterior, but the Bayes-optimal classifier requires knowing the densities themselves. Fortunately, in practice one can learn a classifier with a surrogate loss, such as the logistic or max-margin loss, so that the resulting ratio estimate is reasonably accurate. How to quantify the estimation error for this type of estimator is an interesting topic.