A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

Renjie Liao · July 1, 2020

Machine Learning   Deep Learning   Statistical Learning Theory

In this post, we will review a great paper that bounds the generalization error of a broad class of models, including a popular class of deep neural networks, using the PAC-Bayesian framework.

Background

We consider a class \(\mathcal{H}\) of neural networks, where an instance \(f_w \in \mathcal{H}\) takes the following form,

\[f_w(x) = W_{d} \phi( W_{d-1} \phi( \dots ( W_{1} x ) ) ),\]

where \(w = \text{vec}([W_{1}, \dots, W_{d}])\) is the concatenated vector of all weights [1], \(\phi\) is the ReLU activation [2], and \(d\) is the number of layers. This class encapsulates many fully connected and convolutional neural networks used in practice.
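
To make the notation concrete, here is a minimal NumPy sketch of such a network; the layer sizes and random weights below are arbitrary choices for illustration, not anything from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f_w(x, weights):
    """f_w(x) = W_d relu(W_{d-1} relu(... relu(W_1 x) ...)); no activation after the last layer."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

# Example: a 3-layer network mapping R^10 to k = 5 output scores (sizes are arbitrary).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((20, 10)),
           rng.standard_normal((20, 20)),
           rng.standard_normal((5, 20))]
scores = f_w(rng.standard_normal(10), weights)
```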

A training sample, i.e., a pair of input and output, denoted as \(z = (x, y) \in \mathcal{Z}\), is drawn from some unknown distribution \(D\) in an i.i.d. manner. We measure the performance of the model using the following expected margin loss.

\[L_{\gamma}(f_w, D) = \mathbb{P}_{z \sim D} \left[ f_w(x)[y] \le \gamma + \max_{j \neq y} f_w(x)[j] \right]\]

We denote its empirical estimate based on a size-\(m\) training set \(S\) as

\[L_{\gamma}(f_w, S) = \frac{1}{m} \sum_{z \in S} \bm{1}\left[ f_w(x)[y] \le \gamma + \max_{j \neq y} f_w(x)[j] \right]\]
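
As a quick illustration, the empirical margin loss can be computed from the model's output scores as in the sketch below; `scores` (shape \(m \times k\)) and `labels` (shape \(m\)) are assumed to be given.

```python
import numpy as np

def empirical_margin_loss(scores, labels, gamma):
    """Fraction of examples whose true-class score fails to beat the best other class by more than gamma."""
    m = scores.shape[0]
    correct = scores[np.arange(m), labels]
    others = scores.copy()
    others[np.arange(m), labels] = -np.inf        # mask out the true class
    runner_up = others.max(axis=1)
    return np.mean(correct <= gamma + runner_up)  # indicator averaged over S
```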

PAC-Bayesian approaches introduce a prior distribution \(P\) and a posterior distribution \(Q\) over the same model class \(\mathcal{H}\). Under this setting, we care about the following losses,

\[\begin{align} L_{\gamma}(Q, D) & = \mathbb{E}_{f_w \sim Q} \left[ L_{\gamma}(f_w, D) \right] \nonumber \\ L_{\gamma}(Q, S) & = \mathbb{E}_{f_w \sim Q} \left[ L_{\gamma}(f_w, S) \right]. \end{align}\]

The main result (due to David McAllester) of the PAC-Bayesian approaches is as follows.

Theorem 1. Let \(D\) be an arbitrary distribution over \(\mathcal{Z}\), i.e., the space of input-output pairs. Let \(\mathcal{H}\) be a hypothesis class and let \(\ell: \mathcal{H} \times \mathcal{Z} \rightarrow [0, 1]\) be a loss function (e.g., the margin loss above). Let \(P\) be a prior distribution over \(\mathcal{H}\) and let \(\delta \in (0, 1)\). Then, with probability at least \(1 - \delta\) over the choice of an i.i.d. size-\(m\) training set \(S\) according to \(D\), for all distributions \(Q\) over \(\mathcal{H}\), we have

\[L_{\gamma}(Q, D) \le L_{\gamma}(Q, S) + \sqrt{\frac{\text{KL}(Q \Vert P) + \ln \frac{2m}{\delta} }{2(m-1)}}\]
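
In code, the right-hand side of Theorem 1 is a one-liner once the empirical loss and the KL term are available; the sketch below just evaluates the formula and assumes those two quantities are computed elsewhere.

```python
import numpy as np

def pac_bayes_bound(empirical_loss, kl, m, delta):
    """Right-hand side of Theorem 1: L(Q, S) + sqrt((KL(Q||P) + ln(2m/delta)) / (2(m - 1)))."""
    return empirical_loss + np.sqrt((kl + np.log(2 * m / delta)) / (2 * (m - 1)))

# e.g., pac_bayes_bound(empirical_loss=0.05, kl=100.0, m=50000, delta=0.01)
```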

The appealing property of this theorem is that it holds for any posterior \(Q\), which of course includes the one returned by the learning process.

Point Estimated PAC-Bayes

If you try to compare the generalization bound derived from PAC-Bayes with those derived from other approaches, you will find that they are not directly comparable. The reason is that typical approaches bound the point estimate of the generalization error, i.e., \(L_{\gamma}(f_w, D)\), whereas PAC-Bayes bounds the Bayesian estimate, i.e., \(L_{\gamma}(Q, D)\). To make a comparison, one can turn the PAC-Bayes bound in Theorem 1 into a point-estimate form, i.e., the situation where we just have a deterministic model. The paper uses the same proof technique as McAllester's original work and provides the following general result.

Lemma 1. Let \(f_w(x): \mathcal{X} \rightarrow \mathbb{R}^k\) be any model with parameters \(w\), and let \(P\) be any distribution on the parameters that is independent of the training data. Then, for any \(\gamma, \delta > 0\), with probability at least \(1 - \delta\) over the size-\(m\) training set, for any \(w\) and any random perturbation \(u\) s.t. \(\mathbb{P}_u(\max_{x \in \mathcal{X}} \Vert f_{w + u}(x) - f_w(x) \Vert_{\infty} < \frac{\gamma}{4} ) > \frac{1}{2}\), we have:

\[L_{0}(f_w, D) \le L_{\gamma}(f_w, S) + \sqrt{\frac{\text{KL}(w + u \Vert P) + \ln \frac{2m}{\delta} }{2(m-1)}}\]

Here the posterior \(Q\) is the distribution of the random perturbation \(u\) shifted by the learned weights \(w\).
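
The condition on the perturbation can be checked numerically, at least heuristically: the sketch below estimates, by Monte Carlo over Gaussian perturbations, how often the output change stays below \(\frac{\gamma}{4}\). Note that the maximum over all of \(\mathcal{X}\) is replaced by a maximum over a finite batch `xs`, and the Gaussian perturbation scale `sigma` is my own choice for illustration, so this is only a sanity check rather than a verification of the lemma's condition.

```python
import numpy as np

def perturbation_condition_holds(weights, xs, gamma, sigma, forward, n_samples=200, seed=0):
    """Estimate P_u( max_x ||f_{w+u}(x) - f_w(x)||_inf < gamma/4 ) > 1/2 over a finite batch xs."""
    rng = np.random.default_rng(seed)
    base = np.stack([forward(x, weights) for x in xs])
    hits = 0
    for _ in range(n_samples):
        perturbed = [W + sigma * rng.standard_normal(W.shape) for W in weights]
        out = np.stack([forward(x, perturbed) for x in xs])
        if np.abs(out - base).max() < gamma / 4:
            hits += 1
    return hits / n_samples > 0.5
```

Here `forward` plays the role of the `f_w` function from the first sketch.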

Main Result and Sketch of Proof

Before we introduce the main results of the paper, we make a few assumptions:

  • The input data are contained in a ball of radius \(B\), i.e., \(x \in \mathcal{X}_{B, n} = \{x \in \mathbb{R}^{n} \vert \sum_{i=1}^{n} x_i^2 \le B^2\}\).
  • The maximum number of hidden units per layer is denoted as \(h\).

The key idea of the paper is as follows:

  1. Set up the prior \(P = \mathcal{N}(0, \sigma^2 I)\) on weights.
  2. Construct the posterior \(Q\) by adding Gaussian perturbation \(u\) to learned weights \(w\), i.e., \(Q(u + w) = \mathcal{N}(w, \sigma^2 I)\).
  3. Show that, with a certain probability, the spectral norm of the perturbation \(u\) is small, i.e., \(\mathbb{P}_{U \sim \mathcal{N}(0, \sigma^2 I)}(\Vert U \Vert_2 > t) \le 2he^{-t^2/2h\sigma^2}\) (a small numerical check of this tail bound is sketched right after this list).
  4. Bound the maximum change in the output of the model under the above perturbation, i.e., Lemma 2 in the paper.
  5. Apply Lemma 1 to obtain the generalization bound.
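
For step 3, a small Monte-Carlo experiment (mine, not the paper's) can be used to eyeball the tail bound: draw \(h \times h\) matrices with i.i.d. \(\mathcal{N}(0, \sigma^2)\) entries and compare the empirical exceedance probability with \(2he^{-t^2/2h\sigma^2}\). The values of \(h\), \(\sigma\), and \(t\) below are arbitrary.

```python
import numpy as np

def spectral_tail_check(h=50, sigma=0.1, t=2.5, n_samples=200, seed=0):
    """Compare P(||U||_2 > t), estimated by sampling, with the bound 2h exp(-t^2 / (2 h sigma^2))."""
    rng = np.random.default_rng(seed)
    exceed = sum(np.linalg.norm(sigma * rng.standard_normal((h, h)), ord=2) > t
                 for _ in range(n_samples))
    empirical = exceed / n_samples
    bound = 2 * h * np.exp(-t ** 2 / (2 * h * sigma ** 2))
    return empirical, bound

print(spectral_tail_check())   # the empirical probability should sit below the bound
```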

We restate the generalization bound as follows,

Theorem 2. For any \(B, h, d > 0\), let \(f_w(x): \mathcal{X}_{B, n} \rightarrow \mathbb{R}^k\) be a \(d\)-layer neural network with ReLU activations. Then, for any \(\gamma, \delta > 0\), with probability at least \(1 - \delta\) over the size-\(m\) training set, for any \(w\), we have:

\[{\scriptsize L_{0}(f_w, D) \le L_{\gamma}(f_w, S) + \mathcal{O} \left( \sqrt{\frac{(Bd)^2h\ln(dh) \prod\limits_{i=1}^d \Vert W_i \Vert_{2}^{2} \sum\limits_{i=1}^{d} \frac{\Vert W_i \Vert_{F}^{2}}{\Vert W_i \Vert_{2}^{2}} + \ln \frac{dm}{\delta} }{\gamma^2 m}} \right) }\]
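
To get a feel for the bound, the capacity term inside the \(\mathcal{O}(\cdot)\) can be evaluated for a trained network as in the sketch below. This is my own transcription of the formula (ignoring the constants hidden by the big-O), not the authors' code.

```python
import numpy as np

def theorem2_capacity(weights, B, gamma, m, delta):
    """Evaluate the expression under the square root in Theorem 2, up to the hidden constant."""
    d = len(weights)                                   # number of layers
    h = max(max(W.shape) for W in weights)             # max number of hidden units per layer
    spec = np.array([np.linalg.norm(W, ord=2) for W in weights])   # spectral norms
    frob = np.array([np.linalg.norm(W) for W in weights])          # Frobenius norms
    numerator = (B * d) ** 2 * h * np.log(d * h) \
                * np.prod(spec ** 2) * np.sum(frob ** 2 / spec ** 2) \
                + np.log(d * m / delta)
    return np.sqrt(numerator / (gamma ** 2 * m))
```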

We sketch the proof of this theorem below.

To leverage Lemma 1, we need to show that this class of neural networks has a bounded change in output after the weights are perturbed. As in step 3, for a Gaussian perturbation we know that, with a certain probability, the spectral norm of the perturbation is bounded, which in turn allows us to bound the change in the output of the network, i.e., \(f_{w+u}(x) - f_w(x)\). The main idea is to use an induction over layers to show that this bound depends on the norms of the weights.

We want the change in output to be no larger than \(\frac{\gamma}{4}\), as required by Lemma 1, which implies that the variance \(\sigma^2\) shared by the prior and the perturbation must be bounded in terms of the weight norms. This is tricky, since it means our prior depends on the learned weights, which breaks the independence between the prior and the data. To work around this, the authors discretize the range of the weight norm: they prove that the generalization bound holds for each discretization point (each of which gives a valid, data-independent prior) and then show it holds in all cases via a union bound. This is clever in that the discretization breaks the dependence of the prior (through its variance) on the data.
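
For reference, the perturbation bound obtained by this induction over layers (Lemma 2 in the paper) states, up to the exact form of the constants, that for perturbations satisfying \(\Vert U_i \Vert_2 \le \frac{1}{d} \Vert W_i \Vert_2\),

\[\Vert f_{w+u}(x) - f_w(x) \Vert_2 \le e B \left( \prod_{i=1}^{d} \Vert W_i \Vert_2 \right) \sum_{i=1}^{d} \frac{\Vert U_i \Vert_2}{\Vert W_i \Vert_2},\]

which makes the dependence on the weight norms and the perturbation scale explicit.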

Once the condition of Lemma 1 is satisfied, the only thing left is to compute the KL term. For two Gaussians with the same variance, the KL divergence has a simple closed form that depends on the weights and the variance; note that the variance has already been bounded in terms of the weight norms in the previous step.
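
Concretely, since the prior \(\mathcal{N}(0, \sigma^2 I)\) and the posterior \(\mathcal{N}(w, \sigma^2 I)\) share the same variance, the KL term reduces to

\[\text{KL}\left( \mathcal{N}(w, \sigma^2 I) \,\Vert\, \mathcal{N}(0, \sigma^2 I) \right) = \frac{\Vert w \Vert_2^2}{2 \sigma^2},\]

and plugging this, together with the choice of \(\sigma\), into Lemma 1 yields the bound in Theorem 2.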

Inspiration

This work is inspiring since it provides a simple yet elegant framework for deriving generalization bounds for a large class of models. To apply it to a customized model, the main challenge is to bound the difference between the original output and the perturbed one. However, it has also been shown that this type of PAC-Bayes bound is not as sharp as the one derived from covering numbers (which I will cover in a future post).

Footnotes:

  1. The bias is absorbed into the weights. 

  2. The results could be generalized to other nonlinear activations.