On the Softmax Function

Renjie Liao · July 21, 2020

Machine Learning   Deep Learning

Let us take a look at the softmax function, which is frequently used in deep learning.

\[\begin{align} \bm{y} = \text{softmax}(\bm{x}) = g(\bm{x}) \qquad \text{where} \quad \bm{y}_i = \frac{\exp(\bm{x}_i)}{\sum_{j=1}^{n}\exp(\bm{x}_j)}, \end{align}\]

where \(\bm{x} \in \mathbb{R}^{n}\). Note that, viewed as functions of \(\bm{x}\), \(\bm{y}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}\) while each component \(\bm{y}_i: \mathbb{R}^{n} \rightarrow \mathbb{R}\).
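
For concreteness, here is a minimal NumPy sketch of the softmax function. The helper name and the max-subtraction trick are my own illustrative choices; subtracting the maximum avoids overflow and does not change the output, since softmax is invariant to adding a constant to every entry of \(\bm{x}\).

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the output is unchanged
    # because softmax is invariant to adding a constant to all entries.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

x = np.array([1.0, 2.0, 3.0])
y = softmax(x)
print(y, y.sum())  # entries are positive and sum to one
```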

Convexity

Let us first show that the negative log of \(\bm{y}_i\), i.e., \(-\log \bm{y}_i\) (which is exactly the cross-entropy loss for class \(i\)), is convex on \(\mathbb{R}^{n}\), which is itself a convex set. Note that \(\bm{y}_i\) itself is neither convex nor concave in general. First note that

\[\begin{align} -\log \bm{y}_i = -\bm{x}_i + \log\sum_{j=1}^{n}\exp(\bm{x}_j). \end{align}\]

We denote the log-sum-exp function as \(f(\bm{x}) = \log\sum_{j=1}^{n}\exp(\bm{x}_j)\). If we can show that log-sum-exp is convex, then it is clear that \(-\log \bm{y}_i\) is convex, since \(h(\bm{x}) = -\bm{x}_i\) is linear (hence convex) and addition preserves convexity. To prove the convexity of log-sum-exp, we only need to show that its Hessian is positive semidefinite, since it is twice differentiable. We first compute its gradient. Denoting \(Z = \sum_{j=1}^{n}\exp(\bm{x}_j)\) and \(\bm{\alpha}_i = \exp(\bm{x}_i)\), we have

\[\begin{align}\label{eq:gradient_lse} \nabla f(\bm{x}) = \nabla \log\sum_{i=1}^{n}\exp(\bm{x}_i) = \frac{\bm{\alpha}}{Z} = \text{softmax}(\bm{x}) \end{align}\]
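
As a quick sanity check of Eq. \(\eqref{eq:gradient_lse}\) (not part of the derivation), one can compare a central finite-difference gradient of the log-sum-exp with the softmax output. Below is a small, self-contained NumPy sketch; the helper functions are my own illustrative implementations.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # stable softmax
    return z / np.sum(z)

def log_sum_exp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))  # stable log-sum-exp

def numerical_gradient(f, x, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.random.randn(5)
print(np.allclose(numerical_gradient(log_sum_exp, x), softmax(x), atol=1e-5))  # expected: True
```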

Then we can compute the second-order derivatives as follows,

\[\begin{align} \frac{ \partial^2 f }{ \partial \bm{x}_i \partial \bm{x}_j } & = \frac{ \partial }{ \partial \bm{x}_i } \frac{\bm{\alpha}_j}{Z} \nonumber \\ & = \frac{ \frac{ \partial \bm{\alpha}_j }{ \partial \bm{x}_i } Z - \bm{\alpha}_j \frac{ \partial Z}{ \partial \bm{x}_i } }{Z^2} \nonumber \\ & = \begin{cases} \frac{\bm{\alpha}_i}{Z} - \frac{\bm{\alpha}_i^2}{Z^2} \qquad & i = j \nonumber \\ - \frac{ \bm{\alpha}_i \bm{\alpha}_j }{Z^2} \qquad & i \neq j \nonumber \end{cases} \nonumber \\ & = \begin{cases} \bm{y}_i (1 - \bm{y}_i) \qquad & i = j \nonumber \\ - \bm{y}_i \bm{y}_j \qquad & i \neq j \nonumber \end{cases} \end{align}\]

Therefore, the Hessian of \(f\) is

\[\begin{align} \nabla^2 f(\bm{x}) & = J_\bm{g}(\bm{x}) = \begin{bmatrix} \bm{y}_1 (1 - \bm{y}_1) & - \bm{y}_1 \bm{y}_2 & - \bm{y}_1 \bm{y}_3 & \cdots & - \bm{y}_1 \bm{y}_n\\ - \bm{y}_2 \bm{y}_1 & \bm{y}_2 (1 - \bm{y}_2) & - \bm{y}_2 \bm{y}_3 & \cdots & - \bm{y}_2 \bm{y}_n\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ - \bm{y}_n \bm{y}_1 & - \bm{y}_n \bm{y}_2 & - \bm{y}_n \bm{y}_3 & \cdots & \bm{y}_n (1 - \bm{y}_n)\\ \end{bmatrix} \nonumber \\ & = \text{diag}(\bm{y}) - \bm{y} \bm{y}^{\top} \label{eq:jacobian_softmax} \end{align}\]

where \(\text{diag}(\bm{y})\) denotes the diagonal matrix whose diagonal entries are given by \(\bm{y}\). Note that the Hessian of the log-sum-exp equals the Jacobian of the softmax. To prove that \(\nabla^2 f(\bm{x})\) is positive semidefinite, we need to show that \(\bm{v}^{\top} \nabla^2 f(\bm{x}) \bm{v} \ge 0\) for all \(\bm{v} \in \mathbb{R}^{n}\). To see this,

\[\begin{align} \bm{v}^{\top} \nabla^2 f(\bm{x}) \bm{v} & = \bm{v}^{\top} \text{diag}(\bm{y}) \bm{v} - \bm{v}^{\top} \bm{y} \bm{y}^{\top} \bm{v} \nonumber \\ & = \sum_{i=1}^{n} \bm{y}_i \bm{v}_i^2 - \left( \sum_{i=1}^{n} \bm{v}_i \bm{y}_i \right)^2 \nonumber \\ & = \left(\sum_{i=1}^{n} \bm{y}_i \right) \left( \sum_{i=1}^{n} \bm{y}_i \bm{v}_i^2 \right) - \left( \sum_{i=1}^{n} \bm{v}_i \bm{y}_i \right)^2 \label{eq:hessian_lse} \end{align}\]

where the last equality is due to the fact that softmax outputs a probability vector, which sums to one. Denoting \(\bm{a}_i = \sqrt{\bm{y}_i}\) and \(\bm{b}_i = \sqrt{\bm{y}_i}\bm{v}_i\), by the Cauchy–Schwarz inequality we have

\[\begin{align} \Vert \bm{a} \Vert_2^2 \Vert \bm{b} \Vert_2^2 - \left( \bm{a}^{\top} \bm{b} \right)^2 = \left(\sum_{i=1}^{n} \bm{y}_i \right) \left( \sum_{i=1}^{n} \bm{y}_i \bm{v}_i^2 \right) - \left( \sum_{i=1}^{n} \bm{v}_i \bm{y}_i \right)^2 \ge 0, \end{align}\]

which proves the positive semidefiniteness of the Hessian and in turn the convexity of the log-sum-exp function (and hence of \(-\log \bm{y}_i\)).
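
The closed-form Hessian in Eq. \(\eqref{eq:jacobian_softmax}\) and its positive semidefiniteness are also easy to probe numerically. The following NumPy sketch (my own illustration) compares \(\text{diag}(\bm{y}) - \bm{y}\bm{y}^{\top}\) against a finite-difference Hessian of the log-sum-exp at a random point and checks that its eigenvalues are nonnegative up to round-off.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def log_sum_exp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

n = 6
x = np.random.randn(n)
y = softmax(x)

# Closed-form Hessian of log-sum-exp (= Jacobian of softmax) at x.
H = np.diag(y) - np.outer(y, y)

# Finite-difference Hessian for comparison.
eps = 1e-5
H_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei = np.zeros(n); ei[i] = eps
        ej = np.zeros(n); ej[j] = eps
        H_fd[i, j] = (log_sum_exp(x + ei + ej) - log_sum_exp(x + ei - ej)
                      - log_sum_exp(x - ei + ej) + log_sum_exp(x - ei - ej)) / (4 * eps ** 2)

print(np.allclose(H, H_fd, atol=1e-4))        # expected: True
print(np.linalg.eigvalsh(H).min() >= -1e-12)  # PSD up to round-off
```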

Lipschitz Continuity

Now let us check the Lipschitz continuity of the softmax function. First, recall the definition of an \(L\)-Lipschitz continuous function. Given two metric spaces \((X, d_X)\) and \((Y, d_Y)\), where \(d_X\) and \(d_Y\) denote the metrics on the sets \(X\) and \(Y\) respectively, a function \(f: X \rightarrow Y\) is called Lipschitz continuous with Lipschitz constant \(L\) if, for all \(\bm{u}, \bm{v} \in X\), we have

\[\begin{align} d_Y \left( f(\bm{u}), f(\bm{v}) \right) \le L d_X \left( \bm{u}, \bm{v} \right) \end{align}\]

In our case, we consider \(X = Y = \mathbb{R}^{n}\) and \(d_X(\bm{a}, \bm{b}) = d_Y(\bm{a}, \bm{b}) = \Vert \bm{a} - \bm{b} \Vert_2\).

Let us first prove the following known result.

Lemma 1. A twice differentiable and convex function \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\) has a Lipschitz continuous gradient with Lipschitz constant \(L > 0\) if and only if, for all \(\bm{u}, \bm{v} \in \mathbb{R}^{n}\),

\[\begin{align}\label{eq:Lipschitz_condition} 0 \le \bm{v}^{\top} \nabla^2 f(\bm{u}) \bm{v} \le L \Vert \bm{v} \Vert_2^2 \end{align}\]

Proof: We first prove the “if” direction, i.e., we want to show that if Eq. \(\eqref{eq:Lipschitz_condition}\) holds, then \(\nabla f(\bm{x})\) is continuous and

\[\begin{align} \Vert \nabla f(\bm{u}) - \nabla f(\bm{v}) \Vert_2 \le L \Vert \bm{u} - \bm{v} \Vert_2 \end{align}\]

First, since differentiability implies continuity, the existence of \(\nabla^2 f(\bm{u})\) implies that \(\nabla f(\bm{x})\) is continuous. Based on the Gradient Theorem, for any \(\bm{u}, \bm{v} \in \mathbb{R}^{n}\), we have the following line integral forms,

\[\begin{align} f(\bm{v}) & = f(\bm{u}) + \int_{0}^{1} \nabla f \left(\bm{u} + \tau (\bm{v} - \bm{u}) \right)^{\top} (\bm{v} - \bm{u}) \mathrm{d} \tau \nonumber \\ \nabla f(\bm{v}) & = \nabla f(\bm{u}) + \int_{0}^{1} \nabla^2 f \left(\bm{u} + \tau (\bm{v} - \bm{u}) \right) (\bm{v} - \bm{u}) \mathrm{d} \tau \end{align}\]

Therefore, for any \(\bm{u}, \bm{v} \in \mathbb{R}^{n}\), we have

\[\begin{align} f(\bm{v}) & = f(\bm{u}) + \int_{0}^{1} \nabla f \left(\bm{u} + \tau (\bm{v} - \bm{u}) \right)^{\top} (\bm{v} - \bm{u}) \mathrm{d} \tau \nonumber \\ & = f(\bm{u}) + \nabla f(\bm{u})^{\top} (\bm{v} - \bm{u}) + \int_{0}^{1} \left( \nabla f \left(\bm{u} + \tau (\bm{v} - \bm{u}) \right) - \nabla f(\bm{u}) \right)^{\top} (\bm{v} - \bm{u}) \mathrm{d} \tau \nonumber \\ & = f(\bm{u}) + \nabla f(\bm{u})^{\top} (\bm{v} - \bm{u}) + \int_{0}^{1} \left( \int_{0}^{\tau} \nabla^2 f \left(\bm{u} + \lambda (\bm{v} - \bm{u}) \right) (\bm{v} - \bm{u}) \mathrm{d} \lambda \right)^{\top} (\bm{v} - \bm{u}) \mathrm{d} \tau \nonumber \\ & = f(\bm{u}) + \nabla f(\bm{u})^{\top} (\bm{v} - \bm{u}) + \int_{0}^{1} \int_{0}^{\tau} (\bm{v} - \bm{u})^{\top} \nabla^2 f \left(\bm{u} + \lambda (\bm{v} - \bm{u}) \right) (\bm{v} - \bm{u}) \mathrm{d} \lambda \mathrm{d} \tau \nonumber \\ & \le f(\bm{u}) + \nabla f(\bm{u})^{\top} (\bm{v} - \bm{u}) + \int_{0}^{1} \int_{0}^{\tau} L \Vert \bm{v} - \bm{u} \Vert_2^2 \mathrm{d} \lambda \mathrm{d} \tau \nonumber \\ & = f(\bm{u}) + \nabla f(\bm{u})^{\top} (\bm{v} - \bm{u}) + \frac{L}{2} \Vert \bm{v} - \bm{u} \Vert_2^2. \label{eq:lemma_1_tmp_1} \end{align}\]

Swapping \(\bm{u}\) and \(\bm{v}\), we have

\[\begin{align}\label{eq:lemma_1_tmp_2} f(\bm{u}) \le f(\bm{v}) + \nabla f(\bm{v})^{\top} (\bm{u} - \bm{v}) + \frac{L}{2} \Vert \bm{u} - \bm{v} \Vert_2^2. \end{align}\]

Summing Eq. \(\eqref{eq:lemma_1_tmp_1}\) and Eq. \(\eqref{eq:lemma_1_tmp_2}\), we have

\[\begin{align}\label{eq:lemma_1_tmp_3} \left( \nabla f(\bm{u}) - \nabla f(\bm{v}) \right)^{\top} (\bm{u} - \bm{v}) \le L \Vert \bm{u} - \bm{v} \Vert_2^2. \end{align}\]

Consider the function \(\phi(\bm{v}) = f(\bm{v}) - \nabla f(\bm{u})^{\top} \bm{v}\). It is clear that \(\phi\) is convex and twice differentiable. Also, \(\nabla \phi(\bm{v}) = \nabla f(\bm{v}) - \nabla f(\bm{u})\) and \(\nabla^2 \phi(\bm{v}) = \nabla^2 f(\bm{v})\). Therefore, similar to Eq. \(\eqref{eq:lemma_1_tmp_1}\), for any \(\bm{x}, \bm{y} \in \mathbb{R}^{n}\), we have

\[\begin{align} \phi(\bm{y}) & \le \phi(\bm{x}) + \nabla \phi(\bm{x})^{\top} (\bm{y} - \bm{x}) + \frac{L}{2} \Vert \bm{y} - \bm{x} \Vert_2^2. \label{eq:lemma_1_tmp_4} \end{align}\]

Setting \(\bm{x} = \bm{v}\) and \(\bm{y} = \bm{v} - \frac{1}{L} \nabla \phi(\bm{v})\), we have,

\[\begin{align}\label{eq:lemma_1_tmp_5} \phi(\bm{u}) \le \phi(\bm{v} - \frac{1}{L} \nabla \phi(\bm{v})) & \le \phi(\bm{v}) - \frac{1}{2L} \Vert \nabla \phi(\bm{v}) \Vert_2^2. \end{align}\]

where the first inequality is due to the fact that \(\phi\) attains its minimum at \(\bm{u}\), since \(\phi\) is convex and \(\nabla \phi(\bm{u}) = \nabla f(\bm{u}) - \nabla f(\bm{u}) = \bm{0}\). Therefore, for any \(\bm{u}, \bm{v} \in \mathbb{R}^{n}\), we have,

\[\begin{align}\label{eq:lemma_1_tmp_6} \frac{1}{2L} \Vert \nabla f(\bm{v}) - \nabla f(\bm{u}) \Vert_2^2 \le f(\bm{v}) - f(\bm{u}) - \nabla f(\bm{u})^{\top} (\bm{v} - \bm{u}) \end{align}\]

We again swap \(\bm{u}\) and \(\bm{v}\) to obtain

\[\begin{align}\label{eq:lemma_1_tmp_7} \frac{1}{2L} \Vert \nabla f(\bm{u}) - \nabla f(\bm{v}) \Vert_2^2 \le f(\bm{u}) - f(\bm{v}) - \nabla f(\bm{v})^{\top} (\bm{u} - \bm{v}) \end{align}\]

Summing Eq. \(\eqref{eq:lemma_1_tmp_6}\) and Eq. \(\eqref{eq:lemma_1_tmp_7}\), we have

\[\begin{align}\label{eq:lemma_1_tmp_8} \frac{1}{L} \Vert \nabla f(\bm{u}) - \nabla f(\bm{v}) \Vert_2^2 \le \left( \nabla f(\bm{u}) - \nabla f(\bm{v}) \right)^{\top} (\bm{u} - \bm{v}) \end{align}\]

Combining Eq. \(\eqref{eq:lemma_1_tmp_3}\) and Eq. \(\eqref{eq:lemma_1_tmp_8}\) gives \(\frac{1}{L} \Vert \nabla f(\bm{u}) - \nabla f(\bm{v}) \Vert_2^2 \le L \Vert \bm{u} - \bm{v} \Vert_2^2\), i.e., \(\Vert \nabla f(\bm{u}) - \nabla f(\bm{v}) \Vert_2 \le L \Vert \bm{u} - \bm{v} \Vert_2\), which finishes the proof of the “if” direction.

Now let us look at the “only if” direction. Again, we have the following line integral form for any \(\tau > 0\)

\[\begin{align} \int_{0}^{\tau} \nabla^2 f \left(\bm{u} + \lambda (\bm{v} - \bm{u}) \right) (\bm{v} - \bm{u}) \mathrm{d} \lambda = \nabla f(\bm{u} + \tau (\bm{v} - \bm{u})) - \nabla f(\bm{u}) \end{align}\]

Therefore, we have

\[\begin{align} \int_{0}^{\tau} (\bm{v} - \bm{u})^{\top} \nabla^2 f \left(\bm{u} + \lambda (\bm{v} - \bm{u}) \right) (\bm{v} - \bm{u}) \mathrm{d} \lambda & = (\bm{v} - \bm{u})^{\top} \left( \nabla f(\bm{u} + \tau (\bm{v} - \bm{u})) - \nabla f(\bm{u}) \right) \nonumber \\ & \le \left\Vert \bm{v} - \bm{u} \right\Vert \left\Vert \nabla f(\bm{u} + \tau (\bm{v} - \bm{u})) - \nabla f(\bm{u}) \right\Vert \nonumber \\ & \le \tau L \left\Vert \bm{v} - \bm{u} \right\Vert^2. \end{align}\]

Dividing both sides of the above inequality by \(\tau\) and taking the limit \(\tau \rightarrow 0^{+}\) gives \((\bm{v} - \bm{u})^{\top} \nabla^2 f(\bm{u}) (\bm{v} - \bm{u}) \le L \Vert \bm{v} - \bm{u} \Vert_2^2\). Since \(\bm{v}\) is arbitrary and the lower bound \(0 \le \bm{v}^{\top} \nabla^2 f(\bm{u}) \bm{v}\) follows from the convexity of \(f\), this finishes the proof.

\[\tag*{$\blacksquare$}\]

Working out the Lipschitz constant

Let us work out the Lipschitz constant of the softmax function. We can apply Lemma 1 to the log-sum-exp function \(f\). From Eq. \(\eqref{eq:hessian_lse}\), we have

\[\begin{align} \bm{v}^{\top} \nabla^2 f(\bm{x}) \bm{v} = \sum_{i=1}^{n} \bm{y}_i \bm{v}_i^2 - \left( \sum_{i=1}^{n} \bm{v}_i \bm{y}_i \right)^2 \le \Vert \bm{v} \Vert_2^2. \end{align}\]

Here the inequality holds because \(0 \le \bm{y}_i \le 1\) (so \(\sum_{i=1}^{n} \bm{y}_i \bm{v}_i^2 \le \Vert \bm{v} \Vert_2^2\)) and the subtracted square is nonnegative. Recall from Eq. \(\eqref{eq:gradient_lse}\) that the gradient of the log-sum-exp function is exactly the softmax function, so by Lemma 1 the softmax function is Lipschitz continuous with Lipschitz constant \(1\).
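
To see this bound in action, one can sample random pairs \((\bm{u}, \bm{v})\) and verify empirically that \(\Vert \text{softmax}(\bm{u}) - \text{softmax}(\bm{v}) \Vert_2 \le \Vert \bm{u} - \bm{v} \Vert_2\). The following NumPy sketch is an illustration only, not a proof.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

rng = np.random.default_rng(0)
ratios = []
for _ in range(10000):
    u = rng.normal(size=8)
    v = rng.normal(size=8)
    ratios.append(np.linalg.norm(softmax(u) - softmax(v)) / np.linalg.norm(u - v))

# Every ratio should be at most 1; empirically the maximum is typically well below 1.
print(max(ratios))
```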

Can we improve the Lipschitz constant further?

From Lemma 1, we know that if we can find the minimum value of \(L\) such that Eq. \(\eqref{eq:Lipschitz_condition}\) holds, then we obtain the smallest possible Lipschitz constant. Recalling the definition of the Rayleigh quotient, we have

\[\begin{align} R(\nabla^2 f(\bm{u}), \bm{v}) = \frac{\bm{v}^{\top} \nabla^2 f(\bm{u}) \bm{v}}{\bm{v}^{\top}\bm{v}} \end{align}\]

where \(f\) is again the log-sum-exp function. Therefore, fixing \(\bm{u}\), we have

\[\begin{align} \max_{\bm{v} \neq \bm{0}} R(\nabla^2 f(\bm{u}), \bm{v}) = \lambda_{\max}(\bm{u}), \end{align}\]

where \(\lambda_{\max}(\bm{u})\) is the maximum eigenvalue of \(\nabla^2 f(\bm{u})\). Therefore, the minimum \(L\) should satisfy \(L = \max_{\bm{u}} \lambda_{\max}(\bm{u})\). Since \(\nabla^2 f(\bm{u}) = J_g(\bm{u})\) and is positive semidefinite, we have \(\lambda_{\max}(\bm{u}) = \Vert J_g(\bm{u}) \Vert_2\) where \(J_g(\bm{u})\) is the Jacobian of the softmax function and \(\Vert J_g(\bm{u}) \Vert_2\) is its spectral norm. From Eq. \(\eqref{eq:jacobian_softmax}\), we can further rewrite it as a constrained optimization problem,

\[\begin{align} L = \max_{\bm{y}} \quad & \Vert \text{diag}(\bm{y}) - \bm{y} \bm{y}^{\top} \Vert_2 \nonumber \\ \text{s.t.} \quad & \sum_{i=1}^{n} \bm{y}_{i} = 1 \nonumber \\ & \bm{y}_{i} \ge 0 \end{align}\]

Unfortunately, it seems hard to work out an analytical solution to this problem, so we cannot obtain a simple closed-form expression for the tightest Lipschitz constant this way. Nevertheless, this formulation provides a way of approaching a better (smaller) constant, for instance numerically, as sketched below.
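
One simple way to probe this optimization problem numerically is to sample points \(\bm{y}\) from the probability simplex and record the largest spectral norm of \(\text{diag}(\bm{y}) - \bm{y} \bm{y}^{\top}\) encountered. The sketch below is a crude random search of my own; it only yields an empirical lower bound on the optimum, not a certificate of optimality.

```python
import numpy as np

rng = np.random.default_rng(0)
best = 0.0
for _ in range(50000):
    n = rng.integers(2, 10)               # try a few different dimensions
    y = rng.dirichlet(0.3 * np.ones(n))   # a random point on the simplex
    J = np.diag(y) - np.outer(y, y)
    best = max(best, np.linalg.norm(J, 2))  # spectral norm of the Jacobian

print(best)  # empirical lower bound on the tightest Lipschitz constant L
```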