Distances between two multivariate normals

Elementary results complementary to the Matrix Cookbook

KL divergence of two Gaussians

Let the two Gaussians be the $d$-dimensional $\mathcal{N}(\mu_p, \Sigma_p)$ and $\mathcal{N}(\mu_q, \Sigma_q)$. Their KL divergence is

\[\boxed{D_{KL}(p || q) := \int p(x) \log \frac{p(x)}{q(x)} \, dx = \frac{1}{2} \left[ \log\frac{|\Sigma_q|}{|\Sigma_p|} - d + (\mu_p - \mu_q)^T \Sigma_q^{-1} (\mu_p - \mu_q) + \text{tr}(\Sigma_q^{-1}\Sigma_p) \right].}\]

This formula is used, for example, in the KL regularization term of variational autoencoders.
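As a numerical sanity check, here is a minimal NumPy sketch of the boxed formula (the function name `kl_mvn` is mine, for illustration):

```python
import numpy as np

def kl_mvn(mu_p, cov_p, mu_q, cov_q):
    """KL( N(mu_p, cov_p) || N(mu_q, cov_q) ), via the boxed closed form."""
    d = mu_p.shape[0]
    diff = mu_p - mu_q
    cov_q_inv = np.linalg.inv(cov_q)
    # slogdet returns (sign, log|det|); working with logs avoids over/underflow
    log_det_ratio = np.linalg.slogdet(cov_q)[1] - np.linalg.slogdet(cov_p)[1]
    return 0.5 * (log_det_ratio - d
                  + diff @ cov_q_inv @ diff
                  + np.trace(cov_q_inv @ cov_p))
```

A quick property check: `kl_mvn(mu, cov, mu, cov)` returns 0 for any valid mean and covariance, as it should.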

Proof. By definition,

\[\begin{align*} D_{KL}(p || q) = \mathbb{E}_p [\log p - \log q] &= \mathbb{E}_p \left[ -\frac{1}{2}(x-\mu_p)^T\Sigma_p^{-1}(x-\mu_p) -\frac{1}{2}\log|\Sigma_p| \right. \\ &\qquad\quad \left. +\frac{1}{2}(x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q) +\frac{1}{2}\log|\Sigma_q| \right]. \end{align*}\]

Clearly,

\[\mathbb{E}_p \left[ -\frac{1}{2}\log|\Sigma_p| \right] = -\frac{1}{2}\log|\Sigma_p|, \quad \mathbb{E}_p \left[ \frac{1}{2}\log|\Sigma_q| \right] = \frac{1}{2}\log|\Sigma_q|,\]

and

\[\mathbb{E}_p \left[ -\frac{1}{2}(x-\mu_p)^T\Sigma_p^{-1}(x-\mu_p) \right] = -\frac{1}{2}\mathbb{E}_p \left[ \text{tr}\left( \Sigma_p^{-1}(x-\mu_p)(x-\mu_p)^T \right) \right] = -\frac{1}{2}\text{tr}(\Sigma_p^{-1}\Sigma_p) = -\frac{d}{2},\]

by the trace trick, since $\mathbb{E}_p[(x-\mu_p)(x-\mu_p)^T] = \Sigma_p$.

Hence, the KL divergence simplifies to

\[D_{KL}(p || q) = \frac{1}{2} \left[ \log\frac{|\Sigma_q|}{|\Sigma_p|} - d \right] +\frac{1}{2} \mathbb{E}_p \left[ (x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q) \right].\]

We now compute the remaining expectation using the change of variables $y = \Sigma_p^{-\frac{1}{2}}(x-\mu_p)$, under which $y$ is standard normal. Then, $\Sigma_q^{-\frac{1}{2}}(x-\mu_q) = \Sigma_q^{-\frac{1}{2}}(\Sigma_p^{\frac{1}{2}}y + \mu_p - \mu_q)$ and

\[\begin{align*} \mathbb{E}_{x \sim p} \left[ (x-\mu_q)^T\Sigma_q^{-1}(x-\mu_q) \right] &= \mathbb{E}_{y \sim \mathcal{N}(0,I)} \left[ (\Sigma_p^{\frac{1}{2}}y + \mu_p - \mu_q)^T\Sigma_q^{-1}(\Sigma_p^{\frac{1}{2}}y + \mu_p - \mu_q) \right] \\ &= \mathbb{E}_{y \sim \mathcal{N}(0,I)} \left[ y^T\Sigma_p^{\frac{1}{2}}\Sigma_q^{-1}\Sigma_p^{\frac{1}{2}}y \right] + (\mu_p - \mu_q)^T\Sigma_q^{-1}(\mu_p - \mu_q) \\ &= \text{tr}(\Sigma_q^{-1}\Sigma_p) + (\mu_p - \mu_q)^T\Sigma_q^{-1}(\mu_p - \mu_q), \end{align*}\]

where the cross terms vanish because $\mathbb{E}[y] = 0$, and the last equality uses the trace trick again along with the cyclicity of the trace, $\text{tr}(\Sigma_p^{\frac{1}{2}}\Sigma_q^{-1}\Sigma_p^{\frac{1}{2}}) = \text{tr}(\Sigma_q^{-1}\Sigma_p)$. This completes the proof. $\qquad\square$
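The last identity is easy to check by Monte Carlo; a sketch assuming NumPy (the dimensions, seed, and test matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d)); cov_p = A @ A.T + np.eye(d)  # SPD by construction
B = rng.standard_normal((d, d)); cov_q = B @ B.T + np.eye(d)
mu_p, mu_q = rng.standard_normal(d), rng.standard_normal(d)

x = rng.multivariate_normal(mu_p, cov_p, size=500_000)        # samples x ~ p
cov_q_inv = np.linalg.inv(cov_q)
# per-sample quadratic form (x - mu_q)^T cov_q^{-1} (x - mu_q)
quad = np.einsum('ni,ij,nj->n', x - mu_q, cov_q_inv, x - mu_q)

closed = (np.trace(cov_q_inv @ cov_p)
          + (mu_p - mu_q) @ cov_q_inv @ (mu_p - mu_q))
print(quad.mean(), closed)   # the two numbers agree up to Monte Carlo error
```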


Wasserstein-2 distance of two Gaussians

Let the two Gaussians be the $d$-dimensional $\mathcal{N}(\mu_p, \Sigma_p)$ and $\mathcal{N}(\mu_q, \Sigma_q)$. Their squared Wasserstein-2 distance is

\[\boxed{W_2^2(p,q) := \inf \mathbb{E}[ \| x_p - x_q \|_2^2 ] = \| \mu_p - \mu_q \|_2^2 + \text{tr} \left( \Sigma_p + \Sigma_q - 2(\Sigma_p^{\frac{1}{2}}\Sigma_q\Sigma_p^{\frac{1}{2}})^{\frac{1}{2}} \right),}\]

where the infimum is over all couplings of $p$ and $q$, that is, over all joint distributions of $(x_p, x_q)$ with marginals $x_p \sim p$ and $x_q \sim q$.

Moreover, an optimal transport map from $p$ to $q$ is the affine map

\begin{equation}\label{eqn:map} x \mapsto \mu_q + \Sigma_p^{-\frac{1}{2}}(\Sigma_p^{\frac{1}{2}}\Sigma_q\Sigma_p^{\frac{1}{2}})^{\frac{1}{2}}\Sigma_p^{-\frac{1}{2}} (x - \mu_p). \end{equation}
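Both the boxed distance and the map \eqref{eqn:map} reduce to a few matrix square roots. A minimal sketch, assuming SciPy's `sqrtm` (the function names are mine):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared(mu_p, cov_p, mu_q, cov_q):
    """Squared Wasserstein-2 distance between two Gaussians (boxed formula)."""
    cov_p_half = sqrtm(cov_p)
    cross = sqrtm(cov_p_half @ cov_q @ cov_p_half)
    # sqrtm can return tiny spurious imaginary parts; keep the real part
    return np.sum((mu_p - mu_q) ** 2) + np.trace(cov_p + cov_q - 2 * cross).real

def ot_map_matrix(cov_p, cov_q):
    """The matrix A of the optimal map x -> mu_q + A (x - mu_p)."""
    cov_p_half = sqrtm(cov_p)
    cov_p_half_inv = np.linalg.inv(cov_p_half)
    return (cov_p_half_inv @ sqrtm(cov_p_half @ cov_q @ cov_p_half)
            @ cov_p_half_inv).real
```

One can check that `A = ot_map_matrix(cov_p, cov_q)` satisfies `A @ cov_p @ A.T ≈ cov_q`, i.e., the map indeed pushes $p$ to $q$; a numerical check appears after the proof below.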

A proof was given by Clark R. Givens and Rae Michael Shortt. Here, I give a shorter one.

Proof. Let $x_p = \mu_p + \Sigma_p^{\frac{1}{2}}\epsilon_p$ and $x_q = \mu_q + \Sigma_q^{\frac{1}{2}}\epsilon_q$, where $\epsilon_p$ and $\epsilon_q$ are standard normal but not necessarily independent: any coupling of $p$ and $q$ can be written this way (assuming invertible covariances), with the coupling captured by the joint distribution of $(\epsilon_p, \epsilon_q)$. Then,

\[\begin{align*} \mathbb{E}[ \| x_p - x_q \|_2^2 ] &= \mathbb{E}[ \| \mu_p - \mu_q + \Sigma_p^{\frac{1}{2}}\epsilon_p - \Sigma_q^{\frac{1}{2}}\epsilon_q \|_2^2 ] \\ &= \| \mu_p - \mu_q \|_2^2 + \mathbb{E}[ \| \Sigma_p^{\frac{1}{2}}\epsilon_p - \Sigma_q^{\frac{1}{2}}\epsilon_q \|_2^2 ] \\ &= \| \mu_p - \mu_q \|_2^2 + \text{tr}(\Sigma_p) + \text{tr}(\Sigma_q) - 2\text{tr}( \Sigma_p^{\frac{1}{2}} \cdot \mathbb{E}[\epsilon_p\epsilon_q^T] \cdot \Sigma_q^{\frac{1}{2}} ). \end{align*}\]

We now seek the cross-covariance $C := \mathbb{E}[\epsilon_p\epsilon_q^T]$ that maximizes $\text{tr}( \Sigma_p^{\frac{1}{2}} C \Sigma_q^{\frac{1}{2}} )$. To this end, we write $\text{tr}( \Sigma_p^{\frac{1}{2}} C \Sigma_q^{\frac{1}{2}} ) = \text{tr}(CM)$, where $M = \Sigma_q^{\frac{1}{2}} \Sigma_p^{\frac{1}{2}}$, and note that when $C$ is square, $C$ is a valid cross-covariance iff $C^TC \preceq I$; this follows from requiring the joint covariance $\begin{pmatrix} I & C \\ C^T & I \end{pmatrix}$ to be positive semidefinite and taking a Schur complement. (In fact, this is also the condition when $C$ has more rows than columns; but if $C$ has more columns than rows, the iff condition is $CC^T \preceq I$ instead.)
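This iff condition is easy to probe numerically: scaling a random matrix so that its largest singular value is at most 1 always yields a positive semidefinite joint covariance. A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
C = rng.standard_normal((d, d))
C /= np.linalg.norm(C, 2)            # now the largest singular value is exactly 1

joint = np.block([[np.eye(d), C], [C.T, np.eye(d)]])   # covariance of (eps_p, eps_q)
print(np.linalg.eigvalsh(joint).min() >= -1e-10)            # True: PSD, valid coupling
print(np.linalg.eigvalsh(np.eye(d) - C.T @ C).min() >= -1e-10)  # True: C^T C <= I
```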

Write the SVDs of $C$ and $M$: $C = PTQ^T$ and $M = USV^T$. Then $\text{tr}(CM) = \text{tr}(PTQ^TUSV^T) = \text{tr}(TWSZ^T)$ by the cyclicity of the trace, where $W:=Q^TU$ and $Z:=P^TV$ are orthogonal. The diagonal matrix $T$ contains the singular values of $C$; because $C^TC \preceq I$, these all lie between 0 and 1. Let the columns of $W$ be $w_i$ and those of $Z$ be $z_i$, let the singular values of $M$ (the diagonal elements of $S$) be $s_i$, and let the vector $t$ denote the diagonal of $T$. Then, we write the matrix whose trace is of concern as an outer-product sum, akin to how the SVD can be written:

\[TWSZ^T = \sum_{i=1}^d s_i ( t \circ w_i) z_i^T \quad\Rightarrow\quad \text{tr}(TWSZ^T) = \sum_{i=1}^d s_i z_i^T ( t \circ w_i).\]

The outer-product sum turns into an inner-product sum under the trace operator. Because $w_i$ has unit 2-norm and the elements of $t$ lie between 0 and 1, the 2-norm of $t \circ w_i$ is bounded by 1. Then, because $z_i$ also has unit 2-norm, the Cauchy–Schwarz inequality gives $z_i^T ( t \circ w_i) \le 1$. Overall,

\[\sum_{i=1}^d s_i z_i^T ( t \circ w_i) \le \sum_{i=1}^d s_i,\]

with equality when $t$ is a vector of all ones and $z_i = w_i$ for all $i$. In other words, $\text{tr}(TWSZ^T)$ achieves the least upper bound $\sum_{i=1}^d s_i$ when $C = PQ^T$ and $Q^TU = P^TV$. The latter condition gives $V = PQ^TU$, so $C = PQ^T = VU^T$, the transpose of the unitary polar factor of $M$.

To summarize, $\text{tr}(CM)$ attains its maximum when $C$ is the transpose of the unitary polar factor of $M$. Denoting by $^+$ the pseudoinverse, this matrix can be written as $M^T((MM^T)^+)^{\frac{1}{2}}$, which ultimately leads to

\[\max_{C^TC \preceq I} \text{tr}(CM) = \text{tr}\left( M^T((MM^T)^+)^{\frac{1}{2}}M \right) = \text{tr}\left( (MM^T)^{\frac{1}{2}} \right) = \text{tr}\left( (\Sigma_q^{\frac{1}{2}} \Sigma_p \Sigma_q^{\frac{1}{2}})^{\frac{1}{2}} \right),\]

which completes the proof of the boxed formula: $\Sigma_q^{\frac{1}{2}} \Sigma_p \Sigma_q^{\frac{1}{2}}$ and $\Sigma_p^{\frac{1}{2}} \Sigma_q \Sigma_p^{\frac{1}{2}}$ are similar to $\Sigma_q\Sigma_p$ and $\Sigma_p\Sigma_q$, respectively, which share the same eigenvalues, so the traces of their square roots coincide.
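The maximizer can also be confirmed numerically with SciPy's polar decomposition, which returns the unitary polar factor directly (a sketch; the test matrix is arbitrary):

```python
import numpy as np
from scipy.linalg import polar, sqrtm

rng = np.random.default_rng(2)
d = 4
M = rng.standard_normal((d, d))

u, h = polar(M)      # M = u h, with u the unitary polar factor and h PSD
C_opt = u.T          # the maximizing cross-covariance derived above
print(np.trace(C_opt @ M))                        # max of tr(CM)
print(np.trace(sqrtm(M @ M.T)).real)              # tr((M M^T)^{1/2}), same value
print(np.linalg.svd(M, compute_uv=False).sum())   # sum of singular values, same again
```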

Finally, it is straightforward to verify that the map \eqref{eqn:map} pushes $p$ forward to $q$ (writing it as $x \mapsto \mu_q + A(x - \mu_p)$, one checks that $A\Sigma_p A^T = \Sigma_q$) and results in $\mathbb{E}[ \| x_p - x_q \|_2^2 ] = \| \mu_p - \mu_q \|_2^2 + \text{tr} \left( \Sigma_p + \Sigma_q - 2(\Sigma_p^{\frac{1}{2}}\Sigma_q\Sigma_p^{\frac{1}{2}})^{\frac{1}{2}} \right)$, attaining the infimum. $\qquad\square$
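For completeness, here is a sketch of that verification, reusing `w2_squared` and `ot_map_matrix` from the earlier sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
A = rng.standard_normal((d, d)); cov_p = A @ A.T + np.eye(d)
B = rng.standard_normal((d, d)); cov_q = B @ B.T + np.eye(d)
mu_p, mu_q = rng.standard_normal(d), rng.standard_normal(d)

T = ot_map_matrix(cov_p, cov_q)            # symmetric, so T.T == T
print(np.allclose(T @ cov_p @ T, cov_q))   # True: the map pushes p to q

x = rng.multivariate_normal(mu_p, cov_p, size=500_000)
y = mu_q + (x - mu_p) @ T                  # apply the map row-wise (T symmetric)
cost = np.mean(np.sum((x - y) ** 2, axis=1))
print(cost, w2_squared(mu_p, cov_p, mu_q, cov_q))  # equal up to Monte Carlo error
```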



