Divergence Measures


This post summarizes useful results on divergence measures.

Divergence Measures

Below we use $\mathbb{P}$ and $\mathbb{Q}$ to denote two probability measures. When they are absolutely continuous with respect to the Lebesgue measure $\lambda$, the Radon–Nikodym derivatives $\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\lambda}$ and $\frac{\mathrm{d}\mathbb{Q}}{\mathrm{d}\lambda}$ reduce to the density functions $p$ and $q$. Thus, the following definitions of divergence measures can be expressed in terms of density functions whenever the densities exist. Furthermore, there is no loss of generality in assuming the existence of densities, since any pair of distributions has densities with respect to the base measure $\lambda=\frac{1}{2}(\mathbb{P}+\mathbb{Q})$.

Total variation distance

The total variation (TV) distance is defined as follows: $$TV(\mathbb{P},\mathbb{Q})= \sup\limits_{A\subseteq\Omega}|\mathbb{P}(A)-\mathbb{Q}(A)|,$$ where the supremum is over measurable subsets of $\Omega$. This is a very strong notion of distance: if the TV distance is small, then the probabilities that the two distributions assign to any event must be close.

If the domain $\Omega$ is discrete and $\mathbb{P}$ and $\mathbb{Q}$ have probability mass functions $p$ and $q$, then $$\begin{align*} TV(\mathbb{P},\mathbb{Q})&=1-\sum_{x \in \Omega} \min \{p(x), q(x)\} \\ &= \sum_{{x \in \Omega: p(x) \geq q(x)}}[p(x)-q(x)]\\ &= \frac{1}{2} \sum_{x \in \Omega}|p(x)-q(x)|. \end{align*}$$ The TV distance also belongs to a popular class of distances between probability distributions: it is an integral probability metric (IPM); other popular examples of IPMs include the Wasserstein distance. As an IPM, the TV distance can be written as $$TV(\mathbb{P},\mathbb{Q})=\frac{1}{2} \sup_{||f||_{\infty} \leq 1}\left|\mathbb{E}_{X \sim \mathbb{P}}[f(X)]-\mathbb{E}_{Y \sim \mathbb{Q}} [f(Y)]\right|.$$
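As a quick sanity check, here is a minimal NumPy sketch (the pmfs `p` and `q` are made-up toy examples) evaluating the three equivalent expressions above:

```python
import numpy as np

# Toy probability mass functions on a common three-point domain
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# Three equivalent expressions for TV(p, q)
tv1 = 1.0 - np.minimum(p, q).sum()
tv2 = (p - q)[p >= q].sum()
tv3 = 0.5 * np.abs(p - q).sum()

print(tv1, tv2, tv3)  # all three print 0.3
```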

Squared Hellinger distance

The squared Hellinger distance is defined as $$H^2(\mathbb{P},\mathbb{Q})=\frac{1}{2}\int\left(\sqrt{\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\lambda}}- \sqrt{\frac{\mathrm{d}\mathbb{Q}}{\mathrm{d}\lambda}}\right)^2\mathrm{d}\lambda.$$ We can also define it for the corresponding density functions (or probability mass functions) $p$ and $q$ as $$ H^{2}(p,q)={\frac {1}{2}}\int \left({\sqrt {p(x)}}-{\sqrt {q(x)}}\right)^{2}\mathrm{d}x=1-\int {\sqrt {p(x)q(x)}}\mathrm{d}x.$$ The Hellinger distance and the squared Hellinger distance are bounded between 0 and 1: $$0\leq H^2(\mathbb{P},\mathbb{Q})\leq H(\mathbb{P},\mathbb{Q})\leq 1.$$
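A similar sketch, using the second (affinity) form above on made-up pmfs and checking $0\leq H^2\leq H\leq 1$ numerically:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# H^2(p, q) = 1 - sum_x sqrt(p(x) q(x))
h2 = 1.0 - np.sqrt(p * q).sum()
h = np.sqrt(h2)

assert 0.0 <= h2 <= h <= 1.0
print(h2, h)  # approx 0.068 and 0.260
```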

Kullback–Leibler divergence

The Kullback–Leibler (KL) divergence is defined as $$KL(\mathbb{P}, \mathbb{Q}) = \mathbb{E}_{\mathbb{P}}\left[\log\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\right],$$ provided $\mathbb{P}$ is absolutely continuous with respect to $\mathbb{Q}$ (and $KL(\mathbb{P},\mathbb{Q})=\infty$ otherwise). In terms of densities, $$KL(p, q) = \int p(x) \log \frac{p(x)}{q(x)} \mathrm{d}x = - \int p(x) \log q(x) \mathrm{d}x + \int p(x) \log p(x) \mathrm{d}x,$$ i.e., the cross-entropy between $p$ and $q$ minus the entropy of $p$.
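For discrete distributions the integral becomes a sum; a minimal sketch with made-up pmfs, also illustrating that KL is not symmetric:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

# KL(p || q) = sum_x p(x) log(p(x) / q(x))
kl_pq = (p * np.log(p / q)).sum()
kl_qp = (q * np.log(q / p)).sum()

print(kl_pq, kl_qp)  # approx 0.218 and 0.239: KL is asymmetric
```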

Chi-square divergence

The chi-square divergence is defined as $$\chi^2(\mathbb{P},\mathbb{Q})=\int\left(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}-1\right)^2\mathrm{d}\mathbb{Q},$$ or, in terms of density functions, $$\chi^2(p,q)=\int\frac{(p(x)-q(x))^2}{q(x)}\mathrm{d}x=\int\frac{p(x)^2}{q(x)}\mathrm{d}x-1.$$
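A quick numerical check of the two equivalent forms on made-up pmfs:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

# chi^2(p || q) = sum_x (p(x) - q(x))^2 / q(x) = sum_x p(x)^2 / q(x) - 1
chi2_a = ((p - q) ** 2 / q).sum()
chi2_b = (p ** 2 / q).sum() - 1.0

print(chi2_a, chi2_b)  # both print 0.44
```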

Properties

Comparison

$$H^2(\mathbb{P},\mathbb{Q})\leq TV(\mathbb{P},\mathbb{Q})\leq \sqrt{2} H(\mathbb{P},\mathbb{Q}) \lesssim\sqrt{KL(\mathbb{P},\mathbb{Q})} \lesssim \sqrt{\chi^2(\mathbb{P},\mathbb{Q})},$$ where $\lesssim$ hides universal constants: Pinsker's inequality gives $TV(\mathbb{P},\mathbb{Q})\leq\sqrt{KL(\mathbb{P},\mathbb{Q})/2}$, and $KL(\mathbb{P},\mathbb{Q})\leq\log\left(1+\chi^2(\mathbb{P},\mathbb{Q})\right)\leq\chi^2(\mathbb{P},\mathbb{Q})$.
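A numerical check of this chain (the exact inequalities plus the two constant-explicit bounds above) on the same made-up pmfs:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

tv = 0.5 * np.abs(p - q).sum()
h2 = 1.0 - np.sqrt(p * q).sum()
kl = (p * np.log(p / q)).sum()
chi2 = ((p - q) ** 2 / q).sum()

# H^2 <= TV <= sqrt(2) H
assert h2 <= tv <= np.sqrt(2.0) * np.sqrt(h2)
# Pinsker: TV <= sqrt(KL / 2)
assert tv <= np.sqrt(kl / 2.0)
# KL <= log(1 + chi^2) <= chi^2
assert kl <= np.log1p(chi2) <= chi2
print("all inequalities hold")
```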

Tensorization

If $X_1,\ldots,X_n\overset{iid}{\sim}\mathbb{P}$, then they are associated with the product measure $\mathbb{P}^n$. What is the relationship between the divergence of the product measures and that of the individual measures? It turns out that for the KL divergence and the Hellinger distance we have $$\begin{align*} KL(\mathbb{P}^n, \mathbb{Q}^n) &= n KL(\mathbb{P}, \mathbb{Q})\\ H^2(\mathbb{P}^n,\mathbb{Q}^n) &\leq n H^2(\mathbb{P},\mathbb{Q}). \end{align*}$$ The Hellinger bound follows from the exact identity $H^2(\mathbb{P}^n,\mathbb{Q}^n)=1-\left(1-H^2(\mathbb{P},\mathbb{Q})\right)^n$, since the affinity $1-H^2=\int\sqrt{pq}\,\mathrm{d}\lambda$ tensorizes multiplicatively.
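A brute-force check on a small discrete example, enumerating the $n$-fold product pmfs explicitly (toy values again):

```python
import numpy as np
from itertools import product

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
n = 3

# Enumerate the n-fold product pmfs p^n and q^n explicitly
pn = np.array([np.prod([p[i] for i in idx]) for idx in product(range(len(p)), repeat=n)])
qn = np.array([np.prod([q[i] for i in idx]) for idx in product(range(len(q)), repeat=n)])

kl, kl_n = (p * np.log(p / q)).sum(), (pn * np.log(pn / qn)).sum()
assert np.isclose(kl_n, n * kl)  # KL tensorizes exactly

h2, h2_n = 1.0 - np.sqrt(p * q).sum(), 1.0 - np.sqrt(pn * qn).sum()
assert np.isclose(h2_n, 1.0 - (1.0 - h2) ** n)  # exact Hellinger identity
assert h2_n <= n * h2  # the tensorization bound
```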

Example

Suppose $p$ and $q$ are the density functions of $\mathcal{N}(\mu_1,\sigma_1^2)$ and $\mathcal{N}(\mu_2,\sigma_2^2)$ respectively; then $$\begin{align*} KL(p, q) &= - \int p(x) \log q(x) \mathrm{d}x + \int p(x) \log p(x) \mathrm{d}x\\ &=\frac{1}{2} \log (2 \pi \sigma_2^2) + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2} (1 + \log 2 \pi \sigma_1^2)\\ &= \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}. \end{align*}$$
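A minimal sketch checking the closed form against direct numerical integration (the parameter values are arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

# Closed-form KL between N(mu1, s1^2) and N(mu2, s2^2)
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Numerical integral of p(x) log(p(x) / q(x))
integrand = lambda x: norm.pdf(x, mu1, s1) * (norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2))
kl_numeric, _ = quad(integrand, -np.inf, np.inf)

print(kl_closed, kl_numeric)  # both approx 0.4431
```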