Divergence Measures


This post summarizes useful results on divergence measures.

Divergence Measures

Below we use $\mathbb{P}$ and $\mathbb{Q}$ to denote two probability measures. When they are absolutely continuous with respect to the Lebesgue measure $\lambda$, the Radon–Nikodym derivatives $\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\lambda}$ and $\frac{\mathrm{d}\mathbb{Q}}{\mathrm{d}\lambda}$ reduce to the density functions $p$ and $q$. Thus, the following definitions of divergence measures can be expressed in terms of density functions whenever the densities exist. Furthermore, there is no loss of generality in assuming the existence of densities, since any pair of distributions has densities with respect to the base measure $\lambda=\frac{1}{2}(\mathbb{P}+\mathbb{Q})$.

Total variation distance

The total variation (TV) distance is defined as follows: $$TV(\mathbb{P},\mathbb{Q})= \sup\limits_{A\subseteq\Omega}|\mathbb{P}(A)-\mathbb{Q}(A)|,$$ where the supremum is over measurable subsets of $\Omega$. This is a very strong notion of distance: if the TV distance is small, then the probabilities that the two distributions assign to any event must be close.

If the domain $\Omega$ is discrete and $\mathbb{P}$ and $\mathbb{Q}$ have probability mass functions $p$ and $q$, then $$\begin{align*} TV(\mathbb{P},\mathbb{Q})&=1-\sum_{x \in \Omega} \min \{p(x), q(x)\} \\ &= \sum_{{x \in \Omega: p(x) \geq q(x)}}[p(x)-q(x)]\\ &= \frac{1}{2} \sum_{x \in \Omega}|p(x)-q(x)|. \end{align*}$$ The TV distance also belongs to a popular class of distances between probability distributions: it is an integral probability metric (IPM); other popular examples of IPMs include the Wasserstein distance. As an IPM, the TV distance can be written as $$TV(\mathbb{P},\mathbb{Q})=\frac{1}{2} \sup_{||f||_{\infty} \leq 1}\left|\mathbb{E}_{X \sim \mathbb{P}}[f(X)]-\mathbb{E}_{Y \sim \mathbb{Q}} [f(Y)]\right|.$$
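As a quick sanity check, here is a minimal NumPy sketch (the pmfs `p` and `q` are made-up toy examples) evaluating the three equivalent expressions above:

```python
import numpy as np

# Toy probability mass functions on a common three-point domain
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# Three equivalent expressions for TV(p, q)
tv1 = 1.0 - np.minimum(p, q).sum()
tv2 = (p - q)[p >= q].sum()
tv3 = 0.5 * np.abs(p - q).sum()

print(tv1, tv2, tv3)  # all three print 0.3
```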

Squared Hellinger distance

The squared Hellinger distance is defined as $$H^2(\mathbb{P},\mathbb{Q})=\frac{1}{2}\int\left(\sqrt{\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\lambda}}- \sqrt{\frac{\mathrm{d}\mathbb{Q}}{\mathrm{d}\lambda}}\right)^2\mathrm{d}\lambda.$$ We can also define it for the corresponding density functions (or probability mass functions) $p$ and $q$ as $$ H^{2}(p,q)={\frac {1}{2}}\int \left({\sqrt {p(x)}}-{\sqrt {q(x)}}\right)^{2}\mathrm{d}x=1-\int {\sqrt {p(x)q(x)}}\mathrm{d}x.$$ The Hellinger distance and the squared Hellinger distance are bounded between 0 and 1: $$0\leq H^2(\mathbb{P},\mathbb{Q})\leq H(\mathbb{P},\mathbb{Q})\leq 1.$$
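A similar sketch, using the second (affinity) form above on made-up pmfs and checking $0\leq H^2\leq H\leq 1$ numerically:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# H^2(p, q) = 1 - sum_x sqrt(p(x) q(x))
h2 = 1.0 - np.sqrt(p * q).sum()
h = np.sqrt(h2)

assert 0.0 <= h2 <= h <= 1.0
print(h2, h)  # approx 0.068 and 0.260
```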

Kullback–Leibler divergence

The Kullback–Leibler (KL) divergence is defined as $$KL(\mathbb{P}, \mathbb{Q}) = \mathbb{E}_{\mathbb{P}}\left[\log\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\right],$$ provided $\mathbb{P}$ is absolutely continuous with respect to $\mathbb{Q}$ (and $KL(\mathbb{P},\mathbb{Q})=\infty$ otherwise). In terms of densities, $$KL(p, q) = \int p(x) \log \frac{p(x)}{q(x)} \mathrm{d}x = - \int p(x) \log q(x) \mathrm{d}x + \int p(x) \log p(x) \mathrm{d}x,$$ i.e., the cross-entropy between $p$ and $q$ minus the entropy of $p$.
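For discrete distributions the integral becomes a sum; a minimal sketch with made-up pmfs, also illustrating that KL is not symmetric:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

# KL(p || q) = sum_x p(x) log(p(x) / q(x))
kl_pq = (p * np.log(p / q)).sum()
kl_qp = (q * np.log(q / p)).sum()

print(kl_pq, kl_qp)  # approx 0.218 and 0.239: KL is asymmetric
```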

Chi-square divergence

The chi-square divergence is defined as $$\chi^2(\mathbb{P},\mathbb{Q})=\int\left(\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}-1\right)^2\mathrm{d}\mathbb{Q},$$ or, in terms of density functions, $$\chi^2(p,q)=\int\frac{(p(x)-q(x))^2}{q(x)}\mathrm{d}x=\int\frac{p(x)^2}{q(x)}\mathrm{d}x-1.$$
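A quick numerical check of the two equivalent forms on made-up pmfs:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

# chi^2(p || q) = sum_x (p(x) - q(x))^2 / q(x) = sum_x p(x)^2 / q(x) - 1
chi2_a = ((p - q) ** 2 / q).sum()
chi2_b = (p ** 2 / q).sum() - 1.0

print(chi2_a, chi2_b)  # both print 0.44
```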

Properties

Comparison

$$H^2(\mathbb{P},\mathbb{Q})\leq TV(\mathbb{P},\mathbb{Q})\leq \sqrt{2} H(\mathbb{P},\mathbb{Q}) \lesssim\sqrt{KL(\mathbb{P},\mathbb{Q})} \lesssim \sqrt{\chi^2(\mathbb{P},\mathbb{Q})},$$ where $\lesssim$ hides universal constants: Pinsker's inequality gives $TV(\mathbb{P},\mathbb{Q})\leq\sqrt{KL(\mathbb{P},\mathbb{Q})/2}$, and $KL(\mathbb{P},\mathbb{Q})\leq\log\left(1+\chi^2(\mathbb{P},\mathbb{Q})\right)\leq\chi^2(\mathbb{P},\mathbb{Q})$.
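A numerical check of this chain (the exact inequalities plus the two constant-explicit bounds above) on the same made-up pmfs:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

tv = 0.5 * np.abs(p - q).sum()
h2 = 1.0 - np.sqrt(p * q).sum()
kl = (p * np.log(p / q)).sum()
chi2 = ((p - q) ** 2 / q).sum()

# H^2 <= TV <= sqrt(2) H
assert h2 <= tv <= np.sqrt(2.0) * np.sqrt(h2)
# Pinsker: TV <= sqrt(KL / 2)
assert tv <= np.sqrt(kl / 2.0)
# KL <= log(1 + chi^2) <= chi^2
assert kl <= np.log1p(chi2) <= chi2
print("all inequalities hold")
```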

Tensorization

If $X_1,\ldots,X_n\overset{iid}{\sim}\mathbb{P}$, then they are associated with the product measure $\mathbb{P}^n$. What is the relationship between the divergence of the product measures and that of the individual measures? It turns out that for the KL divergence and the Hellinger distance we have $$\begin{align*} KL(\mathbb{P}^n, \mathbb{Q}^n) &= n KL(\mathbb{P}, \mathbb{Q})\\ H^2(\mathbb{P}^n,\mathbb{Q}^n) &\leq n H^2(\mathbb{P},\mathbb{Q}). \end{align*}$$ The Hellinger bound follows from the exact identity $H^2(\mathbb{P}^n,\mathbb{Q}^n)=1-\left(1-H^2(\mathbb{P},\mathbb{Q})\right)^n$, since the affinity $1-H^2=\int\sqrt{pq}\,\mathrm{d}\lambda$ tensorizes multiplicatively.
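A brute-force check on a small discrete example, enumerating the $n$-fold product pmfs explicitly (toy values again):

```python
import numpy as np
from itertools import product

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
n = 3

# Enumerate the n-fold product pmfs p^n and q^n explicitly
pn = np.array([np.prod([p[i] for i in idx]) for idx in product(range(len(p)), repeat=n)])
qn = np.array([np.prod([q[i] for i in idx]) for idx in product(range(len(q)), repeat=n)])

kl, kl_n = (p * np.log(p / q)).sum(), (pn * np.log(pn / qn)).sum()
assert np.isclose(kl_n, n * kl)  # KL tensorizes exactly

h2, h2_n = 1.0 - np.sqrt(p * q).sum(), 1.0 - np.sqrt(pn * qn).sum()
assert np.isclose(h2_n, 1.0 - (1.0 - h2) ** n)  # exact Hellinger identity
assert h2_n <= n * h2  # the tensorization bound
```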

Example

Suppose $p$ and $q$ are the density functions of $\mathcal{N}(\mu_1,\sigma_1^2)$ and $\mathcal{N}(\mu_2,\sigma_2^2)$ respectively; then $$\begin{align*} KL(p, q) &= - \int p(x) \log q(x) \mathrm{d}x + \int p(x) \log p(x) \mathrm{d}x\\ &=\frac{1}{2} \log (2 \pi \sigma_2^2) + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2} (1 + \log 2 \pi \sigma_1^2)\\ &= \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}. \end{align*}$$
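A minimal sketch checking the closed form against direct numerical integration (the parameter values are arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

# Closed-form KL between N(mu1, s1^2) and N(mu2, s2^2)
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Numerical integral of p(x) log(p(x) / q(x))
integrand = lambda x: norm.pdf(x, mu1, s1) * (norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2))
kl_numeric, _ = quad(integrand, -np.inf, np.inf)

print(kl_closed, kl_numeric)  # both approx 0.4431
```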