Uniform Laws
This post summarizes empirical process theory and the estimation of statistical functionals.
Uniform Laws
Uniform Convergence of the CDF
To estimate the cdf of a univariate random variable, one would naturally use the empirical cdf, i.e., $$\hat{F}_n(x)=\frac{1}{n}\sum\limits_{i=1}^n\mathbf{1}\{X_i\leq x\}$$ based on samples $X_1,\ldots,X_n\sim F$. For each fixed $x\in\mathcal{X}$, the empirical cdf at $x$ is an unbiased estimate of $F(x)$: $$\mathbb{E}[\hat{F}_n(x)]=F(x),\ \forall\ x\in\mathcal{X}.$$ Since $\hat{F}_n(x)$ is an average of bounded i.i.d. indicators, we can use the Hoeffding bound to show that $$\mathbb{P}(|\hat{F}_n(x)-F(x)|\geq \epsilon)\leq 2\exp(-2n\epsilon^2),\ \forall\ x\in\mathcal{X}.$$ The difficulty is to show that this difference is small uniformly over all $x$. The following theorem gives us the answer:
(Glivenko-Cantelli Theorem) For any distribution $F$, $\Delta\triangleq\sup\limits_{x\in\mathcal{X}}|\hat{F}_n(x)-F(x)|\xrightarrow{p}0$.
The GC theorem is a uniform analog of the weak law of large numbers (WLLN). There is a corresponding version of the GC theorem that guarantees almost sure convergence, the uniform analog of the SLLN.
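As a quick sanity check, the sup-deviation $\Delta$ can be simulated directly. Below is a minimal sketch (the choice of $N(0,1)$ and the SciPy dependency are illustrative assumptions, not part of the theorem), using the fact that for a continuous $F$ the supremum is attained at the order statistics.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sup_deviation(n):
    """Exact sup_x |F_n(x) - F(x)| for n samples from N(0,1).

    For continuous F the supremum is attained at the order statistics:
    max_i max( i/n - F(x_(i)),  F(x_(i)) - (i-1)/n ).
    """
    x = np.sort(rng.standard_normal(n))
    F = norm.cdf(x)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - F), np.max(F - (i - 1) / n))

for n in (100, 1_000, 10_000):
    print(f"n = {n:>6d}, Delta = {sup_deviation(n):.4f}")
```

The printed deviations shrink toward zero as $n$ grows, in line with the theorem.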
Empirical Process Theory
More generally, such uniform bounds, or uniform laws, come in the following two closely related forms.
- $\Delta(\mathcal{A})\triangleq\sup\limits_{A\in\mathcal{A}}\left|\frac{1}{n}\sum\limits_{i=1}^n\mathbf{1}\{X_i\in A\}-\mathbb{P}(A)\right|$ for a collection of sets $\mathcal{A}$, which forms the basis for what is called Vapnik-Chervonenkis theory.
- $\Delta(\mathcal{F})\triangleq\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum\limits_{i=1}^nf(X_i)-\mathbb{E}[f(X)]\right|$ for a class of functions $\mathcal{F}$, which is the supremum of what is known as an empirical process.
When $|\mathcal{F}|$ (or $|\mathcal{A}|$) is finite, we can simply use the union bound to obtain uniform laws, as shown below. To extend to the more general setting where the class has infinitely many elements, we need some complexity measure of the function class. But when the class is too large, the uniform law may fail.
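For example, if $\mathcal{F}=\{f_1,\ldots,f_N\}$ is finite and each $f_j$ takes values in $[0,1]$ (an assumption made here so that Hoeffding's bound applies), the union bound gives $$\mathbb{P}(\Delta(\mathcal{F})\geq t)\leq\sum_{j=1}^N\mathbb{P}\left(\left|\frac{1}{n}\sum\limits_{i=1}^n f_j(X_i)-\mathbb{E}[f_j(X)]\right|\geq t\right)\leq 2N\exp(-2nt^2),$$ which still tends to zero for any fixed $t$, at the price of the factor $N$.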
Estimation of Statistical Functionals
Often we want to estimate some quantity that can be written as a simple function of the cdf. Such a quantity is called a statistical functional, i.e., a functional (a function of a function) of the cdf. It can be estimated by replacing the cdf with the empirical cdf, which yields a plug-in estimator. Examples include:
- Expectation/Moment functionals
- Quantile functionals
- Goodness-of-fit functionals
For a functional $\gamma$ satisfying the continuity condition that for every $\epsilon>0$ there exists a $\delta>0$ such that $\sup\limits_{x}|\hat{F}_n(x)-F(x)|\leq\delta$ implies $|\gamma(\hat{F}_n)-\gamma(F)|\leq\epsilon$, the GC theorem implies that $\gamma(\hat{F}_n)\xrightarrow{p}\gamma(F)$.
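Below is a minimal sketch of two plug-in estimators, assuming Exp(1) samples (an illustrative choice, made so the true values are available in closed form: the median is $\log 2$ and $\mathbb{E}[X^2]=2$).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=10_000)  # Exp(1): median = log 2, E[X^2] = 2

def plugin_quantile(samples, q):
    """Plug-in q-th quantile: invert the empirical cdf, i.e., return the
    smallest sample point x with F_n(x) >= q."""
    xs = np.sort(samples)
    k = int(np.ceil(q * len(xs))) - 1
    return xs[k]

def plugin_moment(samples, k):
    """Plug-in k-th moment: integrating x^k against the empirical cdf is
    just the sample average of X_i^k."""
    return np.mean(samples ** k)

print(plugin_quantile(x, 0.5), np.log(2))   # quantile functional
print(plugin_moment(x, 2), 2.0)             # moment functional
```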
Uniform Laws on Binary Classification
Empirical Risk Minimization
For binary classification, we want to choose a classifier from a class $\mathcal{F}$ of functions $f:\mathbb{R}^d\rightarrow\{-1,+1\}$. The error of $f$ on a training set $\{(X_1, y_1),\ldots,(X_n, y_n)\}$, drawn from some distribution of $(X,y)$ with binary labels $y_i$, is measured by the empirical risk: $$\hat{R}_n(f)=\frac{1}{n}\sum\limits_{i=1}^n\mathbf{1}\{f(X_i)\neq y_i\}.$$ The true risk of $f$ is given by $R(f)=\mathbb{P}(f(X)\neq y)$.
If $f$ is fixed, then by Hoeffding's bound, $$\mathbb{P}(|\hat{R}_n(f)-R(f)|\geq t)\leq 2 \exp(-2nt^2).$$ However, if the classifier is chosen based on the samples, $$\hat{f}=\mathop{\arg\min}_{f\in\mathcal{F}} \hat{R}_n(f),$$ then the above argument no longer works.
Let $f^{\ast}$ be the best classifier in $\mathcal{F}$, i.e., the one that minimizes the true risk. We want to bound the excess risk $$\begin{align} \Delta&=R(\hat{f}) - R(f^{\ast})\\ &=\underbrace{R(\hat{f})-\hat{R}_n(\hat{f})}_{T_1} + \underbrace{\hat{R}_n(\hat{f}) - \hat{R}_n(f^{\ast})}_{T_2} +\underbrace{\hat{R}_n(f^{\ast})-R(f^{\ast})}_{T_3}. \end{align}$$ Since $\hat{f}$ minimizes the empirical risk, we know that $T_2 \leq 0$. We know that $T_3$ is small just by the Hoeffding argument from before, since $f^{\ast}$ is a fixed classifier: setting $2\exp(-2nt^2)=\delta$ gives $T_3\leq\sqrt{\frac{\log(2/\delta)}{2n}}$ with probability at least $1-\delta$. The key point is that $\hat{f}$ is data-dependent, so its empirical risk is not an average of independent RVs. To bound $T_1$, we can use uniform laws: with high probability, $$T_1\leq \sup\limits_{f\in\mathcal{F}}\left(R(f)-\hat{R}_n(f)\right)\leq \Theta,$$ where $\Theta$ denotes a high-probability bound supplied by a uniform law. Then with high probability, we have $\Delta\leq \Theta + \sqrt{\frac{\log(2/\delta)}{2n}}$.
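The decomposition can be watched in action with a toy experiment. The sketch below is a minimal illustration; the Gaussian class-conditional model and the finite grid of threshold classifiers are assumptions chosen for convenience, not part of the theory.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 500
y = rng.choice([-1, 1], size=n)
X = y + rng.standard_normal(n)           # X | y ~ N(y, 1)

thresholds = np.linspace(-2, 2, 81)      # finite class: f_t(x) = sign(x - t)

def emp_risk(t):
    """Empirical risk of the threshold classifier f_t on the training set."""
    return np.mean(np.sign(X - t) != y)

def true_risk(t):
    """True risk P(f_t(X) != y) under the Gaussian mixture model."""
    return 0.5 * norm.cdf(t - 1) + 0.5 * (1 - norm.cdf(t + 1))

f_hat = thresholds[np.argmin([emp_risk(t) for t in thresholds])]    # ERM
f_star = thresholds[np.argmin([true_risk(t) for t in thresholds])]  # risk minimizer

print("excess risk  :", true_risk(f_hat) - true_risk(f_star))
print("optimism T_1 :", true_risk(f_hat) - emp_risk(f_hat))
```

The optimism term $T_1$ is typically positive: the empirical risk of $\hat{f}$ underestimates its true risk precisely because $\hat{f}$ was tuned to the data.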
VC Dimension
We say that $\mathcal{A}$ can shatter a finite set of $n$ points $\{x_1,\ldots,x_n\}$ if $|\{\{x_1,\ldots,x_n\}\cap A:A\in\mathcal{A}\}|=2^n$, i.e., the points can be arbitrarily labeled into two classes by the functions $\mathbf{1}_{A}(x)$ for $A\in\mathcal{A}$.
We define the shatter coefficient of $\mathcal{A}$ to be $$s(\mathcal{A},n)=\max\limits_{\{x_1,\ldots,x_n\}} |\{\{x_1,\ldots,x_n\}\cap A:A\in\mathcal{A}\}|,$$ which is upper bounded by $2^n$.
(VC Theorem) For any distribution $\mathbb{P}$ and class of sets $\mathcal{A}$ we have that $$\mathbb{P}(\Delta(\mathcal{A})\geq t)\leq 8 s(\mathcal{A},n)\exp(-nt^2/32),$$ where $n$ is the sample size.
The VC dimension $d$ is defined as the largest integer $n$ such that $s(\mathcal{A},n)=2^n$. That is, when $n>d$, $\mathcal{A}$ cannot shatter any collection of $n$ points, and hence $s(\mathcal{A},n)<2^n$. In fact, beyond the VC dimension the shatter coefficient grows only polynomially in $n$ rather than exponentially, as shown in the following lemma.
(Sauer's Lemma) If $\mathcal{A}$ has finite VC dimension $d$, then for $n>d$, we have $s(\mathcal{A},n)\leq (n+1)^d$.
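Plugging Sauer's lemma into the VC theorem makes the payoff explicit: for $n>d$, $$\mathbb{P}(\Delta(\mathcal{A})\geq t)\leq 8(n+1)^d\exp(-nt^2/32),$$ and setting the right-hand side to $\delta$ shows that, with probability at least $1-\delta$, $$\Delta(\mathcal{A})\leq\sqrt{\frac{32\left(d\log(n+1)+\log(8/\delta)\right)}{n}},$$ so the uniform deviation is of order $\sqrt{d\log n/n}$.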
Rademacher Complexity
Given a fixed collection of points $\{x_1,\ldots,x_n\}$, let $\epsilon=(\epsilon_1,\ldots,\epsilon_n)$ be a collection of $n$ Rademacher random variables, which take the values $+1$ and $-1$ with equal probability. Then the empirical Rademacher complexity is defined as $$R(x_1,\ldots,x_n)=\mathbb{E}_{\epsilon}\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum\limits_{i=1}^n\epsilon_i f(x_i)\right|\right].$$ If we further take the expectation over the samples $X_1,\ldots,X_n$, we obtain the Rademacher complexity $$R(\mathcal{F})=\mathbb{E}_{X,\epsilon}\left[\sup\limits_{f\in\mathcal{F}}\left|\frac{1}{n}\sum\limits_{i=1}^n\epsilon_i f(X_i)\right|\right]. $$ Unlike the VC dimension, the Rademacher complexity is not a worst-case measure of complexity: it depends on the underlying distribution of the samples.
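The expectation over the signs can be estimated by Monte Carlo. Here is a minimal sketch; the finite threshold class and the standard normal points are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def empirical_rademacher(fx, n_mc=2000):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    fx is an (m, n) array whose rows are the evaluation vectors
    (f(x_1), ..., f(x_n)) of the m functions in the class on the
    fixed points x_1, ..., x_n.
    """
    n = fx.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))   # Rademacher signs
    # for each draw of signs: sup over the class of |(1/n) sum_i eps_i f(x_i)|
    sups = np.max(np.abs(eps @ fx.T), axis=1) / n
    return sups.mean()

# illustrative class: threshold classifiers f_t(x) = sign(x - t)
x = np.sort(rng.standard_normal(50))
thresholds = np.linspace(-3, 3, 100)
fx = np.array([np.sign(x - t) for t in thresholds])
print(empirical_rademacher(fx))
```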
(Rademacher Theorem) $\mathbb{E}[\Delta(\mathcal{F})]\leq 2 R(\mathcal{F})$.
For a finite class $\mathcal{F}=\{f_1,\ldots,f_N\}$ of bounded functions, $\|f_i\|_{\infty}\leq b$, the Rademacher complexity satisfies $R(\mathcal{F})\leq 2b\sqrt{\frac{\log(2N)}{n}}$. One implication is that the Rademacher theorem can be used to recover a VC-type bound: the indicator functions $\mathbf{1}_A$ are bounded by $1$, and on any fixed sample they take at most $s(\mathcal{A},n)$ distinct values by Sauer's lemma, so the finite-class bound applies conditionally.
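A sketch of that argument: conditioning on the sample and applying the finite-class bound with $b=1$ and $N=s(\mathcal{A},n)$, Sauer's lemma gives $$R(\mathcal{F})\leq 2\sqrt{\frac{\log\left(2s(\mathcal{A},n)\right)}{n}}\leq 2\sqrt{\frac{d\log(n+1)+\log 2}{n}},$$ so by the Rademacher theorem $\mathbb{E}[\Delta(\mathcal{A})]$ is again of order $\sqrt{d\log n/n}$, matching the rate from the VC theorem.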