Hypothesis Testing


This post summarizes hypothesis testing, confidence intervals, the bootstrap, and Bayesian inference.

Hypothesis Testing Framework

Suppose we draw independent $X_1,\ldots,X_n\sim p(X;\theta)$, and we want to test whether the parameter $\theta$ lies in some set. For simplicity, we denote $X_1,\ldots,X_n$ by $X_{1:n}$. We formalize this by stating a null hypothesis $H_0$ and an alternative hypothesis $H_1$: $$H_0:\theta\in\Theta_0\text{ vs. }H_1:\theta\in\Theta_1,$$ where $\Theta_0\cap\Theta_1=\varnothing$. If $\Theta_0$ consists of a single point, we call $H_0$ a simple null hypothesis. If $\Theta_0$ consists of more than one point, we call it a composite null hypothesis. We define simple and composite alternative hypotheses analogously.

The outcomes can be:

| | Not reject $H_0$ | Reject $H_0$ |
| --- | --- | --- |
| $H_0$ true | | Type I error (false positive) |
| $H_1$ true | Type II error (false negative) | |

In general, our goal is to control the Type I error at some level $\alpha$ while maximizing the power, $1 - \mathbb{P}(\text{Type II error})$, when $H_1$ is true.

Hypothesis testing involves the following steps:

  1. Choose a test statistic $T_n= T(X_1, \ldots , X_n)$.
  2. Choose a rejection region $\mathcal{R} \subseteq \mathcal{X}^n$ defined in terms of $T_n$, e.g., $\mathcal{R}=\{x_{1:n}: T_n(x_{1:n}) > c\}$.
  3. If $(X_1, \ldots , X_n) \in \mathcal{R}$, we reject $H_0$; otherwise we retain $H_0$.

We need to choose $T_n$ and $\mathcal{R}$ so that the test has good statistical properties.

Evaluation of Tests

We define the power function by $$\beta(\theta)=\mathbb{P}_{\theta}\big((X_1,\ldots,X_n)\in\mathcal{R}\big).$$ We want $\beta(\theta)$ to be small when $\theta\in\Theta_0$ and large when $\theta\in\Theta_1$.

A test is size $\alpha$ if $\sup\limits_{\theta\in\Theta_0}\beta(\theta)=\alpha$ and is level $\alpha$ if $\sup\limits_{\theta\in\Theta_0}\beta(\theta)\leq\alpha$.

The $p$-value is the smallest $\alpha$ at which we would reject $H_0$. Hence, to test at level $\alpha$, we reject when $p \leq \alpha$.

Suppose we have a test of the form: reject when $T_n(X_{1:n}) > c$. Then the $p$-value is $$p=\sup\limits_{\theta\in\Theta_0}\mathbb{P}_{\theta}(T_n(X_{1:n})\geq T_n(x_{1:n})),$$ where $x_{1:n}$ are the observed data. Note that the $p$-value computed from the observed data is a fixed number; it becomes a random variable if (1) we view it as a function of the random sample $X_{1:n}$, or (2) we work in the Bayesian two-group model used for multiple testing.

Common Tests

The Neyman-Pearson Test

(Neyman-Pearson Lemma) For simple null and alternative hypotheses $$H_0:\theta=\theta_0\text{ vs. }H_1:\theta=\theta_1,$$ let $L(\theta;X_{1:n}) = p(X_{1:n}; \theta)$ and $T_n=\frac{L(\theta_1;X_{1:n})}{L(\theta_0;X_{1:n})}$. Consider the Neyman-Pearson (NP) test that rejects $H_0$ if $T_n > t_{\alpha}$, where $t_{\alpha}$ is chosen so that $\mathbb{P}_{\theta_0}(X_{1:n}\in\mathcal{R})=\alpha$. Then the NP test is a uniformly most powerful (UMP) level $\alpha$ test, i.e., if $\beta'$ is the power function of any other level $\alpha$ test, then $\beta(\theta)\geq\beta'(\theta)$ for all $\theta\in\Theta_1$.

Proof

Denote the NP test by $\phi_{NP}(X_{1:n})=\begin{cases}1 &,\ T_n>t_{\alpha}\\0 &,\ T_n\leq t_{\alpha}\end{cases}$ and any other level $\alpha$ test by $\phi_{A}$. Note that $$\int_{\mathcal{X}^n}[\phi_{NP}(x_{1:n})-\phi_{A}(x_{1:n})][L(\theta_1;x_{1:n})-t_{\alpha}L(\theta_0;x_{1:n})]\mathrm{d}x_{1:n} \geq 0,$$ since the two factors always share the same sign: when $L(\theta_1;x_{1:n})>t_{\alpha}L(\theta_0;x_{1:n})$ we have $\phi_{NP}=1\geq\phi_{A}$, and when $L(\theta_1;x_{1:n})\leq t_{\alpha}L(\theta_0;x_{1:n})$ we have $\phi_{NP}=0\leq\phi_{A}$. Then the difference of powers is given by $$\begin{align*} &\mathbb{E}_{\theta_1}[\phi_{NP}(X_{1:n})]- \mathbb{E}_{\theta_1}[\phi_{A}(X_{1:n})]\\ =& \int_{\mathcal{X}^n}\phi_{NP}(x_{1:n})L(\theta_1;x_{1:n})\mathrm{d}x_{1:n} - \int_{\mathcal{X}^n}\phi_{A}(x_{1:n})L(\theta_1;x_{1:n})\mathrm{d}x_{1:n} \\ \geq& t_{\alpha}\left[\underbrace{\int_{\mathcal{X}^n}\phi_{NP}(x_{1:n})L(\theta_0;x_{1:n})\mathrm{d}x_{1:n}}_{=\alpha} - \underbrace{\int_{\mathcal{X}^n}\phi_{A}(x_{1:n})L(\theta_0;x_{1:n})\mathrm{d}x_{1:n}}_{\leq\alpha}\right]\\ \geq & 0, \end{align*}$$ which completes the proof.
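
To make the NP test concrete, here is a minimal sketch for the simple-vs-simple Gaussian problem $H_0:\theta=0$ vs $H_1:\theta=1$ with $X_i\sim\mathcal{N}(\theta,1)$; the function name `np_test` and the simulated data are illustrative, not part of the lemma above.

```python
import numpy as np
from scipy import stats

# A minimal sketch of the Neyman-Pearson test for H0: theta = 0 vs
# H1: theta = 1 with X_i ~ N(theta, 1); assumes theta1 > theta0.
def np_test(x, theta0=0.0, theta1=1.0, alpha=0.05):
    n = len(x)
    # Log likelihood ratio log[L(theta1)/L(theta0)]; for unit-variance
    # Gaussians it is monotone increasing in the sample mean.
    log_T = np.sum(stats.norm.logpdf(x, loc=theta1) - stats.norm.logpdf(x, loc=theta0))
    # By monotonicity, rejecting for large T_n is equivalent to rejecting
    # when sqrt(n) * xbar exceeds z_{1-alpha}, which has exact size alpha under H0.
    z = np.sqrt(n) * np.mean(x)
    reject = z > stats.norm.ppf(1 - alpha)
    return log_T, reject

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=50)   # data drawn under H1
print(np_test(x))
```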

The Wald Test

To deal with a possibly composite alternative hypothesis, $$H_0:\theta=\theta_0\text{ vs. }H_1:\theta\neq\theta_0,$$ we can use the Wald test. Based on an asymptotically normal estimator $\hat{\theta}_n$ with standard error $\sigma_0$, we use the statistic $T_n=\frac{\hat{\theta}_n-\theta_0}{\sigma_0}$, or, if $\sigma_0$ is unknown, plug in an estimate to obtain $T_n=\frac{\hat{\theta}_n-\theta_0}{\hat{\sigma}_0}$. Under the null, $T_n\xrightarrow{d} \mathcal{N}(0, 1)$, so we reject the null if $|T_n| \geq z_{1-\alpha/2}$.

Properties:

  1. Asymptotic Type I error control: $\lim\limits_{n\rightarrow\infty}\mathbb{P}_{\theta_0}(|T_n|\geq z_{1-\alpha/2})=\alpha$.
  2. Power: against a fixed alternative $\theta_1$, $\beta(\theta_1)\approx \Phi(\Delta-z_{1-\alpha/2})+\Phi(-\Delta-z_{1-\alpha/2})$ where $\Delta=\sqrt{n\mathcal{I}_1(\theta_1)}(\theta_1-\theta_0)$.
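
As a quick illustration of the Wald test, below is a minimal sketch for a Bernoulli proportion with the plug-in standard error $\hat{\sigma}_0=\sqrt{\hat{\theta}_n(1-\hat{\theta}_n)/n}$; the function name `wald_test` and the simulated data are illustrative.

```python
import numpy as np
from scipy import stats

# A minimal sketch of the Wald test for H0: theta = theta0 vs H1: theta != theta0
# with X_i ~ Bernoulli(theta).
def wald_test(x, theta0, alpha=0.05):
    n = len(x)
    theta_hat = np.mean(x)                               # MLE of theta
    se_hat = np.sqrt(theta_hat * (1 - theta_hat) / n)    # plug-in standard error
    T = (theta_hat - theta0) / se_hat                    # ~ N(0, 1) under the null
    p_value = 2 * stats.norm.sf(abs(T))                  # two-sided p-value
    return T, p_value, p_value <= alpha

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.6, size=200)
print(wald_test(x, theta0=0.5))
```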

The Likelihood Ratio Test (LRT)

To test composite versus composite hypotheses $$H_0:\theta\in\Theta_0\text{ vs. }H_1:\theta\in\Theta_1,$$ we can use the (generalized) likelihood ratio test. Define the likelihood ratio to be $$\lambda(X_{1:n}) = \frac{\sup\limits_{\theta\in\Theta_0}L(\theta;X_{1:n})}{\sup\limits_{\theta\in\Theta}L(\theta;X_{1:n})},$$ where $\Theta=\Theta_0\cup\Theta_1$ is the full parameter space.

(Wilks' Theorem) If $\Theta_0\subseteq\Theta\subseteq\mathbb{R}^d$ and $\Theta_1=\Theta\setminus\Theta_0$, where $\dim(\Theta)=d$ and $\dim(\Theta_0)=k$, then under $H_0$ $$-2\log\lambda(X_{1:n}) \xrightarrow{d}\chi^2_{d-k}.$$ Let $T_n=-2\log\lambda(X_{1:n})$; then we reject $H_0$ if $T_n\geq \chi^2_{d-k,1-\alpha}$.
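
As an illustration, here is a minimal sketch of the LRT for a Poisson rate, $H_0:\lambda=\lambda_0$ vs $H_1:\lambda\neq\lambda_0$, which has $d-k=1$ degree of freedom; the function name `poisson_lrt` and the simulated data are illustrative.

```python
import numpy as np
from scipy import stats

# A minimal sketch of the generalized LRT for a Poisson rate, calibrated
# with the chi-square limit from Wilks' theorem (1 degree of freedom).
def poisson_lrt(x, lam0, alpha=0.05):
    n = len(x)
    lam_hat = np.mean(x)                     # unrestricted MLE
    # T_n = -2 log lambda(X) = 2 [ loglik(lam_hat) - loglik(lam0) ]
    T = 2 * (n * (lam0 - lam_hat) + np.sum(x) * np.log(lam_hat / lam0))
    p_value = stats.chi2.sf(T, df=1)
    return T, p_value, p_value <= alpha

rng = np.random.default_rng(0)
x = rng.poisson(lam=2.5, size=100)
print(poisson_lrt(x, lam0=2.0))
```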

The Permutation Test

The permutation test is a nonparametric method for testing whether two distributions are the same: $$H_0:F_X=F_Y\text{ vs. }H_1:F_X\neq F_Y.$$ Suppose we have samples $X_{1:m}$ and $Y_{1:n}$.

  1. Compute the observed value of the test statistic $T_0=T(X_{1:m},Y_{1:n})$.
  2. Randomly permute the pooled data and compute the statistic again, treating the first $m$ permuted observations as the $X$ sample and the remaining $n$ as the $Y$ sample.
  3. Repeat the previous step $B$ times and let $T_1, \ldots , T_B$ denote the resulting values.
  4. The approximate $p$-value is $\frac{1}{B}\sum\limits_{j=1}^B\mathbf{1}\{T_j\geq T_0\}$.
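
A minimal sketch of the permutation test above, using the absolute difference of sample means as the test statistic; the function name `permutation_test` and the simulated data are illustrative.

```python
import numpy as np

# A minimal sketch of the permutation test for H0: F_X = F_Y with
# T = |mean(X) - mean(Y)| as the test statistic.
def permutation_test(x, y, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    m = len(x)
    pooled = np.concatenate([x, y])
    T0 = abs(np.mean(x) - np.mean(y))          # observed statistic
    T = np.empty(B)
    for b in range(B):
        perm = rng.permutation(pooled)         # randomly relabel the pooled data
        T[b] = abs(np.mean(perm[:m]) - np.mean(perm[m:]))
    return np.mean(T >= T0)                    # approximate p-value

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.5, 1.0, size=40)
print(permutation_test(x, y))
```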

Universal Inference

The main drawback of the LRT is that in many cases it is very difficult to characterize the null distribution of the statistic. Universal inference (the split likelihood ratio test) sidesteps this via sample splitting and provides finite-sample Type I error control:

  1. Split the data into two disjoint parts $\mathcal{D}_0$ and $\mathcal{D}_1$, e.g., $\mathcal{D}_0$ containing the first half of the samples and $\mathcal{D}_1$ the rest.
  2. On $\mathcal{D}_0$, we compute the null MLE $$\hat{\theta}_0\in\mathop{\arg\max}\limits_{\theta\in\Theta_0}L(\theta;\mathcal{D}_0).$$
  3. On $\mathcal{D}_1$, we compute the alternative MLE $$\hat{\theta}_1\in\mathop{\arg\max}\limits_{\theta\in\Theta_1}L(\theta;\mathcal{D}_1).$$
  4. Compute the statistic $U_n=\frac{L(\hat{\theta}_0;\mathcal{D}_0)}{L(\hat{\theta}_1;\mathcal{D}_0)}$ and reject if $U_n\leq\alpha$.

The universal test controls the Type I error at level $\alpha$ in finite samples, i.e., $$\sup\limits_{\theta\in\Theta_0}\mathbb{P}_{\theta}(U_n\leq\alpha)\leq\alpha.$$

Proof

By Markov's inequality, we have $$\begin{align*} \mathbb{P}_{\theta}\left(U_n^{-1}\geq\frac{1}{\alpha}\right)&\leq \alpha\mathbb{E}_{\theta}\left[\frac{L(\hat{\theta}_1;\mathcal{D}_0)}{L(\hat{\theta}_0;\mathcal{D}_0)}\right]\\ &\leq \alpha \mathbb{E}_{\theta}\left[\frac{L(\hat{\theta}_1;\mathcal{D}_0)}{L(\theta;\mathcal{D}_0)}\right]\\ &\leq\alpha, \end{align*}$$ where the second inequality holds because $\hat{\theta}_0$ maximizes the likelihood over $\Theta_0\ni\theta$, so $L(\hat{\theta}_0;\mathcal{D}_0)\geq L(\theta;\mathcal{D}_0)$, and the last inequality comes from $$\begin{align*} \mathbb{E}_{\theta}\left[\frac{L(\hat{\theta}_1;\mathcal{D}_0)}{L(\theta;\mathcal{D}_0)}\right]&=\mathbb{E}_{\mathcal{D}_1}\left[\mathbb{E}_{\theta}\left[\frac{L(\hat{\theta}_1;\mathcal{D}_0)}{L(\theta;\mathcal{D}_0)}\mid\mathcal{D}_1\right]\right]\\ &=\mathbb{E}_{\mathcal{D}_1}\left[\int_{\mathcal{X}^{n_0}}\frac{p(x_{1:n_0};\hat{\theta}_1)}{p(x_{1:n_0};\theta)}p(x_{1:n_0};\theta)\mathrm{d}x_{1:n_0}\right]\\ &=\mathbb{E}_{\mathcal{D}_1}\left[\int_{\mathcal{X}^{n_0}}p(x_{1:n_0};\hat{\theta}_1)\mathrm{d}x_{1:n_0}\right]\\ &\leq 1, \end{align*}$$ where $n_0=|\mathcal{D}_0|$ and the inner expectation holds because $\hat{\theta}_1$ is fixed given $\mathcal{D}_1$.
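
A minimal sketch of the split LRT above for a Gaussian mean with known unit variance, testing $H_0:\theta=0$ vs $H_1:\theta\neq 0$; the function name `split_lrt`, the even split, and the simulated data are illustrative choices.

```python
import numpy as np
from scipy import stats

# A minimal sketch of universal inference (split LRT) for H0: theta = 0 vs
# H1: theta != 0 with X_i ~ N(theta, 1). The null MLE is theta_hat0 = 0;
# the alternative MLE is fit on D1 and its likelihood is evaluated on D0.
def split_lrt(x, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.permutation(x)                  # random split into D0 and D1
    d0, d1 = np.array_split(x, 2)
    theta_hat0 = 0.0                        # MLE over Theta_0 = {0}
    theta_hat1 = np.mean(d1)                # MLE over the alternative, fit on D1
    # log U_n = log L(theta_hat0; D0) - log L(theta_hat1; D0)
    log_U = np.sum(stats.norm.logpdf(d0, loc=theta_hat0)
                   - stats.norm.logpdf(d0, loc=theta_hat1))
    return log_U, log_U <= np.log(alpha)    # reject H0 if U_n <= alpha

rng = np.random.default_rng(2)
x = rng.normal(0.4, 1.0, size=200)
print(split_lrt(x))
```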

Confidence Sets/Intervals

Suppose that $X_{1:n}\sim p(X;\theta)$ where $\theta\in\Theta$. A $1-\alpha$ honest confidence set / confidence interval (CI) $C_{\alpha}(X_{1:n})$ is a data-dependent set satisfying $$\inf\limits_{\theta\in\Theta}\mathbb{P}_{\theta}(\theta\in C_{\alpha}(X_{1:n}))\geq 1-\alpha.$$

Duality between CIs and Tests

On the one hand, we can invert a family of simple tests $$H_0:\theta=\theta_0\text{ vs. }H_1:\theta\neq\theta_0,$$ one for each $\theta_0$, to obtain a CI $C(X_{1:n})=\{\theta_0: X_{1:n}\not\in \mathcal{R}(\theta_0)\}$, where $\mathcal{R}(\theta_0)$ is the rejection region of the level $\alpha$ test of $H_0:\theta=\theta_0$. Coverage follows because for each $\theta_0$, $$\mathbb{P}_{\theta_0}(\theta_0\not\in C(X_{1:n}))=\mathbb{P}_{\theta_0}(X_{1:n}\in\mathcal{R}(\theta_0))\leq\alpha.$$

On the other hand, we can invert a CI $C(X_{1:n})$ into a test of the composite hypotheses $$H_0:\theta\in\Theta_0\text{ vs. }H_1:\theta\in\Theta_1,$$ rejecting the null if $C(X_{1:n})\cap\Theta_0=\varnothing$. This controls the Type I error since $$\sup\limits_{\theta\in\Theta_0}\mathbb{P}_{\theta}(C(X_{1:n})\cap\Theta_0=\varnothing)\leq\sup\limits_{\theta\in\Theta_0}\mathbb{P}_{\theta}(\theta\not\in C(X_{1:n}))\leq\alpha.$$
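
As a concrete example of the first direction, inverting the Wald test of $H_0:\theta=\theta_0$ (reject when $|\hat{\theta}_n-\theta_0|/\hat{\sigma}_0\geq z_{1-\alpha/2}$) yields the familiar asymptotic CI $$C(X_{1:n})=\left[\hat{\theta}_n- z_{1-\alpha/2}\hat{\sigma}_0,\ \hat{\theta}_n+ z_{1-\alpha/2}\hat{\sigma}_0\right].$$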

Exact Finite-Sample CIs

We can use tail bounds to derive confidence intervals, but they are generally undesirable for the following reasons:

  1. We do not always have tail bounds for estimators of interest.
  2. There are usually imprecisely known constants in tail bounds.
  3. Most importantly, they are very conservative.

Pivots

A pivot is a function $Q(X_{1:n}, \theta)$ of the data and the unknown parameter whose distribution does not depend on $\theta$. If we can find a pivot, we can construct a CI from it:

  1. Find $a,b$ such that $\mathbb{P}(a\leq Q(X_{1:n}, \theta)\leq b)\geq 1-\alpha$.
  2. Construct the CI as $C(X_{1:n})=\{\theta: a\leq Q(X_{1:n}, \theta)\leq b\}$.
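
As a classical example, for $X_i\sim\mathcal{N}(\mu,\sigma^2)$ the studentized mean $Q=\sqrt{n}(\bar{X}_n-\mu)/S_n\sim t_{n-1}$ is a pivot. A minimal sketch of the resulting CI is below; the function name `t_interval` and the simulated data are illustrative.

```python
import numpy as np
from scipy import stats

# A minimal sketch of a pivot-based CI: Q = sqrt(n) (Xbar - mu) / S has a
# t_{n-1} distribution not depending on (mu, sigma), so it is a pivot.
def t_interval(x, alpha=0.05):
    n = len(x)
    xbar, s = np.mean(x), np.std(x, ddof=1)
    a = stats.t.ppf(alpha / 2, df=n - 1)        # lower quantile of the pivot
    b = stats.t.ppf(1 - alpha / 2, df=n - 1)    # upper quantile of the pivot
    # Invert a <= Q <= b into a set of mu values.
    return xbar - b * s / np.sqrt(n), xbar - a * s / np.sqrt(n)

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=50)
print(t_interval(x))
```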

Bootstrap

Suppose that $X_{1:n}\sim \mathcal{P}$ and we have an estimator $\hat{\theta}_n$ of some quantity $\theta$. We want to estimate some functional of the sampling distribution of $\hat{\theta}_n$ under $\mathcal{P}$, for example the mean $\mathbb{E}_{\mathcal{P}}[\hat{\theta}_n]$ or the variance $\text{Var}_{\mathcal{P}}[\hat{\theta}_n]$. If we could draw fresh samples from $\mathcal{P}$, a Monte Carlo estimator would suffice. The idea of the bootstrap is to replace $\mathcal{P}$ by the empirical distribution $\mathcal{P}_n$ and resample from it to generate copies of the estimator.

Bootstrap mean estimator:

  1. Draw a bootstrap sample $X_{b,1:n}^{\ast}\sim\mathcal{P}_n$ and compute $\hat{\theta}_{n,b}^{\ast}$ for $b=1,\ldots,B$.
  2. Compute the estimator $\hat{\overline{\theta}}_n=\frac{1}{B}\sum\limits_{b=1}^B\hat{\theta}_{n,b}^{\ast}$.

Bootstrap variance estimator:

  1. Draw a bootstrap sample $X_{b,1:n}^{\ast}\sim\mathcal{P}_n$ and compute $\hat{\theta}_{n,b}^{\ast}$ for $b=1,\ldots,B$.
  2. Compute the estimator $\hat{s}_n^2=\frac{1}{B}\sum\limits_{b=1}^B(\hat{\theta}_{n,b}^{\ast}- \hat{\overline{\theta}}_n)^2$.

The bootstrap samples can also be used to obtain confidence intervals, which in turn can be used for hypothesis testing. If we know $\sqrt{n}(\hat{\theta}_n-\theta)\xrightarrow{d}G$, then a valid asymptotic CI is given by $[\hat{\theta}_n-\frac{g_{1-\alpha/2}}{\sqrt{n}}, \hat{\theta}_n-\frac{g_{\alpha/2}}{\sqrt{n}}]$, where $g_{\alpha/2}$ and $g_{1-\alpha/2}$ are the $\alpha/2$ and $1-\alpha/2$ quantiles of $G$. Since $G$ is unknown, we approximate it by the bootstrap distribution $\hat{G}(t)=\frac{1}{B}\sum\limits_{b=1}^B\mathbf{1}\{\sqrt{n}(\hat{\theta}_{n,b}^{\ast}-\hat{\theta}_n)\leq t\}$. The bootstrap CI is then $[\hat{\theta}_n-\frac{\hat{g}_{1-\alpha/2}}{\sqrt{n}}, \hat{\theta}_n-\frac{\hat{g}_{\alpha/2}}{\sqrt{n}}]$, where $\hat{g}_{q}$ is the $q$ quantile of $\hat{G}$.
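
A minimal sketch of the bootstrap variance estimate and the CI above, taking the sample median as the statistic; the function name `bootstrap` and the simulated data are illustrative.

```python
import numpy as np

# A minimal sketch of the bootstrap: resample from the empirical
# distribution P_n, recompute the statistic, and use the replicates for
# a variance estimate and a CI of the form given above.
def bootstrap(x, stat=np.median, B=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    theta_hat = stat(x)
    boot = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    var_hat = np.var(boot)                       # bootstrap variance estimate
    # Approximate the quantiles of G by those of sqrt(n) (theta*_b - theta_hat).
    g_lo, g_hi = np.quantile(np.sqrt(n) * (boot - theta_hat), [alpha / 2, 1 - alpha / 2])
    ci = (theta_hat - g_hi / np.sqrt(n), theta_hat - g_lo / np.sqrt(n))
    return var_hat, ci

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)
print(bootstrap(x))
```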

The validity of the bootstrap is justified by the bootstrap theorem under some conditions, but the bootstrap can fail. For example, for the MLE of Uniform$(0,\theta)$ (the sample maximum), the bootstrap distribution does not approximate the true sampling distribution, so the resulting estimators and CIs are unreliable.

Bayesian Inference

A comparison of Bayesian and frequentist inference from Larry Wasserman:

| | Bayesian | Frequentist |
| --- | --- | --- |
| Probability | subjective degree of belief | limiting frequency |
| Goal | analyze beliefs | create procedures with frequency guarantees |
| $\theta$ | random variable | fixed |
| $X$ | random variable | random variable |
| Use Bayes' theorem? | Yes. To update beliefs. | Yes, if it leads to a procedure with good frequentist behavior. Otherwise no. |

A $1-\alpha$ credible set $C$, a Bayesian analogue of a confidence set, is a set with posterior probability at least $1-\alpha$, i.e., $\mathbb{P}_{\theta\mid X_{1:n}}(\theta\in C)\geq 1-\alpha$; here $\theta$ is random and $C$ is fixed given the data. One can evaluate credible sets from different viewpoints: the purely (logical) Bayesian one, or via frequentist properties such as consistency, convergence rates, and coverage.
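
As a simple example of a credible set, here is a minimal sketch of an equal-tailed credible interval for a Bernoulli parameter with a conjugate Beta prior; the function name and prior choice are illustrative.

```python
import numpy as np
from scipy import stats

# A minimal sketch of a Bayesian credible interval: Bernoulli data with a
# Beta(a, b) prior give a Beta(a + successes, b + failures) posterior, and
# the equal-tailed 1 - alpha credible interval comes from posterior quantiles.
def beta_bernoulli_credible_interval(x, a=1.0, b=1.0, alpha=0.05):
    s, n = np.sum(x), len(x)
    posterior = stats.beta(a + s, b + n - s)
    return posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100)
print(beta_bernoulli_credible_interval(x))
```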

(Bernstein-von Mises Theorem) When $d$ is fixed, under the assumption that the prior is continuous and (strictly) positive in a neighborhood around the true parameter $\theta^*$, the posterior is close to a Gaussian distribution as $n\rightarrow\infty$ in terms of total variation distance $$\|\pi(\theta\mid X_{1:n}) - \mathcal{N}(\hat{\theta},[n\mathcal{I}_1(\hat{\theta})]^{-1})\|_{TV}\rightarrow 0$$ where $\hat{\theta}$ is the MLE.