Statistical Model And Statistics
This post summarizes statistical models, statistics, and their properties.
Statistical Models
In general, statistical inference is based on independent and identically distributed (iid) samples $X_1,\ldots,X_n\sim P$, where $P$ belongs to a collection of distributions $\mathcal{P}$; the collection $\mathcal{P}$ is called a statistical model.
Typically, people are interested in the following two classes of statistical models:
- Parametric models: the model is characterized by a finite set of parameters, $\mathcal{P}=\{p(x;\theta):\theta\in\Theta\}$, where $\Theta\subseteq\mathbb{R}^d$ and $d<\infty$ (see the example after this list).
- Non-parametric models: the model cannot be characterized by a finite set of parameters. For example,
- The set of all possible distributions on $\mathbb{R}$.
- The set of distributions with smooth (e.g., with square integrable second derivative) densities.
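For concreteness, a standard parametric example is the Gaussian family $$\mathcal{P}=\{N(\mu,\sigma^2):(\mu,\sigma^2)\in\mathbb{R}\times(0,\infty)\},$$ which is indexed by $\theta=(\mu,\sigma^2)\in\Theta\subseteq\mathbb{R}^2$, i.e. $d=2$; in contrast, the two families above cannot be indexed by finitely many real parameters.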
We will focus on estimation and inference for parametric models. (We ignore the underlying probability space, which makes our life easier but may be less rigorous.)
Statistics
A statistic $T=T(X_1,\ldots,X_n)$ is a measurable function of the samples; it maps one experiment onto another experiment with the same parameter space. Such maps can lose information about the original experiment. Hence we are interested in statistics that have the property of sufficiency.
Sufficiency
A statistic $T(X_1,\ldots,X_n)$ is said to be sufficient for the parameter $\theta$ if the conditional distribution $p(X_1,\ldots,X_n\mid T(X_1,\ldots,X_n)=t;\theta)$ does not depend on $\theta$ for any value of $t$. Sufficiency is beneficial for both data reduction and risk reduction.
(Fisher-Neyman Factorization Theorem) A statistic $T(X)$ is sufficient if and only if there exist functions $g(t,\theta)$ and $h(x)$ such that $$p_{\theta}(x)=g(T(x),\theta)h(x),\ \forall\ x\in\mathcal{X},\ \theta\in\Theta.$$
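As a quick illustration of the factorization theorem, take iid $X_1,\ldots,X_n\sim\text{Bernoulli}(\theta)$. Then $$p_{\theta}(x_1,\ldots,x_n)=\prod_{j=1}^n\theta^{x_j}(1-\theta)^{1-x_j}=\underbrace{\theta^{t}(1-\theta)^{n-t}}_{g(t,\theta)}\cdot\underbrace{1}_{h(x)},\quad t=\sum_{j=1}^nx_j,$$ so $T(X_1,\ldots,X_n)=\sum_{j=1}^nX_j$ is sufficient for $\theta$.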
Partition viewpoint: Partitioned on $T$, $p_{\theta}(x_1,\ldots,x_n\mid T=t)$ does not depend on $\theta$.
Risk reduction viewpoint:
(Rao-Blackwell Theorem) Let $X_1,\ldots,X_n\sim p(X;\theta)$ and let $\hat{\theta}=\hat{\theta}(X_1,\ldots,X_n)$ be an estimate of $\theta$. Define the risk of $\hat{\theta}$ to be $R(\hat{\theta},\theta)=\mathbb{E}[(\hat{\theta}-\theta)^2]$. If $T$ is a sufficient statistic, then $R(\tilde{\theta},\theta)\leq R(\hat{\theta},\theta)$ where $\tilde{\theta}=\mathbb{E}[\hat{\theta}\mid T]$.
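Here is a small simulation sketch of the risk-reduction claim; the Bernoulli setup and variable names below are my own illustrative choices. For iid Bernoulli($\theta$) samples, the crude unbiased estimator $\hat{\theta}=X_1$ is Rao-Blackwellized by conditioning on the sufficient statistic $T=\sum_jX_j$, giving $\tilde{\theta}=\mathbb{E}[X_1\mid T]=\bar{X}$.

```python
import numpy as np

# Illustrative sketch of Rao-Blackwellization for Bernoulli(theta) samples.
rng = np.random.default_rng(0)
theta, n, reps = 0.3, 10, 100_000

X = rng.binomial(1, theta, size=(reps, n))

theta_hat = X[:, 0].astype(float)   # crude unbiased estimator: first observation
theta_tilde = X.mean(axis=1)        # E[X_1 | sum(X)] = sample mean

print("risk of theta_hat  :", np.mean((theta_hat - theta) ** 2))    # ~ theta(1-theta) = 0.21
print("risk of theta_tilde:", np.mean((theta_tilde - theta) ** 2))  # ~ theta(1-theta)/n = 0.021
```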
Minimal Sufficiency
We say $T$ is minimal sufficient if for any sufficient statistic $S$, $T(x_1,\ldots,x_n)=g(S(x_1,\ldots,x_n))$ for some function $g$, i.e. $T$ is a function of $S$.
A minimal sufficient statistic (MSS) can be shown to exist under weak assumptions, but exceptions are possible. If we define continuous random variables on $\mathbb{R}^n$ to be those that are absolutely continuous with respect to the Lebesgue measure, then every family of such continuous random variables admits an MSS.
Partition viewpoint: $T$ is minimal sufficient if for any sufficient statistic $S$, $S(x_1,\ldots,x_n)=S(y_1,\ldots,y_n)$ implies $T(x_1,\ldots,x_n)=T(y_1,\ldots,y_n)$. Although minimal sufficient statistics are not unique, they induce a unique partition of the data.
(Lehmann & Scheffe Theorem) Let $\Theta_x=\{\theta:p_{\theta}(x)>0\}$. Assume $\Theta_x\neq\varnothing$ for every $x\in\mathcal{X}$. Suppose that there exists a statistic $T$ such that for every two samples $x$ and $y$ we have (1) $\Theta_x=\Theta_y$ and $R_{x,y}=\frac{p(x_1,\ldots,x_n;\theta)}{p(y_1,\ldots,y_n;\theta)}$ is constant as a function of $\theta\in\Theta_x$, (2) $T(x)=T(y)$, are equivalent. If this holds, then $T$ is a minimal sufficient statistic.
Ancillary
A statistic $V$ is called ancillary for $\theta$ if its distribution is free of $\theta$.
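For example, if $X_1,\ldots,X_n\sim N(\theta,1)$, then the sample range $X_{(n)}-X_{(1)}$ and the sample variance $S^2$ are ancillary for $\theta$: shifting the location $\theta$ does not change their distributions.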
Completeness
A statistic $T$ is complete if whenever $\mathbb{E}_{\theta} [g(T)]=0$ for all $\theta\in\Theta$ for some function $g$ not depending on $\theta$, we have $\mathbb{P}_{\theta} (g(T)=0)=1$ for all $\theta\in\Theta$.
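As a worked example, let $T=\sum_{j=1}^nX_j\sim\text{Binomial}(n,\theta)$ with $\theta\in(0,1)$. If $\mathbb{E}_{\theta}[g(T)]=\sum_{t=0}^ng(t)\binom{n}{t}\theta^t(1-\theta)^{n-t}=0$ for all $\theta$, then dividing by $(1-\theta)^n$ and writing $r=\theta/(1-\theta)$ gives a polynomial $\sum_{t=0}^ng(t)\binom{n}{t}r^t$ that vanishes for all $r>0$, so every coefficient is zero and $g(t)=0$ for $t=0,\ldots,n$. Hence $T$ is complete.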
Relationship among Properties
(Basu's Theorem) If $T$ is a complete and sufficient statistic, then any ancillary statistic $V$ is independent of $T$.
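A classical use of Basu's theorem: for $X_1,\ldots,X_n\sim N(\mu,\sigma^2)$ with $\sigma^2$ known, $\bar{X}$ is complete and sufficient for $\mu$ while the sample variance $S^2$ is ancillary for $\mu$, so $\bar{X}$ and $S^2$ are independent.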
If a minimal sufficient statistic exists, then any complete sufficient statistic $T$ must also be minimal sufficient.
Exponential Family
(Completeness and Minimal Sufficiency of Exponential Family) Let $X_1,\ldots,X_n$ be iid samples from a distribution with pmf/pdf $$p_{\theta}(x)=h(x)\exp\left\{\sum\limits_{i=1}^k\eta_i(\theta)t_i(x)-A(\theta)\right\}.$$ Then $T(X_1,\ldots,X_n)=\left(\sum\limits_{j=1}^nt_1(X_j),\ldots,\sum\limits_{j=1}^nt_k(X_j)\right)$ is a sufficient statistic for $\theta$. Furthermore, if the exponential family is full-rank, then $T(X_1,\ldots,X_n)$ is minimal sufficient and complete.
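For example, the $N(\mu,\sigma^2)$ density can be written in this form: $$p_{\theta}(x)=\frac{1}{\sqrt{2\pi}}\exp\left\{\frac{\mu}{\sigma^2}x-\frac{1}{2\sigma^2}x^2-\left(\frac{\mu^2}{2\sigma^2}+\frac{1}{2}\log\sigma^2\right)\right\},$$ with $\eta_1(\theta)=\mu/\sigma^2$, $\eta_2(\theta)=-1/(2\sigma^2)$, $t_1(x)=x$, $t_2(x)=x^2$; the family is full-rank, so $T=\left(\sum_jX_j,\sum_jX_j^2\right)$ is minimal sufficient and complete.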
Non-minimal (or over-complete) exponential families are those for which $\sum\limits_{i=1}^ka_it_i(x)=c$ for some coefficient vector $a\in\mathbb{R}^k\setminus\{0\}$, some constant $c\in\mathbb{R}$, and all $x\in\mathcal{X}$; the natural parameters of such a family are not statistically identifiable. If there are relationships among the $\theta_i$'s (e.g., $\theta_2=\theta_1^2$), then the exponential family is called curved. So in the above theorem, we require that the family be full-rank so that the sufficient statistic is minimal.
Sometimes we will use the canonical parametrization $$p_{\theta}(x)=h(x)\exp\left\{\sum\limits_{i=1}^k\theta_it_i(x)-A(\theta)\right\}.$$ The term $A(\theta)$ is the log-normalizing factor. The set of $\theta$'s for which $A(\theta)<\infty$ constitutes the natural parameter space.
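For instance, the Bernoulli($p$) pmf in canonical form is $$p_{\theta}(x)=\exp\{\theta x-\log(1+e^{\theta})\},\quad x\in\{0,1\},\ \theta=\log\frac{p}{1-p},$$ so $A(\theta)=\log(1+e^{\theta})$, which is finite for every $\theta\in\mathbb{R}$; the natural parameter space is all of $\mathbb{R}$.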
Log-partition generates moments
$A(\theta)$ is known as the cumulant function:
$$\begin{align} \frac{\partial A(\theta)}{\partial \theta_i} &= \mathbb{E}_{\theta}[t_i(X)],\\ \frac{\partial^2 A(\theta)}{\partial \theta_i \partial \theta_j} &= \text{Cov}_{\theta}(t_i(X), t_j(X)). \end{align}$$
Also, $A$ is a convex function of $\theta$ and thus the log-likelihood in an exponential family is a concave function of $\theta$.
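A quick check with the Poisson family: with $\theta=\log\lambda$, $$p_{\theta}(x)=\frac{1}{x!}\exp\{\theta x-e^{\theta}\},\quad x\in\{0,1,2,\ldots\},$$ so $A(\theta)=e^{\theta}$ and $A'(\theta)=A''(\theta)=e^{\theta}=\lambda$, recovering $\mathbb{E}[X]=\mathrm{Var}(X)=\lambda$; moreover $A''(\theta)>0$, consistent with convexity.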