Point Estimation

This post summarizes point estimation methods and some important related quantities.

Point Estimation

Given $X_1,\ldots,X_n\sim p(X;\theta)\in\mathcal{P}_{\theta}$, we want to construct $\hat{\theta}(X_1,\ldots,X_n)$ that is close to $\theta$. In particular, we are interested in the following three problems:

  • Come up with $\hat{\theta}$'s in a general, principled way.
  • Compare point estimators.
  • Define and find "optimal" estimators.

Usually we use "estimator" to refer to a random variable (a statistic, a function of the sample) and "estimate" to refer to its realized value.

Method of Moments

The method of moments can be illustrated via the following example, where we want to estimate the mean $\mu$ and variance $\sigma^2$ of $X$. We first have the following equations: $$\begin{align} \mathbb{E}[X]&=\mu\\ \mathbb{E}[X^2]&=\mu^2+\sigma^2. \end{align} $$ Replacing the left-hand sides by the corresponding sample moments, we have $$\begin{align} \frac{1}{n}\sum\limits_{i=1}^nX_i&=\mu\\ \frac{1}{n}\sum\limits_{i=1}^nX_i^2&=\mu^2+\sigma^2. \end{align} $$ This is a system of two equations in two unknowns, so in general we can solve it to obtain $\hat{\mu}=\frac{1}{n}\sum_{i=1}^nX_i$ and $\hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^nX_i^2-\hat{\mu}^2$. This approach extends to other parameters as long as they can be recovered by solving such a system of moment equations.
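For concreteness, here is a minimal sketch of these two moment equations in code, assuming simulated Gaussian data (the distribution, sample size, and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)  # simulated data with mu = 2, sigma^2 = 9

# Sample moments stand in for E[X] and E[X^2].
m1 = np.mean(x)       # estimates mu
m2 = np.mean(x**2)    # estimates mu^2 + sigma^2

mu_hat = m1
sigma2_hat = m2 - m1**2   # solve the two-equation system
print(mu_hat, sigma2_hat)  # should be close to 2 and 9
```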

Maximum Likelihood Estimation

The maximum likelihood estimator (MLE) is given by $$\hat{\theta}=\mathop{\arg\max}_{\theta\in\Theta}L(\theta;X_1,\ldots,X_n),$$ where $L(\theta;X_1,\ldots,X_n)=\prod_{i=1}^np(X_i;\theta)$ is the likelihood function.

If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ for any function $g$ (the invariance property of the MLE).
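As a small sketch (not from the original post), the exponential model with rate $\lambda$ has log-likelihood $n\log\lambda-\lambda\sum_i X_i$, closed-form MLE $\hat{\lambda}=1/\bar{X}_n$, and, by invariance, MLE $1/\hat{\lambda}=\bar{X}_n$ for the mean $g(\lambda)=1/\lambda$; the grid search below is only a crude stand-in for a proper optimizer:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=2_000)   # Exponential data with rate lambda = 0.5

# Log-likelihood of Exponential(rate=lam): n*log(lam) - lam*sum(x), evaluated on a grid.
grid = np.linspace(0.01, 5.0, 50_000)
log_lik = len(x) * np.log(grid) - grid * np.sum(x)
lam_hat_grid = grid[np.argmax(log_lik)]      # numerical MLE

lam_hat = 1.0 / np.mean(x)                   # closed-form MLE of the rate
mean_hat = 1.0 / lam_hat                     # by invariance, MLE of the mean g(lambda) = 1/lambda

print(lam_hat_grid, lam_hat, mean_hat)       # grid and closed form should agree closely
```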

Bayes Estimators

Suppose the parameter is itself random with prior $\theta\sim \pi(\theta)$. Then the Bayes estimator (under squared error loss) is the posterior mean $$\hat{\theta}=\int \theta\, p(\theta\mid X_1,\ldots,X_n)\,\mathrm{d}\theta,$$ where the posterior satisfies $p(\theta\mid X_1,\ldots,X_n)\propto p( X_1,\ldots,X_n\mid \theta)\pi(\theta)$.
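A minimal sketch with a conjugate pair, assuming Bernoulli data and a Beta prior (both are my choices for illustration), where the posterior is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 0.3
x = rng.binomial(1, theta_true, size=100)    # Bernoulli(theta) sample

# A Beta(a0, b0) prior is conjugate to the Bernoulli likelihood,
# so the posterior is Beta(a0 + #successes, b0 + #failures).
a0, b0 = 1.0, 1.0                            # uniform prior, chosen for the sketch
a_post = a0 + x.sum()
b_post = b0 + len(x) - x.sum()

theta_bayes = a_post / (a_post + b_post)     # posterior mean = Bayes estimator under squared loss
print(theta_bayes)
```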

Comparing Estimators

Unbiasedness

One desirable (but not necessary) property of a good estimator is unbiasedness. An estimator $\hat{\theta}_n$ is unbiased if $\mathbb{E}[\hat{\theta}_n]=\theta$ for all $\theta\in\Theta$.

Consistency

An estimator $\hat{\theta}_n$ is consistent if $\hat{\theta}_n\xrightarrow{p}\theta$, i.e., it converges to $\theta$ in probability as $n\to\infty$.

MSE

In most cases, we use the mean squared error (MSE) to evaluate how good an estimator is:

$$R(\hat{\theta},\theta)=\mathbb{E}[(\hat{\theta}-\theta)^2]=\mathbb{E}[(\hat{\theta}-\mathbb{E}[\hat{\theta}])^2]+(\mathbb{E}[\hat{\theta}]-\theta)^2,$$ which yields a variance-bias$^2$ decomposition. The bias term is zero for any unbiased estimator, but in general we need to balance the two terms so that their sum is minimized. One strategy is to find the minimum variance unbiased estimator (as shown below); the other is to accept a biased estimator with a possibly smaller variance.
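To see the trade-off concretely, here is a Monte Carlo sketch (assuming Gaussian data; the sample size and number of replications are arbitrary) comparing the unbiased variance estimator with divisor $n-1$ against the biased MLE with divisor $n$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2, reps = 10, 4.0, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)   # sum of squared deviations

var_unbiased = ss / (n - 1)   # unbiased, but larger variance
var_mle = ss / n              # biased downward, smaller variance

for est in (var_unbiased, var_mle):
    bias2 = (est.mean() - sigma2) ** 2
    var = est.var()
    print(f"bias^2={bias2:.4f}  var={var:.4f}  MSE={bias2 + var:.4f}")
# For Gaussian data, the biased 1/n estimator typically ends up with the smaller MSE.
```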

Before presenting an important result on the lower bound for the variance of unbiased estimators, we introduce two key quantities.

  • The score function $s(\theta)=\nabla_{\theta}\log L(\theta;X_1,\ldots,X_n)=\sum\limits_{i=1}^n\nabla_{\theta}\log p(X_i;\theta)\in\mathbb{R}^d$. The score function has zero mean: $\mathbb{E}[s(\theta)]=0$.
  • The Fisher information $I(\theta)=\mathbb{E}[s(\theta)s(\theta)^{\top}]=\text{Var}(s(\theta))\in\mathbb{R}^{d\times d}$. If $\log p(x;\theta)$ is twice-differentiable, then $I(\theta)=nI_1(\theta)=-n\mathbb{E}[\nabla^2_{\theta}\log p(X;\theta)]$. These identities, together with the zero-mean property of the score, are checked numerically in the sketch below.
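A quick numerical check of these facts for the Bernoulli family, where $I_1(p)=1/(p(1-p))$ (the family and the value of $p$ are my choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)

# Per-observation score of a Bernoulli(p): d/dp log p(x;p) = x/p - (1-x)/(1-p).
score = x / p - (1 - x) / (1 - p)
print(score.mean())                                  # ~0: the score has zero mean

# Fisher information per observation, three (equivalent) ways:
I_var = score.var()                                  # variance of the score
I_hess = (x / p**2 + (1 - x) / (1 - p)**2).mean()    # -E[second derivative of log p]
I_exact = 1.0 / (p * (1 - p))                        # analytic value for Bernoulli
print(I_var, I_hess, I_exact)
```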

Then the following theorem says that the variance of any unbiased estimator is lower bounded in terms of the Fisher information.

(Cramér-Rao Lower Bound) Let $\hat{\theta}_n$ be an unbiased estimator of a scalar parameter $\theta$. Then, under regularity conditions, $\text{Var}(\hat{\theta}_n)\geq \frac{1}{n I_1(\theta)}$.

Considering the previous variance-bias$^2$ decomposition, if we can find an unbiased estimator whose variance matches the Cramér-Rao lower bound, then it is a uniformly minimum variance unbiased estimator (UMVUE). The converse is not true, since the variance of a UMVUE may exceed the Cramér-Rao lower bound.
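For instance, in the Bernoulli$(p)$ model (a standard example, not from the original post), $$I_1(p)=\frac{1}{p(1-p)},\qquad \text{Var}(\bar{X}_n)=\frac{p(1-p)}{n}=\frac{1}{nI_1(p)},$$ so the sample mean attains the bound and is therefore a UMVUE of $p$.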

If the MSE converges to zero, then the estimator is consistent, because convergence in $L^2$ implies convergence in probability. The converse is not true in general: for example, an estimator that equals $\theta$ with probability $1-1/n$ and equals $n$ with probability $1/n$ is consistent, yet its MSE equals $(n-\theta)^2/n$, which diverges.
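A quick Monte Carlo check of this counterexample (the parameter value and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, reps = 1.0, 100_000

for n in (10, 100, 1_000):
    # Estimator equal to theta with prob 1 - 1/n and to n with prob 1/n:
    # consistent (P(|error| > eps) = 1/n -> 0), yet its MSE = (n - theta)^2 / n diverges.
    bad = rng.random(reps) < 1.0 / n
    est = np.where(bad, float(n), theta)
    print(n, np.mean((est - theta) ** 2))   # grows with n
```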

Optimality

From the viewpoint of decision theory, we can further generalize the evaluation criterion. Previously we used the squared loss, but we can instead use a more general loss function $L(\hat{\theta}(X_1,\ldots,X_n),\theta)$, for example the KL divergence between $p(X;\theta)$ and $p(X;\hat{\theta})$.

Then the average error across samples is called the risk, $R(\theta,\hat{\theta})=\mathbb{E}[L(\hat{\theta},\theta)]$, where the expectation is taken with respect to $X_1,\ldots,X_n$ while keeping $\theta$ fixed.

Considering $\theta$ across the whole parameter space $\Theta$, we can give general evaluation metrics for estimators:

  • Maximum risk: $\overline{R}(\hat{\theta})=\sup\limits_{\theta\in\Theta}R(\hat{\theta},\theta)$.
  • Bayes risk: $B_{\pi}(\hat{\theta})=\mathbb{E}_{\pi}[R(\hat{\theta},\theta)]$ for some prior $\pi$ on $\theta$. Both quantities are computed for a concrete example in the sketch below.
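As a small numerical illustration (the Bernoulli model, the two estimators, and $n=20$ are my own choices), the exact squared-error risks of the sample mean and of the Beta$(1,1)$ posterior mean can be compared under both criteria:

```python
import numpy as np

n = 20
p = np.linspace(0.0, 1.0, 1_001)   # grid over the parameter space

# Exact squared-error risk functions (no Monte Carlo needed here):
risk_mean = p * (1 - p) / n                                        # sample mean
risk_post = (n * p * (1 - p) + (1 - 2 * p) ** 2) / (n + 2) ** 2    # Beta(1,1) posterior mean

for name, r in [("sample mean", risk_mean), ("Beta(1,1) posterior mean", risk_post)]:
    max_risk = r.max()        # approximates the maximum risk sup_p R(p, estimator)
    bayes_risk = r.mean()     # approximates the Bayes risk under a uniform prior on p
    print(f"{name}: max risk = {max_risk:.5f}, Bayes risk = {bayes_risk:.5f}")
```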

Then some optimality criteria are available for choosing good estimators:

  • Minimax-optimal risk: $R_n=\inf\limits_{\hat{\theta}}\overline{R}(\hat{\theta})$, which is attained by minimax estimators.
  • Bayes-optimal risk: $R_n=\inf\limits_{\hat{\theta}}B_{\pi}(\hat{\theta})$, which is attained by Bayes estimators.

Bayes Risk and Bayes Estimators

One can show that $$\begin{align} B_{\pi}(\hat{\theta})&=\mathbb{E}_{\pi}[\mathbb{E}_{X_1,\ldots,X_n\mid \theta}[L(\hat{\theta},\theta)]]\\ &=\mathbb{E}_{X_1,\ldots,X_n}[\mathbb{E}_{\theta\mid X_1,\ldots,X_n}[L(\hat{\theta},\theta)]], \end{align}$$ if we can interchange the integrals. Thus, if $\hat{\theta}(X_1,\ldots,X_n)$ minimizes the posterior expected loss $\mathbb{E}_{\theta\mid X_1,\ldots,X_n}[L(\hat{\theta},\theta)]$ (as a function of $X_1,\ldots,X_n$), then it is a Bayes-optimal estimator.

  • If $L(\hat{\theta},\theta)=(\hat{\theta}-\theta)^2$, then the Bayes estimator is given by $\hat{\theta}=\mathbb{E}[\theta\mid X_1,\ldots,X_n]$.
  • If $L(\hat{\theta},\theta)=|\hat{\theta}-\theta|$, then the Bayes estimator is given by the median of the posterior $\theta\mid X_1,\ldots,X_n$.
  • If $L(\hat{\theta},\theta)=\mathbf{1}_{\{\hat{\theta}\neq\theta\}}$, then the Bayes estimator is given by the mode of the posterior $\theta\mid X_1,\ldots,X_n$ (all three cases are illustrated in the sketch below).
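A small sketch of all three rules with a conjugate Beta posterior (the simulated data, the Beta$(2,2)$ prior, and the use of scipy.stats are my own choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.3, size=50)            # Bernoulli data

a, b = 2.0, 2.0                              # Beta(2,2) prior, chosen for the sketch
a_post, b_post = a + x.sum(), b + len(x) - x.sum()
posterior = stats.beta(a_post, b_post)       # closed-form posterior

print(posterior.mean())                      # Bayes estimator under squared loss
print(posterior.median())                    # Bayes estimator under absolute loss
print((a_post - 1) / (a_post + b_post - 2))  # posterior mode (MAP), for the 0-1-type loss
```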

Computing Minimax Optimal Estimators

There are two ways to compute minimax optimal estimators:

  • Bound the minimax risk.
    • Upper bound minimax risk by any $\tilde{\theta}$: $M\triangleq\inf\limits_{\hat{\theta}}\sup\limits_{\theta}R(\hat{\theta},\theta)\leq\sup\limits_{\theta}R(\tilde{\theta},\theta)$.
    • Lower bound minimax risk by the Bayes risk: $M\geq B_{\pi}(\hat{\theta}_{\pi})$.
  • Find the worst-case prior $\pi$, the least favorable prior, i.e., one whose Bayes estimator $\hat{\theta}_{\pi}$ satisfies $\sup\limits_{\theta}R(\hat{\theta}_{\pi},\theta)\leq B_{\pi}(\hat{\theta}_{\pi})$; then $\hat{\theta}_{\pi}$ is minimax.

If the risk of the Bayes estimator $\hat{\theta}_{\pi}$ is constant in $\theta$ for some prior $\pi$, then $\pi$ is least favorable and $\hat{\theta}_{\pi}$ is minimax.
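A classical illustration of this constant-risk argument (a standard textbook example, not derived in the original post): for Bernoulli data under squared error loss, the posterior mean under a Beta$(\sqrt{n}/2,\sqrt{n}/2)$ prior has risk $\frac{1}{4(\sqrt{n}+1)^2}$ for every $p$, and is therefore minimax. A quick numerical check:

```python
import numpy as np

n = 25
p = np.linspace(0.0, 1.0, 1_001)

# Posterior mean under the Beta(sqrt(n)/2, sqrt(n)/2) prior:
#   p_hat = (S + sqrt(n)/2) / (n + sqrt(n)),  S = number of successes.
# Its exact squared-error risk is (n p(1-p) + (sqrt(n)/2 - sqrt(n) p)^2) / (n + sqrt(n))^2,
# which simplifies to the constant 1 / (4 (sqrt(n) + 1)^2).
c = np.sqrt(n) / 2
risk = (n * p * (1 - p) + (c - np.sqrt(n) * p) ** 2) / (n + np.sqrt(n)) ** 2

print(risk.min(), risk.max())                # both equal...
print(1.0 / (4 * (np.sqrt(n) + 1) ** 2))     # ...this constant value
```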