Causal Inference
Published:
This post summarizes the basics of causal inference in the potential outcomes framework.
The Potential Outcomes Framework
Suppose we have two random variables $X$ and $Y$, where $Y$ is a binary outcome, and we are interested in determining whether $X$ causes $Y$. In the potential outcomes framework, each unit/sample $i$ has two potential outcomes, $Y_i(1)$ and $Y_i(0)$, corresponding to receiving or not receiving the treatment. However, for the $i$th unit we can only observe $Y_i^{obs}$, which is either $Y_i(1)$ or $Y_i(0)$, never both. We aim to use the pairs $(X_i,Y_i^{obs})$ to estimate quantities related to the causal effect of the treatment.
The causal estimands include
- The unit-level causal effects: $Y_i(1)-Y_i(0)$, $Y_i(1)/Y_i(0)$, etc.
- The average treatment effect (ATE) $\tau=\mathbb{E}[Y(1)-Y(0)]$, whose sample analogue is $\hat{\tau}=\frac{1}{n}\sum\limits_{i=1}^n (Y_i(1)-Y_i(0))$.
- The average treatment effect over a sub-population $S$: $\hat{\tau}_S=\frac{1}{|S|}\sum\limits_{i=1}^n (Y_i(1)-Y_i(0))\mathbf{1}\{i\in S\}$.
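To make the estimands concrete, here is a minimal NumPy sketch. It assumes (hypothetically) that we can see the full potential-outcome table $(Y_i(0), Y_i(1))$ for every unit, which is never possible in practice; all distributions and names below are illustrative choices, not from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical full potential-outcome table (never observable in practice):
# Y(0) ~ Bernoulli(0.3); the treatment can flip some zeros to ones.
y0 = rng.binomial(1, 0.3, size=n)
y1 = np.maximum(y0, rng.binomial(1, 0.2, size=n))

unit_effects = y1 - y0               # unit-level causal effects Y_i(1) - Y_i(0)
tau_hat = unit_effects.mean()        # sample analogue of tau = E[Y(1) - Y(0)]
# True effect here is P(Y(0)=0) * 0.2 = 0.7 * 0.2 = 0.14.

S = rng.binomial(1, 0.5, size=n).astype(bool)  # an arbitrary sub-population S
tau_S = unit_effects[S].mean()       # average effect over the sub-population
```

Since $S$ was drawn independently of the potential outcomes, $\hat{\tau}_S$ should be close to $\hat{\tau}$ here; for a sub-population defined by covariates, the two can differ.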
Suppose there is a causal link between a binary treatment $W$ and an outcome $Y$. For a particular unit, if $W_i= 1$ we say that the unit is treated, and if $W_i = 0$ the unit is in the control group. What we observe is $$Y_i^{obs}=W_iY_i(1)+(1-W_i)Y_i(0).$$ We can estimate the association $$\alpha=\mathbb{E}[Y(1)\mid W=1]-\mathbb{E}[Y(0)\mid W=0]$$ by the estimator $$\hat{\alpha}=\frac{1}{|T|}\sum\limits_{i\in T}Y_i^{obs}-\frac{1}{|C|}\sum\limits_{i\in C}Y_i^{obs}$$ where $T=\{i:W_i=1\}$ and $C=\{i:W_i=0\}$.
In general, the association does not equal the average causal effect, i.e., $\alpha\neq\tau$; the discrepancy is known as selection bias. However, in randomized controlled trials (RCTs), which enforce the condition $$W\perp\!\!\!\!\perp (Y(1),Y(0)),$$ we have $\alpha=\tau$.
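The gap between $\alpha$ and $\tau$ can be seen in a small simulation. This is a sketch under assumed distributions: a hidden variable $U$ raises both potential outcomes and, in the confounded scenario, the chance of treatment; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Potential outcomes with a confounder U that raises both Y(0) and Y(1).
u = rng.binomial(1, 0.5, size=n)
y0 = rng.binomial(1, 0.2 + 0.4 * u, size=n)
y1 = rng.binomial(1, 0.4 + 0.4 * u, size=n)
tau = (y1 - y0).mean()                       # sample ATE, about 0.2

def diff_in_means(w, y0, y1):
    """Compute alpha-hat from the observed data only."""
    y_obs = w * y1 + (1 - w) * y0            # Y^obs = W Y(1) + (1 - W) Y(0)
    return y_obs[w == 1].mean() - y_obs[w == 0].mean()

# RCT: W independent of (Y(1), Y(0)), so alpha-hat recovers tau.
w_rct = rng.binomial(1, 0.5, size=n)
alpha_rct = diff_in_means(w_rct, y0, y1)

# Confounded assignment: units with U = 1 are more likely to be treated,
# so the treated group has better outcomes for non-causal reasons.
w_conf = rng.binomial(1, 0.2 + 0.6 * u, size=n)
alpha_conf = diff_in_means(w_conf, y0, y1)   # biased well above tau
```

Under the confounded assignment, a short conditional-probability computation gives $\alpha\approx 0.44$ while $\tau\approx 0.2$, so the naive difference in means roughly doubles the true effect.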
Confounding
Identification of $\tau$
For most studies, we do not have randomized trials and only have observational data. To make causal inference from observational data possible in this case, we need to assume no unmeasured confounding, also known as selection on observables or ignorability. Formally, we assume $$W\perp\!\!\!\!\perp (Y(1),Y(0))\mid X,$$ where $X$ is a vector of observed covariates.
Note that under the previous assumption, we have $$\begin{align*} \tau & = \mathbb{E}[Y(1)-Y(0)]\\ &= \mathbb{E}[\mathbb{E}[Y(1)-Y(0)\mid X]]\\ &= \mathbb{E}[\mathbb{E}[Y(1)\mid X,W=1]]-\mathbb{E}[\mathbb{E}[Y(0)\mid X,W=0]]\\ &= \mathbb{E}[\mathbb{E}[Y^{obs}\mid X,W=1]]-\mathbb{E}[\mathbb{E}[Y^{obs}\mid X,W=0]], \end{align*}$$ where the third equality uses ignorability: conditional on $X$, the potential outcomes are independent of $W$, so conditioning further on $W$ does not change their conditional means. Thus, we are able to estimate $\tau$ from the observed data $(X,W,Y^{obs})$ instead of the original data $(X,W,Y(1),Y(0))$.
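For a discrete covariate, the identification formula can be applied directly by stratifying on $X$: estimate $\mathbb{E}[Y^{obs}\mid X, W=1]-\mathbb{E}[Y^{obs}\mid X, W=0]$ within each stratum and average over the marginal distribution of $X$. A minimal sketch with a binary covariate (all distributions assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Binary covariate X drives both treatment and outcome (confounding).
x = rng.binomial(1, 0.5, size=n)
w = rng.binomial(1, 0.2 + 0.6 * x, size=n)   # P(W=1 | X) depends on X
y0 = rng.binomial(1, 0.2 + 0.4 * x, size=n)
y1 = rng.binomial(1, 0.4 + 0.4 * x, size=n)
y_obs = w * y1 + (1 - w) * y0
tau = (y1 - y0).mean()                       # ground truth, about 0.2

# Stratified estimator: within-stratum difference in means, averaged
# over the empirical distribution of X.
tau_hat = 0.0
for val in (0, 1):
    m1 = y_obs[(x == val) & (w == 1)].mean()
    m0 = y_obs[(x == val) & (w == 0)].mean()
    tau_hat += (x == val).mean() * (m1 - m0)
```

Here the naive difference in means would be badly biased, but the stratified estimator recovers $\tau$ because conditioning on $X$ restores ignorability by construction.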
Estimation
There are two ways to estimate $\tau$ based on this idea.
The first approach is based on regression estimators $\hat{\mu}_1$ and $\hat{\mu}_0$ of $$\begin{align*} \mu_1(x)&= \mathbb{E}[Y^{obs}\mid X=x,W=1]\\ \mu_0(x)&=\mathbb{E}[Y^{obs}\mid X=x,W=0]. \end{align*}$$ Then we can compute the plug-in estimator: $$\hat{\tau}= \frac{1}{n}\sum\limits_{i=1}^n[\hat{\mu}_1(X_i)-\hat{\mu}_0(X_i)].$$ One approximately correct way to think about this is that we are using regression to impute the missing potential outcomes for each individual.
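The plug-in estimator can be sketched as follows. For illustration only, this uses a continuous outcome with a linear model fit separately in the treated and control groups (a linear probability model or logistic regression would play the same role for a binary $Y$); the data-generating process and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Continuous covariate; treatment and both potential outcomes depend on it.
x = rng.normal(size=n)
w = rng.binomial(1, 1 / (1 + np.exp(-x)), size=n)
y0 = x + rng.normal(size=n)
y1 = 2.0 + x + rng.normal(size=n)            # constant effect tau = 2
y_obs = w * y1 + (1 - w) * y0

def fit_linear(xs, ys):
    """Least-squares fit of y = a + b*x; returns the fitted prediction function."""
    A = np.column_stack([np.ones_like(xs), xs])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return lambda z: coef[0] + coef[1] * z

mu1_hat = fit_linear(x[w == 1], y_obs[w == 1])  # regress Y^obs on X among treated
mu0_hat = fit_linear(x[w == 0], y_obs[w == 0])  # ... and among controls

# Plug-in estimator: impute both potential outcomes for every unit.
tau_hat = np.mean(mu1_hat(x) - mu0_hat(x))      # close to 2
```

Note that each regression is fit only on the units actually observed in that arm, but predictions are averaged over all units, which is exactly the imputation of missing potential outcomes described above.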
The second approach is based on the Horvitz-Thompson estimator, or the inverse propensity score estimator. First, we define the propensity score by $$\pi(x)=\mathbb{P}(W=1\mid X=x),$$ which represents the probability that a unit with covariates $x$ receives treatment. Note that, $$\begin{align*} \mathbb{E}[W\mid X=x]&= \pi(x)\\ \mathbb{E}[1-W\mid X=x]&=1-\pi(x). \end{align*}$$ Then we have $$\begin{align*} \tau &= \mathbb{E}[\mathbb{E}[Y(1)-Y(0)\mid X]]\\ &= \mathbb{E}\left[\mathbb{E}\left[\frac{Y(1)W}{\pi(X)}-\frac{Y(0)(1-W)}{1-\pi(X)}\mid X\right]\right]\\ &= \mathbb{E}\left[\mathbb{E}\left[\frac{Y^{obs}W}{\pi(X)}\mid X\right]\right] - \mathbb{E}\left[\mathbb{E}\left[\frac{Y^{obs}(1-W)}{1-\pi(X)}\mid X\right]\right], \end{align*}$$ which can be estimated as $$\begin{align*} \hat{\tau} &= \frac{1}{n}\sum\limits_{i=1}^n \left[\frac{Y_i^{obs}W_i}{\pi(X_i)}-\frac{Y_i^{obs}(1-W_i)}{1-\pi(X_i)}\right]. \end{align*}$$ In practice, we need to estimate $\pi(x)$ by regression, and the weighting requires the overlap condition $0<\pi(x)<1$ for all $x$.
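The inverse propensity weighting estimator can be sketched on the same confounded setup used above. With a binary covariate, the propensity score can be estimated by the empirical treatment rate within each stratum of $X$; for continuous covariates, a logistic regression would typically replace this step. All distributions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

x = rng.binomial(1, 0.5, size=n)
w = rng.binomial(1, 0.2 + 0.6 * x, size=n)   # true propensity pi(x) = 0.2 + 0.6x
y0 = rng.binomial(1, 0.2 + 0.4 * x, size=n)
y1 = rng.binomial(1, 0.4 + 0.4 * x, size=n)
y_obs = w * y1 + (1 - w) * y0
tau = (y1 - y0).mean()                       # ground truth, about 0.2

# Estimate the propensity score by the empirical treatment rate in each
# stratum of X (a regression estimate would replace this in general).
pi_hat = np.array([w[x == 0].mean(), w[x == 1].mean()])[x]

# Horvitz-Thompson / inverse propensity weighting estimator.
tau_hat = np.mean(y_obs * w / pi_hat - y_obs * (1 - w) / (1 - pi_hat))
```

Intuitively, units that were unlikely to land in their observed arm are up-weighted, so each arm's weighted sample looks like the full population; this is why overlap ($\pi(x)$ bounded away from 0 and 1) matters, since near-zero propensities blow up the weights and the variance.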
Comments
We used potential outcomes to describe causality and causal inference; this language comes from the work of Neyman and was later developed by Rubin. However, there are other languages as well, for example, (causal) directed graphs, pioneered by Judea Pearl. That line of work also focuses on the much harder problem of causal discovery, which requires even stronger assumptions.