
Intro to proximal causal inference

This post is based on papers by Tchetgen Tchetgen and co-authors, and mostly follows Xu Shi’s talk on proximal causal inference.

Unobserved confounder

Unobserved confounding is a central problem for observational studies. Even after controlling for all observed confounders X, there may be an unobserved confounder U that affects both the treatment A and the outcome Y. The causal diagram is as follows:

In this case, regressing Y on A while controlling only for X gives a biased estimate of the treatment effect.

Proximal causal inference

Proximal causal inference (also known as the negative control approach) is a framework that allows us to estimate the causal effect of a treatment A on an outcome Y while accounting for unobserved confounders.

We’ll need a DAG like this:

Basically, we need two proxies of U to control for confounding. Z is called the Negative Control Exposure (NCE) and W is called the Negative Control Outcome (NCO). There is no direct link from Z to Y, and no direct link between W and either A or Z.

Z is an NCE if $Y(a,z)=Y(a)$ and $Z \perp\!\!\!\perp Y(a) \mid U$, meaning that Z satisfies the exclusion restriction (no direct effect of Z on Y) and that Z is independent of the potential outcome $Y(a)$ given U.

W is an NCO if $W(a,z)=W$ and $W \perp\!\!\!\perp (A,Z) \mid U$, meaning that neither A nor Z has a direct effect on W, and W is independent of (A, Z) given U.

Why this works

If we regress Y on A, the coefficient on A is $\beta_{ay}+\beta_{uy}\beta_{au}$, which is biased; here $\beta_{au}$ is the coefficient from regressing U on A. How do we recover $\beta_{ay}$?

Suppose we have W as a proxy of U with $\beta_{uw}=\beta_{uy}$. Then we can get the true effect: regress Y on A, and separately regress W on A. The first regression gives $\beta_{ay}+\beta_{uy}\beta_{au}$, the second gives $\beta_{uw}\beta_{au}$, and their difference recovers $\beta_{ay}$. This is similar to a difference-in-differences (DiD) setting: say W is the pre-treatment outcome $Y_{t-1}$, and assume U has the same effect on Y and $Y_{t-1}$; then the effect of A on Y is the difference between the coefficients from regressing Y on A and regressing $Y_{t-1}$ on A.
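
Here is a minimal sketch of this single-proxy argument (the variable names and coefficient values are made up for illustration, and the key assumption $\beta_{uw}=\beta_{uy}$ is built into the simulation):

# Single-proxy check: assumes U has the same effect on W and on Y
set.seed(1)
n <- 1e5
U <- rnorm(n)
A <- 0.5 * U + rnorm(n)           # treatment confounded by U
W <- 0.5 * U + rnorm(n)           # proxy with the same loading on U as Y
Y <- 0.5 * A + 0.5 * U + rnorm(n) # true effect of A on Y is 0.5
coef(lm(Y ~ A))["A"] - coef(lm(W ~ A))["A"] # approximately 0.5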

However, most likely $\beta_{uw} \neq \beta_{uy}$. Then we need another proxy Z of U to recover the true effect. This is also called the “double proxy” or “double negative control” approach.

A linear model

Suppose we have the following linear model:

$$E[Y \mid A,Z,U] = \beta_{y0} + \beta_{ay}A + \beta_{uy}U$$
$$E[W \mid A,Z,U] = \beta_{w0} + \beta_{uw}U$$
$$E[U \mid A,Z] = \beta_{u0} + \beta_{zu}Z + \beta_{au}A$$

$\beta_{ay}$ is the true effect of A on Y.

Taking expectations over U given (A, Z) (plugging the third equation into the first two), the observed-data regressions are

$$E[Y \mid A,Z] = \beta_{y0} + \beta_{ay}A + \beta_{uy}(\beta_{u0} + \beta_{zu}Z + \beta_{au}A)$$

$$E[W \mid A,Z] = \beta_{w0} + \beta_{uw}(\beta_{u0} + \beta_{zu}Z + \beta_{au}A)$$

From the last two regressions we can recover $\beta_{ay}$.

Define $\delta_{ay}=\beta_{ay}+\beta_{uy}\beta_{au}$ and $\delta_{zy}=\beta_{uy}\beta_{zu}$ (the coefficients on $A$ and $Z$ in the $Y$ regression), and $\delta_{aw}=\beta_{uw}\beta_{au}$ and $\delta_{zw}=\beta_{uw}\beta_{zu}$ (the coefficients on $A$ and $Z$ in the $W$ regression). Then

$$\beta_{ay} = \delta_{ay} - \frac{\delta_{zy}}{\delta_{zw}}\,\delta_{aw},$$

because $\delta_{zy}/\delta_{zw} = \beta_{uy}/\beta_{uw}$, so $(\delta_{zy}/\delta_{zw})\,\delta_{aw} = \beta_{uy}\beta_{au}$, which is exactly the bias term in $\delta_{ay}$.
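
A quick numerical check of this formula (a minimal sketch; the data-generating process mirrors the example at the end of the post and all coefficient values are arbitrary):

# Check the delta formula on simulated data
set.seed(42)
n <- 1e5
U <- rnorm(n)
Z <- 0.5 * U + rnorm(n)
W <- 0.5 * U + rnorm(n)
A <- 0.5 * U + 0.5 * Z + rnorm(n)
Y <- 0.5 * A + 0.5 * U + rnorm(n)         # true effect of A on Y is 0.5
d_y <- coef(lm(Y ~ A + Z))                # delta_ay = d_y["A"], delta_zy = d_y["Z"]
d_w <- coef(lm(W ~ A + Z))                # delta_aw = d_w["A"], delta_zw = d_w["Z"]
d_y["A"] - d_y["Z"] / d_w["Z"] * d_w["A"] # approximately 0.5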

Or, we can look at this as a two-stage least squares (2SLS) regression.
Since $E[W \mid A,Z] = \beta_{w0} + \beta_{uw}\,E[U \mid A,Z]$, we can substitute the predicted value of W for U (up to an intercept and a scale factor):

$$E[Y \mid A,Z] = \tilde{\beta}_{y0} + \beta_{ay}A + \frac{\beta_{uy}}{\beta_{uw}}\,E[W \mid A,Z],$$

where $\tilde{\beta}_{y0} = \beta_{y0} - \beta_{uy}\beta_{w0}/\beta_{uw}$. So we first regress W on A and Z, and then regress Y on A and the fitted values of W; the coefficient on A in the second stage is the true effect $\beta_{ay}$.

A nonparametric model

We don’t have to assume a linear model; this can be done with a nonparametric model as well.

$$E[Y(a)] = E[h(a,W)], \quad \text{where} \quad E[Y \mid A,Z] = E[h(A,W) \mid A,Z].$$

That is, we solve this integral equation for $h$, a function of $A$ and $W$, using $Z$, without assuming a parametric form. This $h$ is called the outcome bridge function. In the linear case, it is $h(A,W) = \beta_0 + \beta_a A + \beta_w W$.

How to find proximal variables

Tchetgen Tchetgen and co-authors proposed a Data-driven Automated Negative Control Estimation (DANCE) algorithm to find negative controls. The idea is to search the observed covariates, in a data-driven way, for pairs of variables that can serve as valid negative control exposures and outcomes, similar in spirit to data-driven selection of instrumental variables.

Example

Let’s generate data based on this DAG:

set.seed(123)
n <- 5000
# Set coefficients
b_uw <- 0.5 # effect of U on W
b_uy <- 0.5 # effect of U on Y
b_ay <- 0.5 # effect of A on Y
b_uz <- 0.5 # effect of U on Z
b_ua <- 0.5 # effect of U on A
b_za <- 0.5 # effect of Z on A

U <- rnorm(n)
W <- b_uw * U + rnorm(n) # W is a function of U
Z <- b_uz * U + rnorm(n) # Z is a function of U
A <- b_ua * U + b_za * Z + rnorm(n) # A is a function of U and Z
Y <- b_uy * U + b_ay * A + rnorm(n) # Y is a function of U and A
# A <- rbinom(n, 1, plogis(-1 + U / 2)) # alternatively, a binary treatment

# Create a data frame with the generated variables
data <- data.frame(W = W, Z = Z, A = A, Y = Y)

If we run a regression of Y on A, Z, and W:

m1 <- lm(Y ~ A + Z + W, data = data)
summary(m1)
## 
## Call:
## lm(formula = Y ~ A + Z + W, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5836 -0.7167  0.0005  0.7305  4.4866 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.006347   0.015279  -0.415    0.678    
## A            0.660021   0.014123  46.734  < 2e-16 ***
## Z            0.085004   0.016650   5.105 3.43e-07 ***
## W            0.132798   0.014165   9.375  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.08 on 4996 degrees of freedom
## Multiple R-squared:  0.4556,	Adjusted R-squared:  0.4553 
## F-statistic:  1394 on 3 and 4996 DF,  p-value: < 2.2e-16

We don’t get the true effect of A on Y: the estimate is about 0.66 instead of 0.5.

If we do the 2SLS using W and Z, we get the true effect.

# Linear (Y, W) Case
# Perform the first-stage linear regression W ~ A + Z
m1 <- lm(W ~ A + Z, data = data)
# Compute the estimated E[W|A,Z]
S <- m1$fitted.values
# Perform the second-stage linear regression Y ~ A + E[W|A,Z]
m2 <- lm(Y ~ A + S, data = data)
summary(m2)
## 
## Call:
## lm(formula = Y ~ A + S, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5477 -0.7309  0.0059  0.7395  4.3715 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.005049   0.015408  -0.328    0.743    
## A            0.489342   0.043210  11.325  < 2e-16 ***
## S            1.143480   0.199171   5.741 9.96e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.089 on 4997 degrees of freedom
## Multiple R-squared:  0.446,	Adjusted R-squared:  0.4458 
## F-statistic:  2012 on 2 and 4997 DF,  p-value: < 2.2e-16
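
The coefficient on A is now about 0.49, close to the true value of 0.5. The same estimate can also be obtained by estimating the linear outcome bridge function $h(A,W)$ from the nonparametric section: its moment condition $E[Y - h(A,W) \mid A,Z] = 0$ with a linear $h$ is just a 2SLS problem in which Z instruments for W. A minimal sketch, assuming the AER package is installed:

# Estimate the linear outcome bridge function h(A, W) = b0 + ba*A + bw*W
# by solving E[Y - h(A, W) | A, Z] = 0, i.e. 2SLS with Z instrumenting for W
library(AER)
m3 <- ivreg(Y ~ A + W | A + Z, data = data)
coef(m3)["A"] # same point estimate as the two-stage fit above, about 0.49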