
When is the OLS coefficient the ATE?

The OLS coefficient is not the ATE (most of the time)

Tymon Słoczyński has a cool paper (https://people.brandeis.edu/~tslocz/Sloczynski_paper_regression.pdf) that helped me understand OLS better.

A common setup in empirical studies:

\[ y = \alpha + \tau d + X \beta + \varepsilon \]

\(d\) is the treatment dummy and \(X\) is a vector of covariates. This model assumes a homogeneous treatment effect; that is, \(\tau\) is the ATE if the treatment effect is the same for everyone. If the effect is heterogeneous, \(\tau\) is no longer the ATE, although most people still interpret it that way. Throughout, we maintain the usual unconfoundedness assumption: no variables other than \(X\) drive both \(d\) and \(y\).
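
In potential-outcome notation (my shorthand here; the paper states the assumptions more carefully), unconfoundedness says that, conditional on \(X\), treatment is as good as random with respect to the potential outcomes:

\[ \big(y(1),\, y(0)\big) \perp\!\!\!\perp d \mid X \]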

In the paper, Słoczyński shows that when treatment effects are heterogeneous, \(\tau\) is actually a convex combination of the ATT and the ATU (the average treatment effect on the untreated). It can therefore be close to the ATE or quite far from it, depending on the weights on the ATT and the ATU. More surprisingly, he shows that the weights are roughly the group shares swapped: the more treated units there are, the less weight the ATT gets!

By definition,

\[ \tau_{ATE} = \rho \tau_{ATT} + (1- \rho) \tau_{ATU} \]

The main result of the paper is, to a good approximation (the exact weights also depend on how much the propensity score varies within each group):

\[ \tau = (1- \rho) \tau_{ATT} + \rho \tau_{ATU} \]

where \(\rho = P(d=1)\), the proportion of treated; \(\tau_{ATT} = E(y(1)-y(0) | d=1)\), \(\tau_{ATU} = E(y(1)-y(0) | d=0)\), and \(\tau_{ATE} = E(y(1)-y(0))\).

Comparing the two equations, \(\tau\) and \(\tau_{ATE}\) weight the ATT and the ATU in exactly opposite ways, and they coincide only when \(\rho = .5\) (or when the ATT and ATU happen to be equal). So with a balanced split between treatment and control, it is fine to interpret the OLS coefficient as the ATE; otherwise the two can be quite different, and the more unbalanced the sample, the larger the gap.
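
A quick illustration with made-up numbers (ATT = 2, ATU = \(-1\), \(\rho = 0.1\)) shows how far apart, and even opposite in sign, the two weighted averages can be:

rho <- 0.1; att <- 2; atu <- -1
rho * att + (1 - rho) * atu    # ATE: 0.1*2 + 0.9*(-1) = -0.7
(1 - rho) * att + rho * atu    # OLS coefficient: 0.9*2 + 0.1*(-1) = 1.7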

These results rest on the fact that the ATT, ATU, and ATE can be recovered if we know the true propensity score: including it as an additional covariate lets us estimate the heterogeneous effects. If the true propensity score is unknown, we can plug in an estimated propensity score instead.

Słoczyński provides an R and Stata package, hettreatreg.
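
If it isn't installed yet, it should be available from CRAN (I'm assuming the CRAN name matches the package name used below):

# install.packages("hettreatreg")   # uncomment to install from CRAN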

library(hettreatreg)
data("nswcps")

summary(lm(re78 ~ treated + age + age2 + educ + black + hispanic + married + nodegree  + re74 + re75, data = nswcps))
## 
## Call:
## lm(formula = re78 ~ treated + age + age2 + educ + black + hispanic + 
##     married + nodegree + re74 + re75, data = nswcps)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25130  -3601   1274   3668  55040 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7634.34415  736.67074  10.363  < 2e-16 ***
## treated      793.58704  548.25433   1.447  0.14778    
## age         -233.67749   41.18067  -5.674 1.42e-08 ***
## age2           1.81437    0.56099   3.234  0.00122 ** 
## educ         166.84923   28.65984   5.822 5.94e-09 ***
## black       -790.60856  213.24523  -3.708  0.00021 ***
## hispanic    -175.97512  218.99126  -0.804  0.42166    
## married      224.26599  149.84542   1.497  0.13450    
## nodegree     311.84453  178.51743   1.747  0.08068 .  
## re74           0.29534    0.01222  24.175  < 2e-16 ***
## re75           0.47064    0.01216  38.700  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7002 on 16166 degrees of freedom
## Multiple R-squared:  0.4762, Adjusted R-squared:  0.4758 
## F-statistic:  1469 on 10 and 16166 DF,  p-value: < 2.2e-16
outcome <- nswcps$re78
treated <- nswcps$treated
our_vars <- c("age", "age2", "educ", "black", "hispanic", "married", "nodegree", "re74", "re75")
covariates <- subset(nswcps, select = our_vars)

hettreatreg(outcome, treated, covariates, verbose = TRUE)
## 
## "OLS" is the estimated regression coefficient on treated. 
##  
##    OLS = 793.6 
##  
## P(d=1) = 0.011 
## P(d=0) = 0.989 
##  
##     w1 = 0.983 
##     w0 = 0.017 
##  delta = -0.971 
##  
##    ATE = -6751 
##    ATT = 928.4 
##    ATU = -6840 
##  
## OLS = w1*ATT + w0*ATU = 793.6 
## 
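
Plugging the reported (rounded) numbers back into the two weightings spells out the contrast: the w1/w0 combination reproduces the OLS coefficient, while the \(\rho\)-weighted combination reproduces the ATE, up to rounding.

rho <- 0.011; w1 <- 0.983; w0 <- 0.017
att <- 928.4; atu <- -6840
w1 * att + w0 * atu              # about 796, i.e. the OLS coefficient (793.6) up to rounding
rho * att + (1 - rho) * atu      # about -6755, i.e. the ATE (-6751) up to rounding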

To calculate the ATT, ATU, and ATE under heterogeneous treatment effects, the paper proceeds in three steps. First, estimate the propensity score equation, essentially

\[ d = X \gamma + u \]

Second, include the predicted propensity score as a covariate and estimate the outcome model separately for the treated group and the control group.

Third, predict both potential outcomes for the full sample and average their difference over the relevant groups; this gives the ATE, ATT, and ATU.

Let’s see with the same data:

m_propensity <- lm(treated ~ age + age2 + educ + black + hispanic + married + nodegree + re74 + re75, data = nswcps)  # step 1: propensity score via a linear probability model
ps <- predict(m_propensity)
df_combined <- data.frame(nswcps, ps)
df.1 <- df_combined[which(treated == 1), ]   # treated subsample (uses the 'treated' vector extracted above)
df.0 <- df_combined[which(treated == 0), ]   # control subsample
m_ot <- lm(re78 ~ ps, data = df.1)           # step 2: outcome on the propensity score, treated group
ot <- suppressWarnings(predict(m_ot, newdata = df_combined))  # step 3: predicted y(1) for the full sample

m_oc <- lm(re78 ~ ps, data = df.0)           # outcome on the propensity score, control group
oc <- suppressWarnings(predict(m_oc, newdata = df_combined))  # predicted y(0) for the full sample

te <- ot - oc                                # unit-level predicted treatment effect
ate <- mean(te)                              # average over everyone
df_combined$te <- te
att <- as.numeric(mean(df_combined[which(treated == 1), 'te']))  # average over the treated
atu <- as.numeric(mean(df_combined[which(treated == 0), 'te']))  # average over the untreated

print(paste("ate=", signif(ate, 4)))
## [1] "ate= -6751"
print(paste("att=", signif(att, 4)))
## [1] "att= 928.4"
print(paste("atu=", signif(atu, 4)))
## [1] "atu= -6840"

In this case the OLS coefficient is hugely different from the ATE. The reason is that \(P(d=1)\) is only about .01, roughly 1 percent: the control group is vastly larger than the treatment group. As a result, the OLS coefficient is far from the ATE and, in fact, pretty close to the ATT.

Now, what is the intuition for the OLS coefficient weighting the ATT and ATU in exactly the opposite way as the ATE? The reason is that OLS predicts actual outcomes, while the ATE requires predicting counterfactual outcomes. Suppose we have many control observations. Then we can estimate \(E(y(0) \mid d=0, X=x)\) much better than \(E(y(1) \mid d=1, X=x)\), which means we can also predict the counterfactual \(E(y(0) \mid d=1, X=x)\) well. So even though the treatment group has few observations, its missing counterfactual \(y(0)\) is well predicted, the ATT is well estimated, and it gets more weight. If the situation is reversed, the ATU is weighted more heavily. Hence the counter-intuitive weighting.
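
A small simulation sketch (my own illustration, not from the paper; the data-generating process and numbers are made up) makes this concrete: with a rare treatment whose effect increases in \(x\), the OLS coefficient lands much closer to the ATT than to the ATE.

set.seed(1)
n <- 100000
x <- rnorm(n)
p <- plogis(-3 + x)                  # treatment is rare, and more likely for high x
d <- rbinom(n, 1, p)
tau_i <- 1 + 2 * x                   # heterogeneous treatment effect
y <- x + d * tau_i + rnorm(n)
coef(lm(y ~ d + x))["d"]             # OLS coefficient: much closer to the ATT than to the ATE
mean(tau_i)                          # ATE: about 1 by construction
mean(tau_i[d == 1])                  # ATT: larger, because treated units tend to have high x
mean(tau_i[d == 0])                  # ATU: close to the ATE, because most units are untreated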

Regression adjustment

Słoczyński's method includes the propensity score as a covariate; the propensity score serves as a proxy for all of the covariates. As Rosenbaum and Rubin showed, conditioning on the propensity score is sufficient to adjust for all of the covariates. Another way to let the treatment effect vary with the covariates is to interact the treatment dummy with all of them; then I am doing "regression adjustment" (see Stata's teffects ra). We can compute the ATE, ATT, and ATU from the fitted linear model.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
lm1 <- lm(re78 ~ treated*(age + age2 +  educ + black + hispanic + married + nodegree  + re74 + re75), data = nswcps)
y1 <- predict(lm1,newdata=nswcps %>% mutate(treated=1))
y0 <- predict(lm1,newdata=nswcps %>% mutate(treated=0))
ate=mean(y1-y0) #ate
print(paste("ate=",  signif(ate, 4)))
## [1] "ate= -4930"
y1.att <- predict(lm1,newdata=nswcps %>% filter(treated==1))
y0.att <- predict(lm1,newdata=nswcps %>% filter(treated==1) %>% mutate(treated=0))
att=mean(y1.att)-mean(y0.att) #att
print(paste('att=',  signif(att, 4)))
## [1] "att= 796"
y1.atu <- predict(lm1,newdata=nswcps %>% filter(treated==0) %>% mutate(treated=1))
y0.atu <- predict(lm1,newdata=nswcps %>% filter(treated==0))
atu=mean(y1.atu)-mean(y0.atu) #atu
print(paste('atu=', signif(atu, 4)))
## [1] "atu= -4996"

The regression-adjustment estimates are somewhat different from Słoczyński's propensity-score method. But the point stands: the original OLS coefficient of about 794 is nowhere close to the ATE. It is, in fact, very close to the ATT, because the treatment group is so small.