OLS cannot be ATE (most of the time)
Tymon Słoczyński has a cool paper (https://people.brandeis.edu/~tslocz/Sloczynski_paper_regression.pdf) that helped me understand OLS better.
A common setup in empirical studies:
\[ y = \alpha + \tau d + X \beta + \epsilon \]
\(d\) is the treatment dummy and \(X\) are the covariates. This model assumes a homogeneous treatment effect; that is, \(\tau\) is the ATE if the treatment effect is the same for everyone. If it is not, then \(\tau\) is no longer the ATE, although most people still treat it as such. Throughout, we assume the usual unconfoundedness: no variables other than \(X\) drive both \(d\) and \(y\).
In this paper, Słoczyński shows that when treatment effects are heterogeneous, \(\tau\) is actually a convex combination of the ATT and the ATU (average treatment effect on the untreated). It can therefore be close to the ATE, or quite different from it, depending on the weights on the ATT and the ATU. More surprisingly, he shows that the weights are inversely related to the shares of treated and untreated units: the more treated units there are, the less weight the ATT gets!
By definition,
\[ \tau_{ATE} = \rho \tau_{ATT} + (1- \rho) \tau_{ATU} \]
The main result from the paper is:
\[ \tau = (1- \rho) \tau_{ATT} + \rho \tau_{ATU} \]
where \(\rho = P(d=1)\), the proportion of treated; \(\tau_{ATT} = E(y(1)-y(0) | d=1)\), \(\tau_{ATU} = E(y(1)-y(0) | d=0)\), and \(\tau_{ATE} = E(y(1)-y(0))\).
From these two equations, \(\tau\) and \(\tau_{ATE}\) generally differ unless \(\rho = 0.5\) (or the ATT and ATU happen to coincide). With a balanced split between treatment and control, it is harmless to interpret the OLS coefficient as the ATE; otherwise the two can be quite different, and the more unbalanced the sample, the larger the gap.
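For a quick illustration with made-up numbers (not from the paper): suppose \(\rho = 0.9\), \(\tau_{ATT} = 2\), and \(\tau_{ATU} = 10\). Then \(\tau_{ATE} = 0.9 \times 2 + 0.1 \times 10 = 2.8\), while the OLS estimand is \(\tau = 0.1 \times 2 + 0.9 \times 10 = 9.2\): the regression puts most of its weight on the effect for the smaller (untreated) group.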
These results rest on the fact that the ATT, ATU, and ATE can be recovered if we have the true propensity score: we can estimate the heterogeneous effects by including the propensity score as an additional covariate. If the true propensity score is unknown, we include an estimated propensity score instead.
Słoczyński provides a companion package for both R and Stata, “hettreatreg”.
library(hettreatreg)
data("nswcps")
summary(lm(re78 ~ treated + age + age2 + educ + black + hispanic + married + nodegree + re74 + re75, data = nswcps))
##
## Call:
## lm(formula = re78 ~ treated + age + age2 + educ + black + hispanic +
## married + nodegree + re74 + re75, data = nswcps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25130 -3601 1274 3668 55040
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7634.34415 736.67074 10.363 < 2e-16 ***
## treated 793.58704 548.25433 1.447 0.14778
## age -233.67749 41.18067 -5.674 1.42e-08 ***
## age2 1.81437 0.56099 3.234 0.00122 **
## educ 166.84923 28.65984 5.822 5.94e-09 ***
## black -790.60856 213.24523 -3.708 0.00021 ***
## hispanic -175.97512 218.99126 -0.804 0.42166
## married 224.26599 149.84542 1.497 0.13450
## nodegree 311.84453 178.51743 1.747 0.08068 .
## re74 0.29534 0.01222 24.175 < 2e-16 ***
## re75 0.47064 0.01216 38.700 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7002 on 16166 degrees of freedom
## Multiple R-squared: 0.4762, Adjusted R-squared: 0.4758
## F-statistic: 1469 on 10 and 16166 DF, p-value: < 2.2e-16
outcome <- nswcps$re78
treated <- nswcps$treated
our_vars <- c("age", "age2", "educ", "black", "hispanic", "married", "nodegree", "re74", "re75")
covariates <- subset(nswcps, select = our_vars)
hettreatreg(outcome, treated, covariates, verbose = TRUE)
##
## "OLS" is the estimated regression coefficient on treated.
##
## OLS = 793.6
##
## P(d=1) = 0.011
## P(d=0) = 0.989
##
## w1 = 0.983
## w0 = 0.017
## delta = -0.971
##
## ATE = -6751
## ATT = 928.4
## ATU = -6840
##
## OLS = w1*ATT + w0*ATU = 793.6
##
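Plugging the reported (rounded) numbers back into the decomposition: \(0.983 \times 928.4 + 0.017 \times (-6840) \approx 796\), which matches the reported OLS coefficient of 793.6 up to the rounding of the displayed weights. Nearly all of the weight is on the ATT, even though only about 1 percent of the sample is treated.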
To calculate the ATT, ATU, and ATE under heterogeneous treatment effects, the paper proceeds in three steps. First, estimate the propensity score equation (here a linear propensity score), basically
\[ d = X \gamma + u \]
Second, include the predicted propensity score as a covariate and estimate the outcome model separately for the treatment and control groups.
Third, predict the counterfactual outcomes for the full sample; averaging the differences gives the ATE, ATT, and ATU.
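In symbols (my notation, not the paper's), with \(\hat y_i(1)\) and \(\hat y_i(0)\) the predictions from the treated-group and control-group regressions on the estimated propensity score, and \(n_1\), \(n_0\) the numbers of treated and untreated units,
\[ \hat\tau_{ATE} = \frac{1}{n}\sum_{i}\left(\hat y_i(1) - \hat y_i(0)\right), \quad \hat\tau_{ATT} = \frac{1}{n_1}\sum_{i:\, d_i=1}\left(\hat y_i(1) - \hat y_i(0)\right), \quad \hat\tau_{ATU} = \frac{1}{n_0}\sum_{i:\, d_i=0}\left(\hat y_i(1) - \hat y_i(0)\right). \]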
Let’s see with the same data:
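# step 1: linear propensity score (a linear probability model), predicted for everyone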
m_propensity <- lm(treated ~ age + age2 + educ + black + hispanic + married + nodegree + re74 + re75, data=nswcps)
ps <- predict(m_propensity)
df_combined <- data.frame(nswcps, ps)
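# step 2: regress the outcome on the estimated propensity score, separately for treated and controls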
df.1 <- df_combined[which(treated == 1),]
df.0 <- df_combined[which(treated == 0),]
m_ot <- lm(re78 ~ ps, data = df.1)
ot <- suppressWarnings(predict(m_ot, newdata = df_combined)) # add newdata option to predict for full sample
m_oc <- lm(re78 ~ ps, data = df.0)
oc <- suppressWarnings(predict(m_oc, newdata = df_combined)) # add newdata option to predict for full sample
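# step 3: average the predicted counterfactual differences over the full sample, the treated, and the untreated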
te <- ot - oc
ate <- mean(te)
df_combined$te <- te
att <- as.numeric(mean(df_combined[which(treated == 1),'te']))
atu <- as.numeric(mean(df_combined[which(treated == 0),'te']))
print(paste("ate=", signif(ate, 4)))
## [1] "ate= -6751"
print(paste("att=", signif(att, 4)))
## [1] "att= 928.4"
print(paste("atu=", signif(atu, 4)))
## [1] "atu= -6840"
We see that in this case the OLS coefficient is hugely different from the ATE. The reason is that \(P(d=1)\) is only about 0.01, roughly 1 percent: the control group is vastly larger than the treatment group, so the OLS coefficient ends up far from the ATE and, in fact, quite close to the ATT.
Now, what is the intuition for the OLS coefficient weighting the ATT and ATU in exactly the opposite way from the ATE? The reason is that OLS is fit to predict actual outcomes, not counterfactual outcomes, while the ATE requires predicting counterfactuals. Suppose we have many control observations. Then we can predict \((y(0) \mid d=0)\) much better than \((y(1) \mid d=1)\): we get a good estimate of the relationship \((y(0) \mid d=0, X=x)\), and hence a good prediction of the counterfactual \((y(0) \mid d=1, X=x)\). So although the treatment group has fewer observations, its counterfactual \((y(0) \mid d=1)\) is well predicted, the ATT is well estimated, and it gets more weight. If the opposite holds, the ATU is weighted more heavily. Hence the counter-intuitive weighting.
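Here is a toy simulation of my own (not from the paper or the package) that makes the point concrete: treatment is rare, the effect is heterogeneous, and the coefficient on \(d\) in a regression of \(y\) on \(d\) and \(x\) lands near the ATT rather than the ATE.
set.seed(123)
n <- 100000
x <- rbinom(n, 1, 0.5)                     # one binary covariate
pd <- ifelse(x == 1, 0.10, 0.02)           # treatment is rare overall, rarer when x = 0
d <- rbinom(n, 1, pd)
tau_i <- ifelse(x == 1, 10, 0)             # heterogeneous effect: 10 if x = 1, 0 if x = 0
y <- 1 + 2 * x + tau_i * d + rnorm(n)
c(ATE = mean(tau_i),                       # about 5
  ATT = mean(tau_i[d == 1]),               # about 8.3
  ATU = mean(tau_i[d == 0]),               # about 4.8
  OLS = unname(coef(lm(y ~ d + x))["d"]))  # about 8.2, much closer to the ATT than to the ATE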
Regression adjustment
Słoczyński’s method includes the propensity score as a covariate; the propensity score serves as a proxy for all the covariates. As Rosenbaum and Rubin showed, the propensity score is a balancing score: conditioning on it is enough to balance all the covariates. Another way to let the treatment effect vary with the covariates is to interact the treatment dummy with every covariate; then I am doing “Regression Adjustment” (see Stata’s teffects ra). We can then calculate the ATE, ATT, and ATU from the fitted linear model.
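Concretely (in my notation), the interacted model is
\[ y = \alpha + \tau d + X \beta + (d \cdot X) \delta + \epsilon , \]
and the ATE, ATT, and ATU are obtained by predicting \(\hat y(1)\) and \(\hat y(0)\) for each observation from this fit and averaging the differences over the full sample, the treated, and the untreated, respectively.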
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
lm1 <- lm(re78 ~ treated*(age + age2 + educ + black + hispanic + married + nodegree + re74 + re75), data = nswcps)
y1 <- predict(lm1, newdata = nswcps %>% mutate(treated = 1))
y0 <- predict(lm1, newdata = nswcps %>% mutate(treated = 0))
ate <- mean(y1 - y0) # ATE
print(paste("ate=", signif(ate, 4)))
## [1] "ate= -4930"
y1.att <- predict(lm1, newdata = nswcps %>% filter(treated == 1))
y0.att <- predict(lm1, newdata = nswcps %>% filter(treated == 1) %>% mutate(treated = 0))
att <- mean(y1.att) - mean(y0.att) # ATT
print(paste('att=', signif(att, 4)))
## [1] "att= 796"
y1.atu <- predict(lm1, newdata = nswcps %>% filter(treated == 0) %>% mutate(treated = 1))
y0.atu <- predict(lm1, newdata = nswcps %>% filter(treated == 0))
atu <- mean(y1.atu) - mean(y0.atu) # ATU
print(paste('atu=', signif(atu, 4)))
## [1] "atu= -4996"
The estimates are somewhat different from those of Słoczyński’s propensity-score approach, but the point stands: the original OLS coefficient of 794 is nowhere near the ATE. Here it is very close to the ATT, because the treatment group is so small.