
Endogeneity with a censored variable

Bunching for the outcome variable

Recently I read Caetano et al (2020), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3699644. It’s about bunching.

Bunching is a phenomenon where we see a mass point in the distribution of some variable. For example, Saez (2010) studies a mass point in the distribution of earned income at the kink point of the tax schedule. This is because people bunch at the kink point to take advantage of the tax break.

Bunching is similar to regression discontinuity. In this blog, https://blogs.worldbank.org/en/impactevaluations/ready-set-bunch, “As noted by Kleven (2016), regression discontinuity (RD) is a close cousin of bunching estimators. In regression discontinuity, we maintain the assumption that there is no such “manipulation” as described above. Bunching relaxes this assumption — instead, we estimate the fraction of manipulators by estimating what densities of individuals would have been without manipulation, that is the “manipulation-free counterfactual”. With both the observed and manipulation-free counterfactual distributions of individuals estimated, it may be possible to compare the two distributions to recover the fraction of individuals who manipulated.”

The main idea is to compare the observed bunching with a counterfactual distribution without bunching, and use the difference to estimate elasticities, for example, how people react to tax rates.
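As a toy illustration of this idea (this is only a minimal sketch, not the estimator from any of the papers cited; the simulated income distribution, bin width, polynomial degree, and exclusion window are all arbitrary choices), we can fit a counterfactual density by fitting a polynomial to binned counts while excluding the bins distorted by bunching, then compare observed and predicted counts at the bunching point:

```r
library(dplyr)

set.seed(1)
# simulate income with bunching: people just above a kink at 10 bunch at it
latent <- rnorm(10000, mean = 10, sd = 3)
income <- ifelse(latent > 10 & latent < 11, 10, latent)

# bin into 0.5-wide bins
bins <- data.frame(income = income) %>%
  mutate(bin = floor(income * 2) / 2) %>%
  count(bin)

# counterfactual density: polynomial fit to bin counts,
# excluding the bins distorted by the manipulation
fit <- lm(n ~ poly(bin, 4), data = filter(bins, bin < 10 | bin >= 11))

# excess mass at the bunching bin = observed minus counterfactual count
bunch_bin <- filter(bins, bin == 10)
excess <- bunch_bin$n - predict(fit, newdata = bunch_bin)
```

The excess mass is positive at the bunching bin and is matched by missing mass just above it, which is the moment a bunching estimator exploits.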

Bunching for the treatment variable

A test of endogeneity

Caetano (2015) showed that, if the distribution of \(T_i\) has bunching at \(T_i = \bar T\), it is possible to test the exogeneity of \(T_i\). When one compares observations at the bunching point with those just around it, the treatment itself is very similar. Therefore, there cannot be more than a marginal difference in the outcome due to treatment variation, since the treatment hardly varies.

Caetano (2015)’s estimation can be done in two steps:

  1. estimate \(E[Y_i | T_i = \bar T, X_i ]\) non-parametrically;

  2. run a local linear regression of \(E[Y_i | T_i = \bar T, X_i ] - Y_i\) on \(T_i\) at \(T_i = \bar T\), using only observations with \(T_i > \bar T\).

The approach is known as the Discontinuity Test.
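The two steps can be sketched on simulated data. This is a minimal illustration under assumptions of my own, not the estimator from Caetano (2015): a linear fit in \(X\) stands in for the nonparametric first step, and the data-generating process and the 0.5 window above the bunching point are arbitrary choices.

```r
library(dplyr)

set.seed(2)
n <- 5000
X <- rnorm(n)
eta <- rnorm(n)                      # unobservable driving both T and Y
T_latent <- X + eta
T_obs <- pmax(T_latent, 0)           # bunching at T = 0
Y <- 2 * T_obs + X + 3 * eta + rnorm(n)  # T is endogenous through eta
dat <- data.frame(Y = Y, T = T_obs, X = X)

# step 1: estimate E[Y | T = 0, X] from the bunched observations
# (a linear fit in X stands in for a nonparametric estimator)
fit0 <- lm(Y ~ X, data = filter(dat, T == 0))

# step 2: regress E[Y | T = 0, X] - Y on T in a small window above 0;
# the intercept estimates the discontinuity at the bunching point
dat_pos <- filter(dat, T > 0, T < 0.5)
dat_pos$D <- predict(fit0, newdata = dat_pos) - dat_pos$Y
disc <- lm(D ~ T, data = dat_pos)
coef(disc)[["(Intercept)"]]
```

Here the bunched observations have systematically lower \(\eta\), so the intercept comes out clearly negative, signalling endogeneity; with an exogenous treatment it should be close to zero.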

Dummy test

Caetano and Maheshri (2018) proposed a simpler test: regress \(Y_i\) on \(T_i\), \(X_i\), and a dummy variable for \(T_i > \bar T\). The coefficient on the dummy variable is the test statistic. If the coefficient is significant, which is to say there is a discontinuity in the outcome, then \(T_i\) is endogenous.

This is simpler, but it comes with a linear functional form assumption.

The logic of the endogeneity test is that around the bunching point the treatment is very similar, so the outcome should be similar too. If there is a discontinuity in the outcome, there can only be two reasons. The first is that the treatment has a discontinuous effect on the outcome around the bunching point, which we can usually rule out on substantive grounds. The second is that an unobserved variable causes both the treatment and the outcome, and has a discontinuous effect on the outcome at the bunching point. In that case, the treatment is endogenous.

Treatment effect estimation

Caetano et al (2020) proposed a method to estimate the treatment effect when there is bunching. It seems to me there is no reason we cannot use it for censoring as well: there are many situations in which a continuous treatment variable is censored.

Suppose we have \[ Y = \beta T + Z \gamma + \delta \eta + \epsilon \]

where \(T\) is the treatment variable, \(\eta\) is an unobserved variable that affects both the treatment and the outcome, \(Z\) is a vector of control variables, and \(\epsilon\) is the error term. We only observe \(Y, T, Z\).

Suppose \(T^*\) is the latent variable, censored at zero: we observe \(T = \max(T^*, 0)\). We assume that \(T^*\) is continuously distributed over its support, with

\[ T^* = Z \pi + \eta \]

From the two equations we can derive the following. Substituting \(\eta = T^* - Z \pi\) into the outcome equation gives \(Y = \beta T + Z (\gamma - \pi \delta) + \delta T^* + \epsilon\). Taking expectations conditional on \((T, Z)\), and noting that \(T^* = T\) whenever \(T > 0\), we get

\[ E[Y | T, Z] = (\beta + \delta)T + Z (\gamma - \pi \delta) + \delta E[T^* | T^* \le 0, Z] 1(T=0) \] or

\[ E[Y | T, Z] = \beta T + Z (\gamma - \pi \delta) + \delta (T + E[T^* | T^* \le 0, Z] 1(T=0)) \]

Therefore we can “correct” for endogeneity by simply including the additional regressor \(T + E[T^* | T^* \le 0, Z] 1(T=0)\) in the regression; its coefficient estimates \(\delta\).

However, \(E[T^* | T^* \le 0, Z]\) concerns the censored part of the distribution, which is out of sample, so we have to make additional assumptions about the distribution of \(T^*\). For example, we can assume that \(T^*\) is normally distributed. Or we can merely assume it is symmetric, in which case we can use the right tail to estimate the left tail.
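Under the normality assumption, for example, the missing conditional expectation has a closed form. If \(\eta \sim N(0, \sigma^2)\), so that \(T^* | Z \sim N(Z\pi, \sigma^2)\) (the scale parameter \(\sigma\) is introduced here just for this formula), the truncated-normal mean gives

\[ E[T^* | T^* \le 0, Z] = Z\pi - \sigma \frac{\phi(Z\pi/\sigma)}{\Phi(-Z\pi/\sigma)} \]

where \(\phi\) and \(\Phi\) are the standard normal density and distribution functions, so the correction term could be computed from Tobit estimates of \(\pi\) and \(\sigma\).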

Simulation

Let’s do a simulation to see how it works.

library(MASS)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# set seed
set.seed(123)
# number of observations
n=1000
# generate Z and make it dataframe
Z=as.data.frame(mvrnorm(n=n, mu=c(0,0), Sigma=matrix(c(1,0.5,0.5,1),2,2)))
data <- Z
names(data) <- c("Z1", "Z2")
# generate eta
# eta is normally distributed
data$eta=rnorm(n, 1,1)
# generate T*
# T* is normally distributed
# Here I generate Y as a function of T*, we can also do a function of T.
data <- data %>%
  mutate(T_star=Z1 + 2*Z2 + 1.5*eta) %>%
  mutate(T=ifelse(T_star>0, T_star, 0)) %>%
  mutate(Y=2*T_star + 0.5*Z1 + 0.5*Z2 + 2*eta + rnorm(n, 0, 1))

# regress Y on T, Z1, Z2
lm1 <- lm(Y ~ T + Z1 + Z2, data=data)
summary(lm1)
## 
## Call:
## lm(formula = Y ~ T + Z1 + Z2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.3015  -1.5531   0.4095   2.1631   7.1942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.04656    0.19279  -5.429 7.14e-08 ***
## T            2.87428    0.07855  36.592  < 2e-16 ***
## Z1           0.50270    0.13406   3.750 0.000187 ***
## Z2           0.41043    0.16207   2.532 0.011480 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.295 on 996 degrees of freedom
## Multiple R-squared:  0.8273,	Adjusted R-squared:  0.8268 
## F-statistic:  1591 on 3 and 996 DF,  p-value: < 2.2e-16

Without including \(\eta\), the effect estimate is biased (2.87 instead of the true 2). Let’s do the dummy test.

lm2 <- lm(Y ~ T + Z1 + Z2 + I(T>0), data=data)
summary(lm2)
## 
## Call:
## lm(formula = Y ~ T + Z1 + Z2 + I(T > 0), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.8425  -1.1792   0.0749   1.5244   7.5902 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.02290    0.22946 -17.532  < 2e-16 ***
## T             2.66253    0.06848  38.880  < 2e-16 ***
## Z1           -0.03057    0.11874  -0.257  0.79688    
## Z2           -0.39400    0.14581  -2.702  0.00701 ** 
## I(T > 0)TRUE  4.99664    0.26634  18.760  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.834 on 995 degrees of freedom
## Multiple R-squared:  0.8724,	Adjusted R-squared:  0.8719 
## F-statistic:  1701 on 4 and 995 DF,  p-value: < 2.2e-16

The dummy variable is highly significant, which means \(T\) is endogenous.

Correcting for endogeneity

Now let’s assume \(T^*\) is symmetrically distributed. It is left-censored, and we assume less than half of the distribution is censored, so the median of \(T\) is the same as the median of \(T^*\). We can then use the right tail to estimate the left tail: by symmetry, a point \(x\) in the left tail corresponds to the point \(2M - x\) in the right tail, where \(M\) is the median of \(T^*\).

M <- median(data$T)
data2 <- data %>%
  filter(T>2*M)

lm3 <- lm(Y ~ T + Z1 + Z2, data=data2)

# here I create the extra term for the correction:
# when T > 0, extra is just T; when T = 0, it is 2M minus
# the right-tail prediction at the mirror point 2M - T.
data <- data %>%
  mutate(extra=if_else(T>0,T,2*M-predict(lm3, newdata=data %>% mutate(T=2*M -T))))

lm4 <- lm(Y ~ T + Z1 + Z2 + extra, data=data)
summary(lm4)
## 
## Call:
## lm(formula = Y ~ T + Z1 + Z2 + extra, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6228  -1.0492  -0.0294   1.1924   8.3056 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.94824    0.17624   5.380 9.26e-08 ***
## T            2.20954    0.06901  32.020  < 2e-16 ***
## Z1          -0.31782    0.11298  -2.813    0.005 ** 
## Z2          -1.14267    0.14570  -7.843 1.13e-14 ***
## extra        0.64428    0.02735  23.553  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.642 on 995 degrees of freedom
## Multiple R-squared:  0.8891,	Adjusted R-squared:  0.8887 
## F-statistic:  1995 on 4 and 995 DF,  p-value: < 2.2e-16

It does not recover the true effect exactly, but it is better than the model without the correction term.

Conclusion

We are making assumptions about the distribution of the continuous treatment variable. Symmetry is the weakest such assumption; we can also make fully parametric assumptions on the distribution of \(T^*\). This is a trade-off between making distributional assumptions and ignoring endogeneity. The good news is that we don’t need an instrument: we are using the censored part of the treatment variable to deal with endogeneity.