12 min read

Synthetic Control in R and Stata

Synthetic control

Abadie et al. have a few papers on synthetic control (https://economics.mit.edu/files/11859, https://economics.mit.edu/files/17847). The key idea is for a case study situation. A single unit, say a state/firm/country is exposed to a shock, and we are interested in the effect of that shock. For example, Card et al studied the effect of minimum wage increase for the state of New Jersey. They used diff-in-diff, with Pennsylvania as the control state. The point of synthetic control is that maybe parallel trend assumption does not hold for the control unit. Is Pennsylvania a good control for New Jersey? Does it follow the same trend as New Jersey? Maybe not. Synthetic control is to generate a more comparable control with some combination of multiple control units, which can have a much closer parallel trend as the treated unit, than using any of the control units.

Now, how to construct this synthetic control unit? Basically it is a weighted average of multiple control units. There are two sets of weights to be determined. One set for the control units, one set to determine the importance of predictors. Abadie et al (2010) propose to choose a set of \(w_i\)’s so that the resulting synthetic control best resembles the pre-intervention values for the treated unit of predictors of the outcome variable, subject to the constraints that the weights are non-negative and sum to 1. Another set of weights \(v_h\) for predictor importance is determined such that it minimizes the mean squared prediction error (MSPE) of this synthetic control with respect to the outcome in the pre-treatment period. The intuition of these weights are to pick \(w_i\)’s to make the weighted sum of each predictor as close to the treated unit’s value of predictor as close as possible, given a set of \(v_h\)’s. Then \(v_h\)’s are determined by how important each predictor is to predict the outcome. We don’t use any information post-treatment to determine the weights. But we can use pre-treatment outcome to see the relative importance of each predictor in terms of predictor outcome. We can do this by out-of-sample validation. Basically divide the pre-treatment period into training and validation periods (that’s one reason that the pre-treatment period cannot be too short). Iterate the process of choosing \(w_i\) and \(v_h\) to achieve a minimization of MSPE in the validation period.

inference

Since sythetic control is for case study situation, we have essentially one treated unit and one synthetic control. The usual inference method would not work. Abadie et al suggested a permutation based approach. The basic idea is to permute the label of treatment; in other words, suppose one control unit is now a treated unit. Then follow the same procedure to get the synthetic control for that “treated unit”; look at the effect. Now if the treatment is truely for the treated unit only, then there would not be any effect for all the control unit. Therefore we can have a distribution of treatment effects after permutation of treatment label.

an example

The canonical sythetic control is implemented in both R and Stata (https://web.stanford.edu/~jhain/synthpage.html). We use a package called “tidysynth” here to study one of the original examples, which evaluates the impact of Proposition 99 on cigarette consumption in California.

## Rows: 1,209
## Columns: 7
## $ state     <chr> "Rhode Island", "Tennessee", "Indiana", "Nevada", "Louisiana", "Oklahoma", "New Hampshire", "North Dakota", "Arkansas", "Virginia", "Illinoi…
## $ year      <dbl> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, 19…
## $ cigsale   <dbl> 123.9, 99.8, 134.6, 189.5, 115.9, 108.4, 265.7, 93.8, 100.3, 124.3, 124.8, 92.7, 65.5, 109.9, 93.4, 124.8, 104.3, 123.0, 106.4, 155.8, 128.5…
## $ lnincome  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ beer      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ age15to24 <dbl> 0.1831579, 0.1780438, 0.1765159, 0.1615542, 0.1851852, 0.1754592, 0.1707317, 0.1844660, 0.1690068, 0.1894216, 0.1669667, 0.1786787, 0.202077…
## $ retprice  <dbl> 39.3, 39.9, 30.6, 38.9, 34.3, 38.4, 31.4, 37.3, 36.7, 28.8, 41.4, 38.5, 34.6, 34.3, 36.2, 29.4, 39.1, 38.8, 40.4, 28.3, 38.0, 27.3, 34.0, 37…

## # A tibble: 7 × 4
##   variable     California synthetic_California donor_sample
##   <chr>             <dbl>                <dbl>        <dbl>
## 1 ln_income        10.1                  9.84         9.83 
## 2 ret_price        89.4                 89.4         87.3  
## 3 youth             0.174                0.174        0.173
## 4 beer_sales       24.3                 24.3         23.7  
## 5 cigsale_1975    127.                 127.         137.   
## 6 cigsale_1980    120.                 120.         138.   
## 7 cigsale_1988     90.1                 90.8        114.

## # A tibble: 39 × 8
##    unit_name     type    pre_mspe post_mspe mspe_ratio  rank fishers_exact_pvalue z_score
##    <chr>         <chr>      <dbl>     <dbl>      <dbl> <int>                <dbl>   <dbl>
##  1 California    Treated     3.94     390.       99.0      1               0.0256  5.13  
##  2 Georgia       Donor       3.48     174.       49.8      2               0.0513  2.33  
##  3 Virginia      Donor       5.86     171.       29.2      3               0.0769  1.16  
##  4 Indiana       Donor      18.4      415.       22.6      4               0.103   0.787 
##  5 West Virginia Donor      14.3      287.       20.1      5               0.128   0.646 
##  6 Connecticut   Donor      27.3      335.       12.3      6               0.154   0.202 
##  7 Nebraska      Donor       6.47      54.3       8.40     7               0.179  -0.0189
##  8 Missouri      Donor       9.19      77.0       8.38     8               0.205  -0.0199
##  9 Texas         Donor      24.5      160.        6.54     9               0.231  -0.125 
## 10 Idaho         Donor      53.2      340.        6.39    10               0.256  -0.133 
## # … with 29 more rows
## # A tibble: 31 × 3
##    time_unit real_y synth_y
##        <dbl>  <dbl>   <dbl>
##  1      1970   123     116.
##  2      1971   121     118.
##  3      1972   124.    123.
##  4      1973   124.    124.
##  5      1974   127.    126.
##  6      1975   127.    127.
##  7      1976   128     127.
##  8      1977   126.    125.
##  9      1978   126.    125.
## 10      1979   122.    122.
## # … with 21 more rows
## # A tibble: 1,209 × 5
##    .id        .placebo time_unit real_y synth_y
##    <chr>         <dbl>     <dbl>  <dbl>   <dbl>
##  1 California        0      1970   123     116.
##  2 California        0      1971   121     118.
##  3 California        0      1972   124.    123.
##  4 California        0      1973   124.    124.
##  5 California        0      1974   127.    126.
##  6 California        0      1975   127.    127.
##  7 California        0      1976   128     127.
##  8 California        0      1977   126.    125.
##  9 California        0      1978   126.    125.
## 10 California        0      1979   122.    122.
## # … with 1,199 more rows
## # A tibble: 1,482 × 50
##    .id        .placebo .type   time_unit California `Rhode Island` Tennessee Indiana Nevada Louisiana Oklahoma `New Hampshire` `North Dakota` Arkansas Virginia
##    <chr>         <dbl> <chr>       <dbl>      <dbl>          <dbl>     <dbl>   <dbl>  <dbl>     <dbl>    <dbl>           <dbl>          <dbl>    <dbl>    <dbl>
##  1 California        0 treated      1970       123              NA        NA      NA     NA        NA       NA              NA             NA       NA       NA
##  2 California        0 treated      1971       121              NA        NA      NA     NA        NA       NA              NA             NA       NA       NA
##  3 California        0 treated      1972       124.             NA        NA      NA     NA        NA       NA              NA             NA       NA       NA
##  4 California        0 treated      1973       124.             NA        NA      NA     NA        NA       NA              NA             NA       NA       NA
##  5 California        0 treated      1974       127.             NA        NA      NA     NA        NA       NA              NA             NA       NA       NA
##  6 California        0 treated      1975       127.             NA        NA      NA     NA        NA       NA              NA             NA       NA       NA
##  7 California        0 treated      1976       128              NA        NA      NA     NA        NA       NA              NA             NA       NA       NA
##  8 California        0 treated      1977       126.             NA        NA      NA     NA        NA       NA              NA             NA       NA       NA
##  9 California        0 treated      1978       126.             NA        NA      NA     NA        NA       NA              NA             NA       NA       NA
## 10 California        0 treated      1979       122.             NA        NA      NA     NA        NA       NA              NA             NA       NA       NA
## # … with 1,472 more rows, and 35 more variables: Illinois <dbl>, South Dakota <dbl>, Utah <dbl>, Georgia <dbl>, Mississippi <dbl>, Colorado <dbl>,
## #   Minnesota <dbl>, Texas <dbl>, Kentucky <dbl>, Maine <dbl>, North Carolina <dbl>, Montana <dbl>, Vermont <dbl>, Iowa <dbl>, Connecticut <dbl>, Kansas <dbl>,
## #   Delaware <dbl>, Wisconsin <dbl>, Idaho <dbl>, New Mexico <dbl>, West Virginia <dbl>, Pennsylvania <dbl>, South Carolina <dbl>, Ohio <dbl>, Nebraska <dbl>,
## #   Missouri <dbl>, Alabama <dbl>, Wyoming <dbl>, .predictors <list>, .synthetic_control <list>, .unit_weights <list>, .predictor_weights <list>,
## #   .original_data <list>, .meta <list>, .loss <list>

Generalized Synthetic Control

The idea of generalized SC is to combine both an interactive fixed effect model and synthetic control. It is basically a panel data synthetic control. It naturally takes multiple treated units and uses bootstrap to get standard errors. It generates ATT with standard errors.

Examples are here: https://yiqingxu.org/packages/gsynth/gsynth_examples.html

## Cross-validating ... 
##  r = 0; sigma2 = 1.84865; IC = 1.02023; PC = 1.74458; MSPE = 2.37280
##  r = 1; sigma2 = 1.51541; IC = 1.20588; PC = 1.99818; MSPE = 1.71743
##  r = 2; sigma2 = 0.99737; IC = 1.16130; PC = 1.69046; MSPE = 1.14540*
##  r = 3; sigma2 = 0.94664; IC = 1.47216; PC = 1.96215; MSPE = 1.15032
##  r = 4; sigma2 = 0.89411; IC = 1.76745; PC = 2.19241; MSPE = 1.21397
##  r = 5; sigma2 = 0.85060; IC = 2.05928; PC = 2.40964; MSPE = 1.23876
## 
##  r* = 2
## 
## 
Simulating errors .............
Bootstrapping ...
## ..........
##    user  system elapsed 
##   5.789   0.000   5.805
## Call:
## gsynth.formula(formula = Y ~ D + X1 + X2, data = simdata, index = c("id", 
##     "time"), force = "two-way", r = c(0, 5), CV = TRUE, se = TRUE, 
##     nboots = 1000, inference = "parametric", parallel = FALSE)
## 
## Average Treatment Effect on the Treated:
##         Estimate   S.E. CI.lower CI.upper p.value
## ATT.avg    5.544 0.2612    5.032    6.055       0
## 
##    ~ by Period (including Pre-treatment Periods):
##           ATT   S.E. CI.lower CI.upper   p.value n.Treated
## -19  0.392160 0.5383  -0.6629  1.44725 4.663e-01         0
## -18  0.276548 0.4453  -0.5963  1.14937 5.346e-01         0
## -17 -0.275393 0.5315  -1.3172  0.76640 6.044e-01         0
## -16  0.441201 0.4384  -0.4181  1.30047 3.142e-01         0
## -15 -0.889595 0.4784  -1.8273  0.04807 6.296e-02         0
## -14  0.593891 0.4595  -0.3067  1.49446 1.962e-01         0
## -13  0.528749 0.3850  -0.2259  1.28338 1.697e-01         0
## -12  0.171569 0.5457  -0.8980  1.24110 7.532e-01         0
## -11  0.610832 0.4625  -0.2956  1.51722 1.865e-01         0
## -10  0.170597 0.4726  -0.7557  1.09687 7.181e-01         0
## -9  -0.271892 0.5605  -1.3705  0.82670 6.276e-01         0
## -8   0.094843 0.5242  -0.9325  1.12223 8.564e-01         0
## -7  -0.651976 0.5524  -1.7346  0.43066 2.379e-01         0
## -6   0.573686 0.4650  -0.3377  1.48507 2.173e-01         0
## -5  -0.469686 0.4875  -1.4253  0.48589 3.354e-01         0
## -4  -0.077766 0.5481  -1.1521  0.99655 8.872e-01         0
## -3  -0.141785 0.5683  -1.2555  0.97198 8.030e-01         0
## -2  -0.157100 0.4065  -0.9539  0.63970 6.992e-01         0
## -1  -0.915575 0.5371  -1.9682  0.13706 8.824e-02         0
## 0   -0.003309 0.3453  -0.6801  0.67348 9.924e-01         0
## 1    1.235962 0.7176  -0.1706  2.64249 8.502e-02         5
## 2    1.630264 0.5545   0.5434  2.71708 3.282e-03         5
## 3    2.712178 0.5746   1.5860  3.83832 2.354e-06         5
## 4    3.466758 0.7110   2.0732  4.86029 1.083e-06         5
## 5    5.740132 0.5429   4.6760  6.80424 0.000e+00         5
## 6    5.280035 0.5479   4.2062  6.35391 0.000e+00         5
## 7    8.436485 0.4708   7.5137  9.35928 0.000e+00         5
## 8    7.839902 0.6607   6.5450  9.13476 0.000e+00         5
## 9    9.455115 0.5314   8.4135 10.49672 0.000e+00         5
## 10   9.638509 0.4949   8.6685 10.60852 0.000e+00         5
## 
## Coefficients for the Covariates:
##     beta    S.E. CI.lower CI.upper p.value
## X1 1.022 0.03143   0.9603    1.083       0
## X2 3.053 0.02916   2.9959    3.110       0
##              ATT      S.E.   CI.lower    CI.upper      p.value n.Treated
## -19  0.392159788 0.5383217 -0.6629313  1.44725089 4.663162e-01         0
## -18  0.276547958 0.4453278 -0.5962785  1.14937444 5.346005e-01         0
## -17 -0.275392926 0.5315388 -1.3171899  0.76640403 6.043850e-01         0
## -16  0.441201288 0.4384083 -0.4180632  1.30046578 3.142373e-01         0
## -15 -0.889595124 0.4784093 -1.8272602  0.04806992 6.295838e-02         0
## -14  0.593890957 0.4594830 -0.3066791  1.49446101 1.961771e-01         0
## -13  0.528749012 0.3850232 -0.2258825  1.28338055 1.696618e-01         0
## -12  0.171568737 0.5456898 -0.8979636  1.24110106 7.532119e-01         0
## -11  0.610832288 0.4624523 -0.2955575  1.51722209 1.865498e-01         0
## -10  0.170597468 0.4725947 -0.7556710  1.09686596 7.181140e-01         0
## -9  -0.271891657 0.5605161 -1.3704829  0.82669962 6.276240e-01         0
## -8   0.094842558 0.5241845 -0.9325402  1.12222527 8.564197e-01         0
## -7  -0.651975701 0.5523735 -1.7346080  0.43065656 2.378743e-01         0
## -6   0.573686472 0.4649980 -0.3376928  1.48506575 2.172999e-01         0
## -5  -0.469685905 0.4875497 -1.4252657  0.48589385 3.353668e-01         0
## -4  -0.077766449 0.5481285 -1.1520786  0.99654571 8.871777e-01         0
## -3  -0.141784521 0.5682564 -1.2555467  0.97197765 8.029679e-01         0
## -2  -0.157100323 0.4065397 -0.9539035  0.63970286 6.991761e-01         0
## -1  -0.915575087 0.5370691 -1.9682112  0.13706099 8.823878e-02         0
## 0   -0.003308833 0.3453058 -0.6800958  0.67347815 9.923545e-01         0
## 1    1.235962010 0.7176306 -0.1705682  2.64249222 8.501853e-02         5
## 2    1.630264312 0.5545081  0.5434484  2.71708022 3.281922e-03         5
## 3    2.712177702 0.5745724  1.5860365  3.83831895 2.354497e-06         5
## 4    3.466757691 0.7109978  2.0732275  4.86028786 1.083109e-06         5
## 5    5.740132310 0.5429224  4.6760239  6.80424073 0.000000e+00         5
## 6    5.280034526 0.5479067  4.2061571  6.35391194 0.000000e+00         5
## 7    8.436484821 0.4708215  7.5136917  9.35927791 0.000000e+00         5
## 8    7.839901526 0.6606565  6.5450387  9.13476439 0.000000e+00         5
## 9    9.455114684 0.5314416  8.4135084 10.49672101 0.000000e+00         5
## 10   9.638509457 0.4949127  8.6684985 10.60852043 0.000000e+00         5
##         Estimate      S.E. CI.lower CI.upper p.value
## ATT.avg 5.543534 0.2611512 5.031687 6.055381       0
##        beta       S.E.  CI.lower CI.upper p.value
## X1 1.021890 0.03142726 0.9602939 1.083487       0
## X2 3.052994 0.02915502 2.9958516 3.110137       0
## Parallel computing ...
## Cross-validating ... 
##  r = 0; sigma2 = 1.84865; IC = 1.02023; PC = 1.74458; MSPE = 2.37280
##  r = 1; sigma2 = 1.51541; IC = 1.20588; PC = 1.99818; MSPE = 1.71743
##  r = 2; sigma2 = 0.99737; IC = 1.16130; PC = 1.69046; MSPE = 1.14540*
##  r = 3; sigma2 = 0.94664; IC = 1.47216; PC = 1.96215; MSPE = 1.15032
##  r = 4; sigma2 = 0.89411; IC = 1.76745; PC = 2.19241; MSPE = 1.21397
##  r = 5; sigma2 = 0.85060; IC = 2.05928; PC = 2.40964; MSPE = 1.23876
## 
##  r* = 2
## 
## 
Simulating errors ...
Bootstrapping ...
## 
##    user  system elapsed 
##   1.317   0.002  19.123
##           CATT      S.E.   CI.lower   CI.upper p.value
## 0 -0.003308833 0.3266574 -0.5773503  0.7000896   0.918
## 1  1.232653177 0.8510226 -0.3575374  2.9809081   0.142
## 2  2.862917489 1.0614389  0.8444397  4.9409303   0.008
## 3  5.575095192 1.1981878  3.2156968  7.9073223   0.000
## 4  9.041852883 1.5605409  5.9440932 12.1205905   0.000
## 5 14.781985192 1.8420618 11.1454249 18.2272044   0.000
##         CATT      S.E.   CI.lower  CI.upper p.value
## 0 -0.2277091 0.3850916 -0.9324429 0.5775746   0.556
## 1  2.2491804 0.8499516  0.5569541 3.8722829   0.010
## 2  2.3930760 0.6816647  0.9888229 3.7416706   0.000
## 3  2.3067796 0.6929185  0.8417326 3.6171810   0.000
## 4  2.5812540 0.8621074  0.8097772 4.1672884   0.008
## 5  4.7445071 0.6666032  3.3288818 6.0082990   0.000