Why do we need spatial econometrics?
‘Everything is related to everything else, but near things are more related than distant things’ (Tobler 1970). This is the first law of geography. This is the reason why we need spatial econometrics. The spatial econometrics is to account for the spatial dependence in the data. The spatial dependence is the correlation between the observations that are close to each other. This is a common feature in many datasets. For example, in the housing market, the price of a house is related to the prices of the houses nearby. In the crime data, the crime rate in one neighborhood is related to the crime rate in the neighboring neighborhoods. In the health data, the health outcomes of people in one area are related to the health outcomes of people in the neighboring areas. In the environmental data, the pollution level in one area is related to the pollution level in the neighboring areas. In the economic data, the economic performance of one region is related to the economic performance of the neighboring regions. In the social network data, the behavior of one person is related to the behavior of the people in the neighboring areas.
Econometricians love OLS. However, the robustness of OLS relies on i.i.d. data. If our data is spatially correlated, then OLS is biased, without taking account of the spatial structure.
A genaral linear spatial model
where is the dependent variable, is the matrix of independent variables, is the vector of coefficients, and is the error term. The error term is spatially correlated.
We can see the major difference is . The term is called “spatially lagged y”, which in my opinion is misleading. This term is likely from time-series counter-part of lagged , but there is nothing about lags in the spatial data. is the spatial weights matrix. It is a matrix that describes the spatial structure of the data. The simplest example of is the case that we define an element of , to be 1 if unit and are neighbors, 0 if not.
In this simplest case, our model is saying we have a pattern we need to take account of: neighboring units are related, they seem to display similar pattern in terms of our study. This commonality of neighbors can be causal, for example, if we study the effect of minimum wage effect on employment, then if the neighboring state’s wage increase might negatively impact the focal state’s employment, because people go for higher wages. This is the spatial spillover effect. This can also be non-causal, for example, if we study the effect of temperature on crop yield, then the neighboring state’s temperature might be a good predictor of the focal state’s temperature. This is the spatial autocorrelation. In either case, we need to take account of the spatial structure in the data.
The reason behind the spatial pattern can be all kinds. It can be causing , or causing , or some unobserved factors causing both and , or some other reasons. Ignoring this spatial structure could make OLS biased. Betz et al (2020) show that ignoring spatial interdependence in the outcome results in asymptotically biased estimates even when instruments are randomly assigned.
Estimation with S2SLS for a less general model
Because we have the term , we introduced endogeneity. is necessarily correlated with , even if is not spatially correlated. Suppose is not spatially correlated.
The spatial two stage least squares (S2SLS) estimator is one solution. Usually we use and spatial lags of as instruments. Typically instruments is .
S2SLS is similar to 2SLS. The first stage is where and .
Then
The first stage is to regress on . The second stage is to regress on the predicted from the first stage and and . The coefficient of is the S2SLS estimator.
Estimation with a more general model
Suppose instead the error term is spatially correlated
Then the estimation is more complicated. The generalized spatial two stage least squares (GS2SLS) estimator is needed. See Kelejian and Prucha (1998) for details.
an example
This example is from R package “sphet”.
The Boston data contain information on property values and characteristics in the area of Boston, Massachusetts and has been widely used for illustrating spatial models. Specifically, there is a total of 506 units of observation for each of which a variety of attributes are avail- able, such as: (corrected) median values of owner-occupied homes (CMEDV); per-capita crime rate (CRIM); nitric oxides concentration (NOX); average number of rooms (RM); proportions of residential land zoned for lots over 25,000 sq. ft (ZN); proportions of non-retail business acres per town (INDUS); Charles River dummy variable (CHAS); proportions of units built prior to 1940 (AGE); weighted distances to five Boston employment centers (DIS); index of accessibil- ity to highways (RAD); property-tax rate (TAX); pupil-teacher ratios (PTRATIO); proportion of blacks (B); and % of the lower status of the population (LSTAT).
The dataset with Pace’s tract coordinates is available to the R community as part of “spdep”.
The spatial weights matrix is a sphere of influence neighbors list also available from spdep once the Boston data are loaded. Here the weight matrix comes with the data. Usually it’s the most heavy lifting part of the study.
library(sphet)
library(spdep)
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
## Loading required package: sf
## Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
data("boston", package = "spData")
listw <- nb2listw(boston.soi)
The function read.gwt2dist reads a ‘.GWT’ file (e.g., generated using the function distance). In this example we are using the matrix of tract point coordinates projected to UTM zone 19 boston.utm available from spdep to generate a ‘.GWT’ file of the 10 nearest neighbors.
id1 <- seq(1, nrow(boston.utm))
tmp <- distance(boston.utm, region.id = id1, output = TRUE,
type = "NN", nn = 10, shape.name = "shapefile",
region.id.name = "id1", firstline = TRUE,
file.name = "boston_nn_10.GWT")
coldist <- read.gwt2dist(file = "boston_nn_10.GWT", region.id = id1, skip = 1)
## Warning in read.gwt2dist(file = "boston_nn_10.GWT", region.id = id1, skip = 1):
## 1, 65 are not destinations
class(coldist)
## [1] "sphet" "distance" "nb" "GWT"
summary(coldist)
##
## Number of observations:
## n: 506
##
## Distance summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5441 0.9588 1.5843 2.0848 2.6389 11.6388
##
## Neighbors summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10 10 10 10 10 10
Spatial models in sphet are fitted by the functions stslshac and gstslshet. “stslshac” is Spatial two stages least square with HAC standard errors. “gstslshet” is GM estimation of a Cliff-Ord type model with Heteroskedastic Innovations, which is the general model we discussed before, with spatially correlated errors.
The function stslshac is used to estimate the S2SLS model.
res <- stslshac(log(CMEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + I(RM^2)
+ AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B + log(LSTAT),
data = boston.c, listw,
distance = coldist, type = "Triangular", HAC = TRUE)
summary(res)
## ======================================================
## ======================================================
## Spatial Lag Model
## HAC standard errors
## ======================================================
## ======================================================
##
## Call:
## stslshac(formula = log(CMEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) +
## I(RM^2) + AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B +
## log(LSTAT), data = boston.c, listw = listw, HAC = TRUE, distance = coldist,
## type = "Triangular")
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.5356002 -0.0758562 -0.0045074 0.0719613 0.7128012
##
## Coefficients:
## Estimate SHAC St.Er. t-value Pr(>|t|)
## Wy 0.45924669 0.05282792 8.6933 < 2.2e-16 ***
## (Intercept) 2.40246917 0.28952447 8.2980 < 2.2e-16 ***
## CRIM -0.00735568 0.00157665 -4.6654 3.081e-06 ***
## ZN 0.00036435 0.00034007 1.0714 0.2839913
## INDUS 0.00119920 0.00161139 0.7442 0.4567549
## CHAS1 0.01192878 0.03432896 0.3475 0.7282275
## I(NOX^2) -0.28873634 0.11796316 -2.4477 0.0143778 *
## I(RM^2) 0.00669906 0.00206524 3.2437 0.0011798 **
## AGE -0.00025810 0.00047774 -0.5403 0.5890173
## log(DIS) -0.16042849 0.03681622 -4.3575 1.315e-05 ***
## log(RAD) 0.07170438 0.01606094 4.4645 8.025e-06 ***
## TAX -0.00036857 0.00009780 -3.7685 0.0001642 ***
## PTRATIO -0.01295698 0.00394072 -3.2880 0.0010091 **
## B 0.00028845 0.00013032 2.2134 0.0268689 *
## log(LSTAT) -0.23984212 0.03454865 -6.9422 3.862e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The function gstslshet is used to estimate the GS2SLS model.
res <- gstslshet(log(CMEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + I(RM^2)
+ AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B + log(LSTAT),
data = boston.c, listw = listw, initial.value = 0.2)
summary(res)
##
## Call:
## gstslshet(formula = log(CMEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) +
## I(RM^2) + AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B +
## log(LSTAT), data = boston.c, listw = listw, initial.value = 0.2)
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.56939 -0.07316 -0.00168 0.00053 0.07150 0.74031
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 2.51316605 0.26749367 9.3952 < 2.2e-16 ***
## CRIM -0.00662744 0.00144522 -4.5858 4.523e-06 ***
## ZN 0.00038299 0.00036563 1.0475 0.2948733
## INDUS 0.00159352 0.00179772 0.8864 0.3753967
## CHAS1 -0.00447974 0.03689065 -0.1214 0.9033480
## I(NOX^2) -0.27295899 0.11561412 -2.3609 0.0182283 *
## I(RM^2) 0.00744059 0.00199637 3.7270 0.0001937 ***
## AGE -0.00045400 0.00045572 -0.9962 0.3191333
## log(DIS) -0.16517174 0.03484858 -4.7397 2.140e-06 ***
## log(RAD) 0.07453521 0.01752830 4.2523 2.116e-05 ***
## TAX -0.00041956 0.00010763 -3.8981 9.697e-05 ***
## PTRATIO -0.01412661 0.00410143 -3.4443 0.0005725 ***
## B 0.00035970 0.00011182 3.2168 0.0012964 **
## log(LSTAT) -0.24593826 0.03213364 -7.6536 1.954e-14 ***
## lambda 0.42407826 0.04463747 9.5005 < 2.2e-16 ***
## rho 0.29587455 0.08614291 3.4347 0.0005932 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In stata, you would use “spivregress” which replaced user-written “spivreg”. Stata has a set of commands for spatial econometrics. Check out stata’s “sp” manual.