Why do we need spatial econometrics?

‘Everything is related to everything else, but near things are more related than distant things’ (Tobler 1970). This is the first law of geography. This is the reason why we need spatial econometrics. The spatial econometrics is to account for the spatial dependence in the data. The spatial dependence is the correlation between the observations that are close to each other. This is a common feature in many datasets. For example, in the housing market, the price of a house is related to the prices of the houses nearby. In the crime data, the crime rate in one neighborhood is related to the crime rate in the neighboring neighborhoods. In the health data, the health outcomes of people in one area are related to the health outcomes of people in the neighboring areas. In the environmental data, the pollution level in one area is related to the pollution level in the neighboring areas. In the economic data, the economic performance of one region is related to the economic performance of the neighboring regions. In the social network data, the behavior of one person is related to the behavior of the people in the neighboring areas.

Econometricians love OLS. However, the robustness of OLS relies on i.i.d. data. If our data is spatially correlated, then OLS is biased, without taking account of the spatial structure.

A genaral linear spatial model

$y = λ W y + X β + W X γ + u$

$u = ρ W u + ϵ$

where $Y$ is the dependent variable, $X$ is the matrix of independent variables, $β$ is the vector of coefficients, and $u$ is the error term. The error term $u$ is spatially correlated.

We can see the major difference is $W$ . The term $W y$ is called “spatially lagged y”, which in my opinion is misleading. This term is likely from time-series counter-part of lagged $y$ , but there is nothing about lags in the spatial data. $W$ is the spatial weights matrix. It is a matrix that describes the spatial structure of the data. The simplest example of $W$ is the case that we define an element of $W$ , $w_{i, j}$ to be 1 if unit $i$ and $j$ are neighbors, 0 if not.

In this simplest case, our model is saying we have a pattern we need to take account of: neighboring units are related, they seem to display similar pattern in terms of our study. This commonality of neighbors can be causal, for example, if we study the effect of minimum wage effect on employment, then if the neighboring state’s wage increase might negatively impact the focal state’s employment, because people go for higher wages. This is the spatial spillover effect. This can also be non-causal, for example, if we study the effect of temperature on crop yield, then the neighboring state’s temperature might be a good predictor of the focal state’s temperature. This is the spatial autocorrelation. In either case, we need to take account of the spatial structure in the data.

The reason behind the spatial pattern can be all kinds. It can be $x_{i}$ causing $y_{j}$ , or $y_{i}$ causing $y_{j}$ , or some unobserved factors causing both $y_{i}$ and $y_{j}$ , or some other reasons. Ignoring this spatial structure could make OLS biased. Betz et al (2020) show that ignoring spatial interdependence in the outcome results in asymptotically biased estimates even when instruments are randomly assigned.

Estimation with S2SLS for a less general model

Because we have the term $W y$ , we introduced endogeneity. $W y$ is necessarily correlated with $u$ , even if $u$ is not spatially correlated. Suppose $u$ is not spatially correlated.

The spatial two stage least squares (S2SLS) estimator is one solution. Usually we use $X$ and spatial lags of $X$ as instruments. Typically instruments $H$ is $(X, W X, W^{2} X)$ .

S2SLS is similar to 2SLS. The first stage is $\hat{Z} = P Z = [X, \hat{W y}]$ where $\hat{W y} = P W y$ and $P = H (H^{'} H)^{- 1} H^{'}$ .

Then $γ_{S 2 S L S} = ({\hat{Z}}^{'} Z)^{- 1} {\hat{Z}}^{'} y$

The first stage is to regress $W Y$ on $H$ . The second stage is to regress $Y$ on the predicted $W Y$ from the first stage and $X$ and $W X$ . The coefficient of $X$ is the S2SLS estimator.

Estimation with a more general model

Suppose instead the error term is spatially correlated

$u = ρ W u + ϵ$

Then the estimation is more complicated. The generalized spatial two stage least squares (GS2SLS) estimator is needed. See Kelejian and Prucha (1998) for details.

an example

This example is from R package “sphet”.

The Boston data contain information on property values and characteristics in the area of Boston, Massachusetts and has been widely used for illustrating spatial models. Specifically, there is a total of 506 units of observation for each of which a variety of attributes are avail- able, such as: (corrected) median values of owner-occupied homes (CMEDV); per-capita crime rate (CRIM); nitric oxides concentration (NOX); average number of rooms (RM); proportions of residential land zoned for lots over 25,000 sq. ft (ZN); proportions of non-retail business acres per town (INDUS); Charles River dummy variable (CHAS); proportions of units built prior to 1940 (AGE); weighted distances to five Boston employment centers (DIS); index of accessibil- ity to highways (RAD); property-tax rate (TAX); pupil-teacher ratios (PTRATIO); proportion of blacks (B); and % of the lower status of the population (LSTAT).

The dataset with Pace’s tract coordinates is available to the R community as part of “spdep”.

The spatial weights matrix is a sphere of influence neighbors list also available from spdep once the Boston data are loaded. Here the weight matrix comes with the data. Usually it’s the most heavy lifting part of the study.

library(sphet)
library(spdep)

## Loading required package: spData

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

## Loading required package: sf

## Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE

data("boston", package = "spData")
listw <- nb2listw(boston.soi)

The function read.gwt2dist reads a ‘.GWT’ file (e.g., generated using the function distance). In this example we are using the matrix of tract point coordinates projected to UTM zone 19 boston.utm available from spdep to generate a ‘.GWT’ file of the 10 nearest neighbors.

id1 <- seq(1, nrow(boston.utm))
tmp <- distance(boston.utm, region.id = id1, output = TRUE,
                type = "NN", nn = 10, shape.name = "shapefile",
                region.id.name = "id1", firstline = TRUE,
                file.name = "boston_nn_10.GWT")
coldist <- read.gwt2dist(file = "boston_nn_10.GWT", region.id = id1, skip = 1)

## Warning in read.gwt2dist(file = "boston_nn_10.GWT", region.id = id1, skip = 1):
## 1, 65 are not destinations

class(coldist)

## [1] "sphet"    "distance" "nb"       "GWT"

summary(coldist)

## 
##  Number of observations:
##  n: 506 
## 
##  Distance summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5441  0.9588  1.5843  2.0848  2.6389 11.6388 
## 
##  Neighbors summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10      10      10      10      10      10

Spatial models in sphet are fitted by the functions stslshac and gstslshet. “stslshac” is Spatial two stages least square with HAC standard errors. “gstslshet” is GM estimation of a Cliff-Ord type model with Heteroskedastic Innovations, which is the general model we discussed before, with spatially correlated errors.

The function stslshac is used to estimate the S2SLS model.

res <- stslshac(log(CMEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + I(RM^2)
                + AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B + log(LSTAT),
                data = boston.c, listw,
                distance = coldist, type = "Triangular", HAC = TRUE)
summary(res)

## ======================================================
## ======================================================
##                    Spatial Lag Model 
##                   HAC standard errors
## ======================================================
## ======================================================
## 
## Call:
## stslshac(formula = log(CMEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + 
##     I(RM^2) + AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B + 
##     log(LSTAT), data = boston.c, listw = listw, HAC = TRUE, distance = coldist, 
##     type = "Triangular")
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -0.5356002 -0.0758562 -0.0045074  0.0719613  0.7128012 
## 
## Coefficients:
##                Estimate SHAC St.Er. t-value  Pr(>|t|)    
## Wy           0.45924669  0.05282792  8.6933 < 2.2e-16 ***
## (Intercept)  2.40246917  0.28952447  8.2980 < 2.2e-16 ***
## CRIM        -0.00735568  0.00157665 -4.6654 3.081e-06 ***
## ZN           0.00036435  0.00034007  1.0714 0.2839913    
## INDUS        0.00119920  0.00161139  0.7442 0.4567549    
## CHAS1        0.01192878  0.03432896  0.3475 0.7282275    
## I(NOX^2)    -0.28873634  0.11796316 -2.4477 0.0143778 *  
## I(RM^2)      0.00669906  0.00206524  3.2437 0.0011798 ** 
## AGE         -0.00025810  0.00047774 -0.5403 0.5890173    
## log(DIS)    -0.16042849  0.03681622 -4.3575 1.315e-05 ***
## log(RAD)     0.07170438  0.01606094  4.4645 8.025e-06 ***
## TAX         -0.00036857  0.00009780 -3.7685 0.0001642 ***
## PTRATIO     -0.01295698  0.00394072 -3.2880 0.0010091 ** 
## B            0.00028845  0.00013032  2.2134 0.0268689 *  
## log(LSTAT)  -0.23984212  0.03454865 -6.9422 3.862e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The function gstslshet is used to estimate the GS2SLS model.

res <- gstslshet(log(CMEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + I(RM^2)
                 + AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B + log(LSTAT),
                 data = boston.c, listw = listw, initial.value = 0.2)
summary(res)

## 
## Call:
## gstslshet(formula = log(CMEDV) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + 
##     I(RM^2) + AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B + 
##     log(LSTAT), data = boston.c, listw = listw, initial.value = 0.2)
## 
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.56939 -0.07316 -0.00168  0.00053  0.07150  0.74031 
## 
## Coefficients:
##                Estimate  Std. Error t-value  Pr(>|t|)    
## (Intercept)  2.51316605  0.26749367  9.3952 < 2.2e-16 ***
## CRIM        -0.00662744  0.00144522 -4.5858 4.523e-06 ***
## ZN           0.00038299  0.00036563  1.0475 0.2948733    
## INDUS        0.00159352  0.00179772  0.8864 0.3753967    
## CHAS1       -0.00447974  0.03689065 -0.1214 0.9033480    
## I(NOX^2)    -0.27295899  0.11561412 -2.3609 0.0182283 *  
## I(RM^2)      0.00744059  0.00199637  3.7270 0.0001937 ***
## AGE         -0.00045400  0.00045572 -0.9962 0.3191333    
## log(DIS)    -0.16517174  0.03484858 -4.7397 2.140e-06 ***
## log(RAD)     0.07453521  0.01752830  4.2523 2.116e-05 ***
## TAX         -0.00041956  0.00010763 -3.8981 9.697e-05 ***
## PTRATIO     -0.01412661  0.00410143 -3.4443 0.0005725 ***
## B            0.00035970  0.00011182  3.2168 0.0012964 ** 
## log(LSTAT)  -0.24593826  0.03213364 -7.6536 1.954e-14 ***
## lambda       0.42407826  0.04463747  9.5005 < 2.2e-16 ***
## rho          0.29587455  0.08614291  3.4347 0.0005932 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In stata, you would use “spivregress” which replaced user-written “spivreg”. Stata has a set of commands for spatial econometrics. Check out stata’s “sp” manual.

Spatial econometrics introduction

Why do we need spatial econometrics?

A genaral linear spatial model

Estimation with S2SLS for a less general model

Estimation with a more general model

an example