Masayuki Kudamatsu
14-15 June, 2016
Reason 1: GIS makes more research feasible
(Figure 2.3 of Dell 2009)
e.g. Ethnic homelands in Africa by Murdock (1959)
Digitized by Nunn (2008)
Used by Nunn & Wantchekon (2011), Michalopoulos & Papaioannou (2013, 2014, 2015), Alsan (2015), Alesina et al. (2016), etc.
(Figure 5A of Alsan (2015))
e.g. DMSP-OLS Nighttime Lights
Used by Henderson et al (2012), Pinkovskiy & Sala-i-Martin (2016), Hodler & Raschky (2014), Michalopoulos & Papaioannou (2013, 2014), Alesina et al (2016), etc.
Reason 2: GIS makes identification more credible
e.g. Peer effect estimation
(Figure 4 of Conley & Udry (2010))
e.g. Distance
(Figure 5 of Nunn 2008)
(Figure 2 of Dell (2010))
1. GIS basics
2. Create spatial datasets on your own
3. Merge spatial datasets
4. Elevation
5. Distance
6. Spatial regression discontinuity
7. Surface area
8. Map Algebra
Data type
Coordinate systems
GIS software
Spatial data comes in two different formats: Vector & Raster
How to edit data differs a lot between them
Comes in three formats:
File format: Shapefile (.shp)
Source: GAUL
Source: GAUL
Source: GREG
Source: Natural Earth
Source: gROADS
Source: Natural Earth
Source: Natural Earth
Source: ACLED
Each unit: called a feature
Comes w/ attribute table
Divides the earth surface into many "square" cells (or pixels)
Each cell contains one value
Resolution: 2.5 arc-minutes
Source: Gridded Population of the World
Resolution: 30 arc-seconds
Source: STRM30
Resolution: 30 arc-seconds
Source: DMSP-OLS
Earth is a sphere (approximately)
Various ways to two-dimensionally represent Earth
Each way corresponds to a coordinate system
To merge different spatial datasets accurately
cf. Apple Map did this wrong when it was launched in 2012
To calculate distance and surface area properly
Each location is coded by angle from earth center
e.g. Stockholm: 59.3293° N / 18.0686° E
Most popular: WGS 1984
Earth surface is projected by "light" from earth center on:
Cylinder
Earth surface is projected by "light" from earth center on:
Cone
Earth surface is projected by "light" from earth center on:
Plane
Each location: coded in meters from a certain origin
Examples (relevant for social scientists):
We will learn these projections later.
Map Projections: A Working Manual, by John P. Snyder (U.S. Geological Survey, 1987) (Downloadable for free)
ArcGIS
QGIS
$\Rightarrow$ ArcGIS recommended for the ease of use of Python (for replication), at least for now
R
Satellite images
Scanned old maps
Point locations
Grid cells
Images consist of pixels
Map each pixel's "color" into raster value
A lot of time (and money to hire experts) needed, though.
Some satellite images: freely available
Examples of constructing data from satellite images
# of districts w/i province $\uparrow$ $\Rightarrow$ Deforestation $\uparrow$
Theory:
Cannot rely on official stats of logging
$\Rightarrow$ Use satellite images
(Figure I of Burgess et al. 2012)
(Figure II of Burgess et al. 2012)
First, geo-reference the map
Then, create vector data by tracing lines w/ mouse
Also time-consuming but feasible with patience
Did Kenyan presidents build more roads for their co-ethnics?
Digitize Michelin maps for Kenya since 1961
Track road network expansion over time
(source: Remi Jedwab's presentation slide)
First, create a table in text format, where:
GPS receiver
Online gazetteer
geocode3
automates search
To convert the text file table into point vector data in ArcGIS, use:
These are the examples of geo-processing tools
import arcpy
input_table = "coordiates.txt"
output_shp = "points.shp"
varname_x = "longitude"
varname_y = "latitude"
coordinate_system = arcpy.SpatialReference(4326)
arcpy.MakeXYEventLayer_management(
input_table, varname_x, varname_y,
"Layer", coordinate_system)
arcpy.CopyFeatures_management("Layer", output_shp)
Any size of grid cell polygons can be created in ArcGIS
Useful for:
To create grid cell polygons in ArcGIS, use:
These are also the examples of geo-processing tools
import arcpy
output_shp = "gridcells25.shp"
cellsize = "2.5"
bottom_left = "-180 -65"
top_right = "180 85"
y_axis = "-180 -55"
coordinate_system = arcpy.SpatialReference(4326)
arcpy.CreateFishnet_management(
output_shp, bottom_left, y_axis,
cellsize, cellsize, "0", "0", top_right,
"NO_LABELS", "", "POLYGON")
arcpy.DefineProjection_management(
output_shp, coordinate_system)
Master ArcGIS = Know which geo-processing tools to use
Takes vector/raster data as inputs
Most will create new vector/raster data
Can be executed in Python
Introduce each geo-processing tool
Demonstrate how it's used by economists
Add new variables from a second vector data
Based on location
merge
One useful application: merge with weather data
Weather data: available at grid cell level
$\Rightarrow$ Spatial Join specifies which grid cells are relevant for each observation
import arcpy
target = "cities.shp"
join = "weather_data_cells.shp"
output = "cities_with_weather_data.shp"
arcpy.SpatialJoin_analysis(
target, join, output)
European colonization $\Rightarrow$ Economic development?
Sample: islands (geo-referenced)
IVs for duration of colonization:
Wind data: CERSAT (1° x 1°)
Tsetse flies $\Rightarrow$ Africa's underdevelopment?
Weather in 1871 at 2° x 2° resolution
Temperature & humidity fed into a model to predict Tsetse fly survival
Matched with Ethnographic Atlas
Intersect: Creates intersection features
Dissolve: Combines features by key variables
Can also create summary statistics
collapse
Can calculate # of polygons/polylines w/i zone polygon
import arcpy
inFeatures = ["counties.shp", "streams.shp"]
intersectOutput = "streams_by_country.shp"
dissolve_field = "county_id"
outFeatures = "counties_number_of_streams.shp"
arcpy.Intersect_analysis(
inFeatures, intersectOutput)
arcpy.Dissolve_management(
intersectOutput, outFeatures,
dissolve_field, "COUNT")
Competition $\Rightarrow$ School quality $\uparrow$?
IV: # of streams w/i city
In early 20c, Imperial China abolished 1300-year-old civil service exams
Prefectures w/ higher quota $\Rightarrow$ more uprisings during the 1911 Revolution
IV: # of streams w/i prefecture
Calculates sum stat of raster values w/i zone
Zone: defined by polygon or polyline
collapse
by zone, executed on raster data
For integer raster:
If unit of analysis is point, use either:
import arcpy
arcpy.CheckOutExtension("spatial")
# Inputs
elevation = "srtm30.dem"
districts = "district.shp"
# Intermediate
zonalstat = "zonalstat.dbf"
# Outputs
mean_elevation = "mean_elevation.xls"
arcpy.gp.ZonalStatisticsAsTable_sa(
districts, "dist_id", elevation,
zonalstat, "DATA", "MEAN")
arcpy.TableToExcel_conversion(
zonalstat, mean_slope)
Pixel-level | District-level | |
$\Rightarrow$ |
Obtain the mean cell value by:
Temperature $\uparrow$ $\Rightarrow$ Economic growth $\downarrow$?
Use population-weighted average temperature
(Figure 2.3 of Dell 2009)
Geographic diversity $\Rightarrow$ Ethnic diversity?
Empirical challenge:
$\Rightarrow$ 2.5°x2.5° grid cells as units of analysis
How to measure ethnic & geographic diversity, then?
ArcGIS helps:
Returns either $\theta$ or $\tan\theta$ in diagram below
$\tan\theta$ for cell $e$ is obtained by
\begin{align*} \Rightarrow \ \tan \theta = \sqrt{(dz/dx)^2+(dz/dy)^2} \end{align*}
where
\begin{align*} \frac{dz}{dx} = & \ \Big[\frac{c + 2f + i}{4} - \frac{a+2d+g}{4}\Big] / 2 \\ \frac{dz}{dy} = & \ \Big[\frac{a+2b+c}{4} - \frac{g + 2h + i}{4}\Big] / 2 \end{align*}
Electrification in South Africa (1996-2001)
$\Rightarrow$ Female labor supply $\uparrow$
IV: mean land slope
Slope (lighter = flatter) | Electrification (red) |
Tea production $\uparrow$ in China due to liberalization in 1979
$\Rightarrow$ Male-to-female ratio $\downarrow$
IV: mean land slope
Slope (darker = steeper) | Tea (darker = more) |
import arcpy
arcpy.CheckOutExtension("spatial")
# Inputs
elevation = "srtm30.dem"
districts = "district.shp"
# Intermediates
slope = "slope.tif"
zonalstat = "zonalstat.dbf"
arcpy.gp.Slope_sa(
elevation, slope, "PERCENT_RISE", "0,000009")
arcpy.gp.ZonalStatisticsAsTable_sa(
districts, "dist_id", slope,
zonalstat, "DATA", "MEAN")
Reclassify: Creates categorical raster data
Example: a dummy variable for slope 3-6%
import arcpy
arcpy.CheckOutExtension("spatial")
# Inputs
elevation = "srtm30.dem"
# Outputs
slope = "slope.tif"
slope_3_6 = "slope3to6.tif"
arcpy.gp.Slope_sa(
elevation, slope, "PERCENT_RISE", "0,000009")
arcpy.gp.Reclassify_sa(
slope, "Value",
"0 3 0;3 6 1;6 193,22999999999999 0",
slope_3_6, "DATA")
Irrigation dams $\Rightarrow$ Poverty $\downarrow$?
IV: Fraction of river areas in three slope ranges
$\Leftarrow$ Easy to build if river slope is:
River slope (darker = steeper) | Dams (darker = more) |
Intersect + Dissolve $\Rightarrow$ River by districts
Slope + Reclassify $\Rightarrow$ Indicator for each slope range
Zonal Statistics as Table $\Rightarrow$ Fraction of river areas in each slope range by district
Measures % of areas with slope > 15% as unsuitability for urban development
Finds housing supply inelastic in such areas
Slope of over 15%: now often used as geographic constraints to housing supply / urban development
Used by radio/tv enginneers to predict signal reception
(Figure 2 of Olken (2009))
# of TV channels $\uparrow$ in Indonesia $\Rightarrow$ Social capital $\downarrow$
IV: TV signal strength
Anti-Hutu radio $\Rightarrow$ Rwandan genocide incidents $\uparrow$
IV: radio signal strength
Coordinate systems for distance
Buffer
Feature To Point + Near
Generate Near Table
Two approaches:
1. Geographic coordinate systems
2. UTM projections
1. Obtain geographic coordinates (How?)
2. Use the Great Circle Distance formula
\begin{align*} d_{ij} =& \ 111.12 \times \cos^{-1}\big[\sin(La_i)\sin(La_j) \\ & + \cos(La_i)\cos(La_j)\cos(Lo_i-Lo_j)\big] \end{align*}
globdist
Obtain their centroids by:
shp2dta
Then use globdist
Obtain nearest point on polyline by:
Near tool calculates distance as well
(UTM = Universal Transverse Mercator)
Project earth surface onto the cylinder that is tangent on standard meridian
Father away from standard meridian, more distortion
To minimize distortion:
1. Divide Earth into 60 zones (6° wide in longitude)
2. For each zone, set standard meridian in the middle
3. Scale down distance along standard meridian by 0.9996
Once projected, use Pythagoras' Theorem
Tools such as Buffer and Near calculate distance
GEODESIC option for geographic coordinates
PLANAR option for UTM projections
e.g. River lengths, road distances
Geographic coordinates cannot be used
$\Rightarrow$ Use UTM projections
If the study area is large, cut polylines by UTM zones
Project
Add Field + Calculate Field with expression:
float(!shape.length!)
Project (for polylines)
Project Raster (for elevation raster)
Add Surface Information, with SURFACE_LENGTH option
Duflo & Pande (2007): length of rivers by district as control
Dell (2010): length of roads by district as intermediate outcome
Use ArcGIS tools such as:
See Melissa Dell's lecture note (pp. 18-25) for detail
Creates neighborhood polygons
Size of neighborhood can be specified
For points: a circle of $x$-meter/feet radius
For polylines: | |
For polygons: |
One useful application:
Identify other features in the neighborhood
$\Leftarrow$ Use buffer polygons as target features in Spatial Join with JOIN_ONE_TO_MANY options
import arcpy
# Identify other villages within 20-mile radius
in_features = "villages.shp"
radius = "20 miles"
buffer = "village_buffer.shp"
out_feature_class = "village_neighbors.shp"
arcpy.Buffer_analysis(in_features, buffer, radius)
arcpy.SpatialJoin_analysis(
buffer, in_features,
out_feature_class, "JOIN_ONE_TO_MANY")
Deworming school children $\Rightarrow$ Infection in neighborhood $\downarrow$?
Farmer's use of new technology $\Rightarrow$ Learning by friends?
Control for average use in neighborhood (1km radius)
Shale gas $\Rightarrow$ House price $\downarrow$?
Treatment: # of drilled wells w/i 2km radius of each house
Sample restriction: 1km buffer along water service area boundary
(Figure 6 of Muelenbachs et al. 2015)
Feature To Point : Creates centroids of input features
Near: Finds nearest feature to each input feature
Also calculates distance to it
If nearest features are polyline / polygon
$\Rightarrow$ Can obtain coordinate of nearest point
Slave trade $\Rightarrow$ Africa's underdevelopment?
IV for slave trade: Distance to nearest slave trade centers
Feature To Point $\Rightarrow$ Country centroids
Near w/ coastline $\Rightarrow$ Coordinates of nearest coastal point
Make XY Event Layer + Copy Feature $\Rightarrow$ Nearest coastal point
Near w/ slave trade centers $\Rightarrow$ Distance to nearest center
Location of slave trade centers:
Distance to slave markets $\neq$ Distance to other economic opportunities
Maybe a smaller impact if countries voluntarily engaged in slave trades
But LATE may be of more interest in this context
In Rwandan genocide:
Presence of armed groups $\Rightarrow$ civilian participation $\uparrow$?
Instrument:
Distance to main road
x
Rainfall along the dirt track to main road during genocide
Distance to main road
Rainfall along the dirt track to main road during genocide
Calculates distance to many features, besides nearest one
Population concentration around capital
$\Rightarrow$ US state govt quality $\uparrow$
Measure population concentration around capital by:
Air pollution $\Rightarrow$ Infant mortality in California?
Data:
(taken from Neidell 2004)
How to construct pollution measure at zip-code level?
Feature To Point
$\Rightarrow$ Zip-code zone centroids
Generate Near Table
$\Rightarrow$ Distance from zip-code centroid to each monitor
Average pollution by weighting each monitor with inverse distance (within 20 miles)
Forcing variable: two-dimensional vector (i.e. coordinates)
Cutoff: boundary lines
$\Rightarrow$ How should we extend the standard RD design?
\begin{align*} y_{i} = \beta T_{i} + \gamma Dist_i + \delta T_{i} Dist_i + \mu_s + \varepsilon_{i} \end{align*}
Project coordinates into distance to boundary ($Dist_i$)
\begin{align*} y_{i} = \beta T_{i} + \gamma Dist_i + \delta T_{i} Dist_i + \mu_s + \varepsilon_{i} \end{align*}
Then use standard RD design w/ boundary segment FE ($\mu_s$)
w/o boundary segment FE, you compare point $i$ to $k$ in diagram below
(Figure 3 of Keele & Titiunik 2014)
\begin{align*} y_{i} = \beta T_{i} + \boldsymbol{x}'_i\boldsymbol{\gamma} + T_{i}\boldsymbol{x}'_i\boldsymbol{\delta} + \mu_n + \varepsilon_{i} \end{align*}
Use coordinates ($\boldsymbol{x}_i$) for RD polynomials
\begin{align*} y_{i} = \beta T_{i} + \boldsymbol{x}'_i\boldsymbol{\gamma} + T_{i}\boldsymbol{x}'_i\boldsymbol{\delta} + \mu_n + \varepsilon_{i} \end{align*}
Pick (equally-spaced) points on boundary (denoted by $n$)
Assign each observation to its nearest boundary point, to define $\mu_n$
\begin{align*} y_{i} = \beta T_{i} + \boldsymbol{x}'_i\boldsymbol{\gamma} + T_{i}\boldsymbol{x}'_i\boldsymbol{\delta} + \mu_n + \varepsilon_{i} \end{align*}
Treatment effect at boundary point $n$: given by
\begin{align*} \beta + \boldsymbol{x}'_n\boldsymbol{\delta} \end{align*}
See Chapter 2 of Zajonc (2012) (commonly cited as Imbens and Zajonc 2011) for more detail
1. Attach treatment indicator to zone polygons
2. Treatment boundary polylines
3. Boundary segment indicator & Distance to boundary
Create a text file table of:
Table To Table to convert it into dBASE format
Join Field to merge the data by polygon ID
merge
Select to create two polygon shapefiles
Intersect with output type LINE
First, split boundary polyline into segments of equal length
!shape!.positionAlongLine(0.5,True).firstPoint.X
Then, use Near
Does forced labor system during Spanish colonial rule affect todya's living standards in Peru? If so, why?
Ntnl govt quality $\Rightarrow$ development?
Look at ethnic homelands split by national boarders in Africa
Find no difference in nighttime light across border on average
Large heterogeneity across ethnic groups, though
Does higher TV license fees increase evasion in Austria?
Exploit differences in fees across states
Focus on state borders where covariates are balanced
Cellphone coverage $\Rightarrow$ Electoral fraud $\downarrow$ in Afghanistan
1000+ polling stations w/i 5km from coverage boundary
$\Rightarrow$ Boundary RD approach: feasible
Geographic coordinate systems: not suitable
$\Rightarrow$ Use either:
Shrink latitude to compensate shorter unit of longitude towards Poles
Standard map projection for USA
Shrink longitude by $\cos(latitude)$
All give the (approximately) correct surface area
Difference is in how the projected map looks
Project
Add Field + Calculate Field with expression:
float(!SHAPE.AREA!)
Assign # of slaves from each ethnic group into different countries by surface area
Figure II of Nunn (2008)
Cell-by-cell calculation across multiple raster datasets
1. Arithmetic operations
raster*2
)
ras1 + ras2
)
2. Functions
What caused the formation of the state?
Focus on crop type (cereals vs roots/tubers)
Cassava, yam, taro, bananas...
Perishable upon harvest
Harvesting is non-seasonal
$\Rightarrow$ No incentive to steal / confiscate
Wheat, rice, maize...
Storable
Harvest within a short period of time
$\Rightarrow$ Incentive to steal / confiscate
Building a state incurs a fixed cost
Bandits have no incentive to build a state with roots & tubers
Historical examples
Resolution: 5 x 5 arc-minutes (about 10 X 10 km)
Data: potential yields based on climate & soil
$\Rightarrow$ Exogeneous to human activities
$\Rightarrow$ Widely used by economists
1. Convert yields into calorie for 15 crops
# Setting up
import arcpy
arcpy.CheckOutExtension("spatial")
# Input raster
maize_yield = arcpy.sa.Raster("maize.tif")
# Map algebra
maize_calorie = maize_yield * 36.5
# Save output
maize_calorie.save("calorie_cereal_maize.tif")
2. Obtain maximum for each crop type
# Maximum calorie by cereal crops
input_cereals = arcpy.ListRasters(
"calorie_cereal_*", "TIF")
max_cereal = CellStatistics(
input_cereals, "MAXIMUM", "DATA")
# Maximum calorie by tuber/root crops
input_tubers = arcpy.ListRasters(
"calorie_tuber_*", "TIF")
max_tuber = CellStatistics(
input_tubers, "MAXIMUM", "DATA")
2. Obtain maximum for each crop type
# Maximum calorie by cereal crops
input_cereals = arcpy.ListRasters(
"calorie_cereal_*", "TIF")
max_cereal = CellStatistics(
input_cereals, "MAXIMUM", "DATA")
# Maximum calorie by tuber/root crops
input_tubers = arcpy.ListRasters(
"calorie_tuber_*", "TIF")
max_tuber = CellStatistics(
input_tubers, "MAXIMUM", "DATA")
3. Take difference
# Cereal's caloric advantage over tubers
caloric_diff = max_cereal - max_tuber
# Save the output
caloric_diff.save("caloric_diff.tif")
Terrain ruggedness $\Rightarrow$ income per capita
Negative impact outside Africa
Positive impact in Africa
Originally proposed by Riley et al. (1999)
Defined as: \begin{align*} TRI_{xy} =& \sqrt{\sum_{i=x-1}^{x+1} \sum_{j=y-1}^{y+1} (e_{ij}-e_{xy})^2} \end{align*}
$e_{xy}$ | Elevation at longitude $x$ latitude $y$ |
Can be obtained by Focal Statistics + Map Algebra
Raster cell (30x30 arc-sec) level data is downloadable from Diego Puga's website
Calculates summary statistics in neighbouring raster cells
e.g. Sum of immediate neighbors
You can define the neighborhood very flexibly (more detail)
Rectangle
You can define the neighborhood very flexibly (more detail)
Circle
You can define the neighborhood very flexibly (more detail)
Annulus (ring)
You can define the neighborhood very flexibly (more detail)
Wedge
You can define the neighborhood very flexibly (more detail)
Irregular
You can define the neighborhood very flexibly (more detail)
Weight
Expand the expression inside the square root:
\begin{align*} TRI_{xy} =& \sqrt{\sum_i \sum_j (e_{ij}-e_{xy})^2} \\ =& \sqrt{\sum_i \sum_j (e_{ij})^2 - 2e_{xy}\sum_i\sum_j e_{ij} + 9 (e_{xy})^2} \end{align*}
\begin{align*} TRI_{xy} =& \sqrt{\sum_i \sum_j (e_{ij})^2 - 2e_{xy}\sum_i\sum_j e_{ij} + 9 (e_{xy})^2} \end{align*}
Map Algebra calculates $(e_{ij})^2$:
elev = Raster("srtm30.tif")
elev_sq = elev**2
Expand the expression inside the square root:
\begin{align*} TRI_{xy} =& \sqrt{\sum_i \sum_j (e_{ij})^2 - 2e_{xy}\sum_i\sum_j e_{ij} + 9 (e_{xy})^2} \end{align*}
Focal Statistics calculates $\sum_i \sum_j (e_{ij})^2$ and $\sum_i\sum_j e_{ij}$:
sum_elev_sq = FocalStatistics(elev_sq, "", "SUM", "")
sum_elev = FocalStatistics(elev, "", "SUM", "")
Expand the expression inside the square root:
\begin{align*} TRI_{xy} =& \sqrt{\sum_i \sum_j (e_{ij})^2 - 2e_{xy}\sum_i\sum_j e_{ij} + 9 (e_{xy})^2} \end{align*}
Map Algebra sums them up and takes square root:
TRI_square = sum_elev_sq - 2*elev*sum_elev + 9*elev_sq
TRI = SquareRoot(TRI_square)
TRI.save("ruggedness.tif")
State capacity in one municipality
$\Rightarrow$ State capacity & prosperity in neighboring municipalities
Divide surface by nearest point (aka Voronoi partition)
Alesina et al. (2016) use it as a robustness check to Murdock's ethnic homeland boundaries
Alsan (2015) (cf. Lec 2) uses it as an alternative to the 20-miles radius of Ethnographic Atlas society location
Estimate the externality of military efforts on allied partners' effort in DR Congo civil wars
Use rainfall shock as instruments
But how do we measure the location of each military actor?
Minimum Bounding Geometry tool with Convex Hull option