Chapter 8 Spatial Data Visualization and Analysis

8.1 Overview

Recall, a data structure is a format for organizing and storing data. The structure is designed so that data can be accessed and worked with in specific ways. Statistical software and programming languages have methods (or functions) designed to operate on different kinds of data structures.

This chapter focuses on spatial data structures and some of the R functions that work with these data. Spatial data comprise values associated with locations, such as temperature data at a given latitude, longitude, and perhaps elevation. Spatial data are typically organized into vector or raster data types. (See Figure 8.1).

  • Vector data represent features such as discrete points, lines, and polygons.
  • Raster data represent surfaces as a rectangular matrix of square cells or pixels.
Raster/Vector Comparison [@imageRaster]

FIGURE 8.1: Raster/Vector Comparison (Wegmann 2010)

Whether or not you use vector or raster data depends on the type of problem, the type of maps you need, and the data source. Each data structure has strengths and weaknesses in terms of functionality and representation. As you gain more experience working with spatial data, you will be able to determine which structure to use for a particular application.

There is a large set of R packages available for working with spatial (and space-time) data. These packages are described in the Cran Task View: Analysis of Spatial Data. The CRAN task view attempts to organize the various packages into categories, such as Handling spatial data, Reading and writing spatial data, Visualization, and Disease mapping and areal data analysis, so users can quickly identify package options given their project needs.

Exploring the extent of the excellent spatial data tools available in R is beyond the scope of this book. Rather, we would point you to subject texts like Applied Spatial Data Analysis with R by Bivand, Pebesma, and Gomez-Rubio (2013) (available for free via the MSU library system), and numerous online tutorials on pretty much any aspect of spatial data analysis with R. These tools make R a full-blown Geographic Information System (GIS) capable of spatial data manipulation and analysis on par with commercial GIS systems such as ESRI’s ArcGIS.

8.1.1 Some Spatial Data Packages

This chapter will focus on a few R packages for manipulating and visualizing spatial data. Specifically we will touch on the following packages

  • sp: spatial data structures and methods
  • rgdal: interface to the C/C++ spatial data Geospatial Data Abstraction Library
  • ggmap: extends ggplot2 language to handle spatial data
  • leaflet: generates dynamic online maps

8.2 Motivating Data

We motivate the topics introduced in this chapter using some forestry data from the Penobscot Experimental Forest (PEF) located in Maine. The PEF is a long-term experimental forest that is used to understand the effects of silviculture (i.e., science of tending trees) treatments on forest growth and composition. The PEF is divided into non-overlapping management units that receive different harvesting treatments. Within each management unit is a series of observation point locations (called forest inventory plots) where forest variables have been measured. Ultimately, we want to summarize the inventory plots measurements by management unit and map the results.

8.3 Reading Spatial Data into R

Spatial data come in a variety of file formats. Examples of popular vector file formats for points, lines, and polygons, include ESRI’s shapefile and open standard GeoJSON. Common raster file formats include GeoTIFF and netCDF43.

The rgdal function readOGR will read a large variety of vector data file formats (there is also a writeOGR() for writing vector data files). Raster data file formats can be read using the rgdal function readGDAL (yup, also a writeGDAL()) or read functions in the raster package. All of these functions automatically cast the data into an appropriate R spatial data object (i.e., data structure), which are defined in the sp or raster packages. Table 8.1 provides an abbreviated list of these R spatial objects44. The Without attributes column gives the sp package’s spatial data object classes for points, lines, polygons, and raster pixels that do not have data associated with the spatial objects (i.e., without attributes in GIS speak). DataFrame is appended to the object class name once data, in the form of variables, are added to the spatial object.

TABLE 8.1: An abbreviated list of sp and raster data objects and associated classes for the fundamental spatial data types
Without Attributes With Attributes
Polygons SpatialPolygons SpatialPolygonsDataFrame
Points SpatialPoints SpatialPointsDataFrame
Lines SpatialLines SpatialLinesDataFrame
Raster SpatialGrid SpatialGridDataFrame
Raster SpatialPixels SpatialPixelsDataFrame
Raster RasterLayer
Raster RasterBrick
Raster RasterStack

You can create your own spatial data objects in R. Below, for example, we create a SpatialPoints object consisting of four points. Then add some data to the points to make it a SpatialPointsDataFrame.

library(sp)
library(dplyr)

x <- c(3,2,5,6)
y <- c(2,5,6,7)

coords <- cbind(x, y)

sp.pts <- SpatialPoints(coords)

class(sp.pts)
## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"
some.data <- data.frame(var.1 = c("a", "b", "c", "d"), var.2 = 1:4)

sp.pts.df <- SpatialPointsDataFrame(sp.pts, some.data)

class(sp.pts.df)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

If, for example, you already have a data frame that includes the spatial coordinate columns and other variables, then you can promote it to a SpatialPointsDataFrame by indicating which columns contain point coordinates. You can extract or access the data frame associated with the spatial object using @data. You can also access individual variables directly from the spatial object using $ or by name or column number to the right of the comma in [,] (analogues to accessing variables in a data frame).

df <- data.frame(x = c(3,2,5,6), 
         y=c(2,5,6,7), 
         var.1 = c("a", "b", "c", "d"), 
         var.2 = 1:4)
class(df)
## [1] "data.frame"
#promote to a SpatialPointsDataFrame
coordinates(df) <- ~x+y

class(df)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
#access entire data frame
df@data
##   var.1 var.2
## 1     a     1
## 2     b     2
## 3     c     3
## 4     d     4
class(df@data)
## [1] "data.frame"
#access columns directly
df$var.1
## [1] "a" "b" "c" "d"
df[,c("var.1","var.2")]
##   coordinates var.1 var.2
## 1      (3, 2)     a     1
## 2      (2, 5)     b     2
## 3      (5, 6)     c     3
## 4      (6, 7)     d     4
df[,2]
##   coordinates var.2
## 1      (3, 2)     1
## 2      (2, 5)     2
## 3      (5, 6)     3
## 4      (6, 7)     4
#get the bounding box
bbox(df)
##   min max
## x   2   6
## y   2   7

Here, the data frame df is promoted to a SpatialPointsDataFrame by indicating the column names that hold the longitude and latitude (i.e., x and y respectively) using the coordinates function. Here too, the @data is used to retrieve the data frame associated with the points. We also illustrate how variables can be accessed directly from the spatial object. The bbox function is used to return the bounding box that is defined by the spatial extent of the point coordinates. The other spatial objects noted in Table 8.1 can be created, and their data accessed, in a similar way45.

More than often we find ourselves reading existing spatial data files into R. The code below uses the downloader package to download all of the PEF data we’ll use in this chapter. The data are compressed in a single zip file, which is then extracted into the working directory using the unzip function. A look into the PEF directory using list.files shows nine files46. Those named MU-bounds.* comprise the shapefile that holds the PEF’s management unit boundaries in the form of polygons. Like other spatial data file formats, shapefiles are made up of several different files (with different file extensions) that are linked together to form a spatial data object. The plots.csv file holds the spatial coordinates and other information about the PEF’s forest inventory plots. The roads.* shapefile holds roads and trails in and around the PEF.

library(downloader)

download("https://www.finley-lab.com/files/data/PEF.zip", 
         destfile="./PEF.zip", mode="wb") 

unzip("PEF.zip", exdir = ".")

list.files("PEF")
##  [1] "MU-bounds.dbf" "MU-bounds.prj" "MU-bounds.qpj"
##  [4] "MU-bounds.shp" "MU-bounds.shx" "plots.csv"    
##  [7] "plots.dbf"     "plots.prj"     "plots.qpj"    
## [10] "plots.shp"     "plots.shx"     "roads.dbf"    
## [13] "roads.prj"     "roads.shp"     "roads.shx"

Next we read the MU-bounds shapefile into R using readOGR()47 and explore the resulting mu object. Notice that when we read a shapefile into R, we do not include a file extension with the shapefile name because a shapefile is always composed of multiple files.

library(rgdal)
mu <- readOGR("PEF", "MU-bounds")
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/jeffdoser/Dropbox/teaching/for875/textbook/PEF", layer: "MU-bounds"
## with 40 features
## It has 1 fields

When called, the readOGR function provides a bit of information about the object being read in. Here, we see that it read the MU-bounds shapefile from PEF directory and the shapefile had 40 features (i.e., polygons) and 1 field (i.e., field is synonymous with column or variable in the data frame).

You can think of the resulting mu object as a data frame where each row corresponds to a polygon and each column holds information about the polygon48. More specifically, the mu object is a SpatialPolygonsDataFrame.

class(mu)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

As illustrated using the made-up point data in the example above, you can access the data frame associated with the polygons using @data.

class(mu@data)
## [1] "data.frame"
dim(mu@data)
## [1] 40  1
head(mu@data)
##   mu_id
## 0   C15
## 1   C17
## 2   C16
## 3   C27
## 4   U18
## 5   U31

Above, a call to class() confirms we have accessed the data frame, dim() shows there are 40 rows (one row for each polygon) and one column, and head() shows the first six values of the column named mu_id. The mu_id values are unique identifiers for each management unit polygon across the PEF.

8.4 Coordinate Reference Systems

One of the more challenging aspects of working with spatial data is getting used to the idea of a coordinate reference system. A coordinate reference system (CRS) is a system that uses one or more numbers, or coordinates, to uniquely determine the position of a point or other geometric element (e.g., line, polygon, raster). For spatial data, there are two common coordinate systems:

  1. Spherical coordinate system, such as latitude-longitude, often referred to as a geographic coordinate system.
  2. Projected coordinate system based on a map projection, which is a systematic transformation of the latitudes and longitudes that aims to minimize distortion occurring from projecting maps of the earth’s spherical surface onto a two-dimensional Cartesian coordinate plane. Projected coordinate systems are sometimes referred to as map projections.

There are numerous map projections49. One of the more frustrating parts of working with spatial data is that it seems like each data source you find offers its data in a different map projection and hence you spend a great deal of time reprojecting (i.e., transforming from one CRS to another) data into a common CRS such that they overlay correctly. Reprojecting is accomplished using the sp package’s spTransform function as demonstrated in Section 8.5.

In R, a spatial object’s CRS is accessed via the sp package proj4string function. The code below shows the current projection of mu.

proj4string(mu)
## [1] "+proj=utm +zone=19 +datum=NAD83 +units=m +no_defs"

The cryptic looking string returned by proj4string() is a set of directives understood by the proj.4 C library, which is part of sp, and used to map geographic longitude and latitude coordinates into the projected Cartesian coordinates. Note that if you see a warning message you can safely ignore it. This CRS tells us the mu object is in Universal Transverse Mercator (UTM) zone 19 coordinate system.50

8.5 Illustration using ggmap

Let’s begin by making a map of PEF management unit boundaries over top of a satellite image using the ggmap package. Given an address, location, or bounding box, the ggmap package’s get_map function will query Google Maps, OpenStreetMap, Stamen Maps, or Naver Map servers for a user-specified map type. The get_map function requires the location or bounding box coordinates be in a geographic coordinate system (i.e., latitude-longitude). This means we need to reproject mu from UTM zone 19 to latitude-longitude geographic coordinates, which is defined by the '"proj=longlat +datum=WGS84"' proj.4 string. As seen below, the first argument in spTransform function is the spatial object to reproject and the second argument is a CRS object created by passing a proj.4 string into the CRS function.

mu <- spTransform(mu, CRS("+proj=longlat +datum=WGS84"))
proj4string(mu)
## [1] "+proj=longlat +datum=WGS84 +no_defs"

Unfortunately, we cannot just feed the SpatialPolygonsDataFrame mu into ggplot (perhaps some day soon this will possible). Rather, we need to first convert the SpatialPolygonsDataFrame into a specially formatted data frame using the fortify function that is part of the ggplot2 package51. The fortify function will also need a unique identifier for each polygon specified using the region argument, which for mu is the mu_id.

library(ggmap)
mu.f <- fortify(mu, region="mu_id")
head(mu.f)
##     long   lat order  hole piece  id group
## 1 -68.62 44.86     1 FALSE     1 C12 C12.1
## 2 -68.62 44.86     2 FALSE     1 C12 C12.1
## 3 -68.62 44.86     3 FALSE     1 C12 C12.1
## 4 -68.62 44.86     4 FALSE     1 C12 C12.1
## 5 -68.62 44.86     5 FALSE     1 C12 C12.1
## 6 -68.62 44.86     6 FALSE     1 C12 C12.1

Notice the id column in the fortified version of mu holds each polygon’s mu_id value (this will be important later when we link data to the polygons).

Next, we query the satellite imagery used to underlay the management units (we’ll generally refer to this underlying map as the basemap). As of October 2018, Google requires you to set up a Google API account to download imagery. This is free, but it does require a credit card to obtain the API Key that is required to make the ggmap package work. Because of the potential security risk of sharing keys, we do not show the key necessary to run the following code. Instead, you will need to create your own API key to run all the code in this chapter. See the page here for learning about how to use Google maps web services.

register_google(key = "YOUR-KEY-HERE")
mu.bbox <- bbox(mu)
mu.centroid <- apply(mu.bbox, 1, mean)

basemap <- get_map(location=mu.centroid, zoom=14, maptype = "satellite", source = "google")

ggmap(basemap) +
  geom_polygon(data=mu.f, aes(x = long, y = lat, group=group),
                 fill=NA, size=0.2, color="orange")

While the call to get_map can use a bounding box (mu.bbox in code above) to define the basemap extent, I’ve found that it’s easier to pass it a center point location (mu.centroid note the use of the apply function to find the mean in the easting and northing extents) and zoom value to extract the desired extent. The resulting map looks pretty good! Take a look at the get_map function manual page and try different options for maptype (e.g., maptype="terrain").

Next we’ll add the forest inventory plots to the map. Begin by reading in the PEF forest inventory plot data held in “plots.csv”. Recall, foresters have measured forest variables at a set of locations (i.e., inventory plots) within each management unit. The following statement reads these data and displays the resulting data frame structure.

plots <- read.csv("PEF/plots.csv", stringsAsFactors=FALSE)
str(plots)
## 'data.frame':    451 obs. of  8 variables:
##  $ mu_id           : chr  "U10" "U10" "U10" "U10" ...
##  $ plot            : int  11 13 21 22 23 24 31 32 33 34 ...
##  $ easting         : num  529699 529777 529774 529814 529850 ...
##  $ northing        : num  4966333 4966471 4966265 4966336 4966402 ...
##  $ biomass_mg_ha   : num  96.3 115.7 121.6 72 122.3 ...
##  $ stems_ha        : int  5453 2629 3385 7742 7980 10047 5039 5831 2505 7325 ...
##  $ diameter_cm     : num  4.8 6.9 6.1 3.1 4.7 1.6 4.1 5.2 5.7 3.3 ...
##  $ basal_area_m2_ha: num  22 23.2 23 16.1 29.2 19.1 14.1 27.4 21.6 15 ...

In plots each row is a forest inventory plot and columns are:

  • mu_id identifies the management unit within which the plot is located
  • plot unique plot number within the management unit
  • easting longitude coordinate in UTM zone 19
  • northing latitude coordinate in UTM zone 19
  • biomass_mg_ha tree biomass in metric ton (per hectare basis)
  • stocking_stems_ha number of tree (per hectare basis)
  • diameter_cm average tree diameter measured 130 cm up the tree trunk
  • basal_area_m2_ha total cross-sectional area at 130 cm up the tree trunk (per hectare basis)

There is nothing inherently spatial about this data structure—it is simply a data frame. We make plots into a spatial object by identifying which columns hold the coordinates. This is done below using the coordinates function, which promotes the plots data frame to a SpatialPointsDataFrame.

coordinates(plots) <- ~easting+northing
class(plots)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

Although plots is now a SpatialPointsDataFrame, it does not know to which CRS the coordinates belong; hence, the NA when proj4string(plots) is called below. As noted in the plots file description above, easting and northing are in UTM zone 19. This CRS is set using the second call to proj4string below.

proj4string(plots)
## [1] NA
proj4string(plots) <- CRS("+proj=utm +zone=19 +datum=NAD83 +units=m 
                             +no_defs +ellps=GRS80 +towgs84=0,0,0")                   

Now let’s reproject plots to share a common CRS with mu

plots <- spTransform(plots, CRS("+proj=longlat +datum=WGS84"))

Note, because mu is already in the projection we want for plots, we could have replaced the second argument in the spTransform call above with proj4string(mu) and saved some typing.

We’re now ready to add the forest inventory plots to the existing basemap with management units. Specifically, let’s map the biomass_mg_ha variable to show changes in biomass across the forest. No need to fortify plots, ggplot is happy to take geom_point’s data argument as a data frame (although we do need to convert plots from a SpatialPointsDataFrame to a data frame using the as.data.frame function). Check out the scale_color_gradient function in your favorite ggplot2 reference to understand how the color scale is set.

ggmap(basemap) +
    geom_polygon(data=mu.f, aes(x = long, y = lat, group=group), 
                 fill=NA, size=0.2, color="orange") +
    geom_point(data=as.data.frame(plots), 
               aes(x = easting, y = northing, color=biomass_mg_ha)) + 
    scale_color_gradient(low="white", high="darkblue") +
    labs(color = "Biomass (mg/ha)")

There is something subtle and easy to miss in the code above. Notice the aes function arguments in geom_points take geographic longitude and latitude, x and y respectively, from the points data frame (but recall easting and northing were in UTM zone 19). This works because we applied spTransform to reproject the points SpatialPointsDataFrame to geographic coordinates. sp then replaces the values in easting and northing columns with the reprojected coordinate values when converting a SpatialPointsDataFrame to a data frame via as.data.frame().

Foresters use the inventory plot measurements to estimate forest variables within management units, e.g., the average or total management unit biomass. Next we’ll make a plot with management unit polygons colored by average biomass_mg_ha.

mu.bio <- plots@data %>% group_by(mu_id) %>% 
    summarize(biomass_mu = mean(biomass_mg_ha))
print(mu.bio)
## # A tibble: 33 × 2
##    mu_id biomass_mu
##    <chr>      <dbl>
##  1 C12        124. 
##  2 C15         49.9
##  3 C16        128. 
##  4 C17        112. 
##  5 C20        121. 
##  6 C21        134. 
##  7 C22         65.2
##  8 C23A       108. 
##  9 C23B       153. 
## 10 C24        126. 
## # … with 23 more rows

Recall from Section 6.7.5 this one-liner can be read as “get the data frame from plots’s SpatialPointsDataFrame then group by management unit then make a new variable called biomass_mu that is the average of biomass_mg_ha and assign it to the mu.bio tibble.”

The management unit specific biomass_mu can now be joined to the mu polygons using the common mu_id value. Remember when we created the fortified version of mu called mu.f? The fortify function region argument was mu_id which is the id variable in the resulting mu.f. This id variable in mu.f can be linked to the mu_id variable in mu.bio using dplyr’s left_join function as illustrated below.

head(mu.f, n=2)
##     long   lat order  hole piece  id group
## 1 -68.62 44.86     1 FALSE     1 C12 C12.1
## 2 -68.62 44.86     2 FALSE     1 C12 C12.1
mu.f <- left_join(mu.f, mu.bio, by = c('id' = 'mu_id'))

head(mu.f, n=2)
##     long   lat order  hole piece  id group biomass_mu
## 1 -68.62 44.86     1 FALSE     1 C12 C12.1      123.7
## 2 -68.62 44.86     2 FALSE     1 C12 C12.1      123.7

The calls to head() show the first few rows of mu.f pre- and post-join. After the join, mu.f includes biomass_mu, which is used used below for geom_polygon’s fill argument to color the polygons accordingly.

ggmap(basemap) +
    geom_polygon(data=mu.f, aes(x = long, y = lat, 
                                group=group, fill=biomass_mu), 
                 size=0.2, color="orange") +
    scale_fill_gradient(low="white", high="darkblue", 
                        na.value="transparent") +
    labs(fill="Biomass (mg/ha)")

Let’s add the roads and some more descriptive labels as a finishing touch. The roads data include a variable called type that identifies the road type. To color roads by type in the map, we need to join the roads data frame with the fortified roads roads.f using the common variable id as a road segment specific identifier. Then geom_path’s color argument gets this type variable as a factor to create road-specific color. The default coloring of the roads blends in too much with the polygon colors, so we manually set the road colors using the scale_color_brewer function. The palette argument in this function accepts a set of key words, e.g., "Dark2", that specify sets of diverging colors chosen to make map object difference optimally distinct (see, the manual page for scale_color_brewer, http://colorbrewer2.org, and blog here.)52

roads <- readOGR("PEF", "roads")
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/jeffdoser/Dropbox/teaching/for875/textbook/PEF", layer: "roads"
## with 33 features
## It has 2 fields
roads <- spTransform(roads, proj4string(mu))
## Warning in proj4string(mu): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
roads.f <- fortify(roads, region="id")
roads.f <- left_join(roads.f, roads@data, by = c('id' = 'id'))

ggmap(basemap) +
    geom_polygon(data=mu.f, aes(x = long, y = lat, group=group, 
                                fill=biomass_mu), 
                 size=0.2, color="orange") +
    geom_path(data=roads.f, aes(x = long, y = lat, 
                                group=group, color=factor(type))) +
    scale_fill_gradient(low="white", high="darkblue", 
                        na.value="transparent") +
    scale_color_brewer(palette="Dark2") +
    labs(fill="Biomass (mg/ha)", color="Road type", xlab="Longitude", 
         ylab="Latitude", title="PEF forest biomass")
## Warning: Removed 616 row(s) containing missing values
## (geom_path).

The second, and more cryptic, of the two warnings from this code occurs because some of the roads extend beyond the range of the map axes and are removed (nothing to worry about).

8.6 Illustration using leaflet

Leaflet is one of the most popular open-source JavaScript libraries for interactive maps. As noted on the official R leaflet website, it’s used by websites ranging from The New York Times and The Washington Post to GitHub and Flickr, as well as by GIS specialists like OpenStreetMap, Mapbox, and CartoDB.

The R leaflet website is an excellent resource to learn leaflet basics, and should serve as a reference to gain a better understanding of the topics we briefly explore below.

You create a leaflet map using these basic steps:

  1. Create a map by calling leaflet()
  2. Add data layers to the map using layer functions such as, addTiles(), addMarkers(), addPolygons(), addCircleMarkers(), addPolylines(), addRasterImage() and other add... functions
  3. Repeat step 2 to add more layers to the map
  4. Print the map to display it

Here’s a brief example.

library(leaflet)

m <- leaflet() %>%
      addTiles() %>%  # Add default OpenStreetMap map tiles
      addMarkers(lng=-84.482004, lat=42.727516, 
                 popup="<b>Here I am!</b>") # Add a clickable marker
m      # Print the map

There are a couple things to note in the code. First, we use the pipe operator %>% just like in dplyr functions. Second, the popup argument in addMarkers() takes standard HTML and clicking on the marker makes the text popup. Third, the html version of this text provides the full interactive, dynamic map, so we encourage you to read and interact with the html version of this textbook for this section. The PDF document will simply display a static version of this map and will not do justice to how awesome leaflet truly is!

As seen in the leaflet() call above, the various add... functions can take longitude (i.e., lng) and latitude (i.e., lat). Alternatively, these functions can extract the necessary spatial information from sp objects, e.g., Table 8.1, when passed to the data argument (which greatly simplifies life compared with map making using ggmap).

8.7 Subsetting Spatial Data

You can imagine that we might want to subset spatial objects to map specific points, lines, or polygons that meet some criteria, or perhaps extract values from polygons or raster surfaces at a set of points or geographic extent. These, and similar types, of operations are easy in R (as long as all spatial objects are in a common CRS). Recall from Chapter 4 how handy it is to subset data structures, e.g., vectors and data frames, using the [] operator and logical vectors? Well it’s just as easy to subset spatial objects, thanks to the authors of sp, raster, and other spatial data packages.

8.7.1 Fetching and Cropping Data using raster

In order to motivate our exploration of spatial data subsetting and to illustrate some useful functionality of the raster package, let’s download some elevation data for the PEF. The raster package has a rich set of functions for manipulating raster data as well as functions for downloading data from open source repositories. We’ll focus on the package’s getData function, which, given a location in geographic longitude and latitude or location name, will download data from GADM, Shuttle Radar Topography Mission, Global Climate Data, and other sources commonly used in spatial data applications.

Let’s download SRTM surface elevation data for the PEF, check the resulting object’s class and CRS, and display it using the raster package’s image function along with the PEF forest inventory plots.

library(raster)

pef.centroid <- as.data.frame(plots) %>% 
    summarize(mu.x = mean(easting), mu.y = mean(northing))

srtm <- getData("SRTM", lon = pef.centroid[, 1], lat = pef.centroid[, 2])
class(srtm)
## [1] "RasterLayer"
## attr(,"package")
## [1] "raster"
proj4string(srtm)
## [1] "+proj=longlat +datum=WGS84 +no_defs"
image(srtm)
plot(plots, add = TRUE)

A few things to notice in the code above. First the getData function needs the longitude lon and latitude lat to identify which SRTM raster tile to return (SRTM data come in large raster tiles that cover the globe). As usual, look at the getData function documentation for a description of the arguments. To estimate the PEF’s centroid coordinate, we averaged the forest inventory plots’ latitude and longitude then assigned the result to pef.centroid (for fun, I used the apply function to do the same task when creating mu.centroid earlier in the chapter). Second, srtm is in a longitude latitude geographic CRS (same as our other PEF data). Finally, the image shows SRTM elevation along the coast of Maine, the PEF plots are those tiny specks of black in the northwest quadrant, and the white region of the image is the Atlantic Ocean.

Okay, this is a start, but it would be good to crop the SRTM image to the PEF’s extent. This is done using raster’s crop function. This function can use many different kinds of spatial objects in the second argument to calculate the extent at which to crop the object in the first argument. Here, I set mu as the second argument and save the resulting SRTM subset over the larger tile (the srtm object).

srtm <- crop(srtm, mu)

image(srtm)
plot(mu, add = TRUE)

The crop is in effect doing a spatial subsetting of the raster data. We’ll return to the srtm data and explore different kinds of subsetting in the subsequent sections.

8.7.2 Logical, Index, and Name Subsetting

As promised, we can subset spatial objects using the [] operator and a logical, index, or name vector. The key is that sp objects behave like data frames, see Section 4.4. A logical or index vector to the left of the comma in [,] accesses points, lines, polygons, or pixels. Similarly, a logical, index, or name vector to the right of the comma accesses variables.

For example, say we want to map forest inventory plots with more than 10,000 stems per hectare, stems_ha (the min() was added below to double check that the subset worked correctly.

min(plots$stems_ha)
## [1] 119
plots.10k <- plots[plots$stems_ha > 10000,]

min(plots.10k$stems_ha)     
## [1] 10008

You can also add new variables to the spatial objects.

plots$diameter_in <- plots$diameter_cm/2.54

head(plots)
##   mu_id plot biomass_mg_ha stems_ha diameter_cm
## 1   U10   11         96.35     5453         4.8
## 2   U10   13        115.70     2629         6.9
## 3   U10   21        121.58     3385         6.1
## 4   U10   22         71.97     7742         3.1
## 5   U10   23        122.26     7980         4.7
## 6   U10   24         85.85    10047         1.6
##   basal_area_m2_ha diameter_in
## 1             22.0      1.8898
## 2             23.2      2.7165
## 3             23.0      2.4016
## 4             16.1      1.2205
## 5             29.2      1.8504
## 6             19.1      0.6299

8.7.3 Spatial Subsetting and Overlay

A spatial overlay retrieves the indexes or variables from object \(A\) using the location of object \(B\). With some spatial objects this operation can be done using the [] operator. For example, say we want to select and map all management units in mu, i.e., \(A\), that contain plots with more than 10,000 stems per ha, i.e., \(B\).

mu.10k <- mu[plots.10k,]## A[B,]

mu.10k.f <- fortify(mu.10k, region="mu_id")

ggmap(basemap) +
    geom_polygon(data=mu.10k.f, aes(x = long, y = lat, group=group), 
         fill="transparent", size=0.2, color="orange") +
    geom_point(data=as.data.frame(plots.10k), 
           aes(x = easting, y = northing), color="white")

More generally, however, the over function offers consistent overlay operations for sp objects and can return either indexes or variables from object \(A\) given locations from object \(B\), i.e., over(B, A) or, equivalently, B%over%A. The code below duplicates the result from the preceding example using over.

mu.10k <- mu[mu$mu_id %in% unique(over(plots.10k, mu)$mu_id),]

Yes, this requires more code but over provides a more flexible and general purpose function for overlays on the variety of sp objects. Let’s unpack this one-liner into its five steps.

i <- over(plots.10k, mu)
ii <- i$mu_id
iii <- unique(ii)
iv <- mu$mu_id %in% iii
v <- mu[iv,]
  1. The over function returns variables for mu’s polygons that coincide with the nrow(plots.10k@data) points in plots.10k. No points fall outside the polygons and the polygons do not overlap, so i should be a data frame with nrow(plots.10k@data) rows. If polygons did overlap and a point fell within the overlap region, then variables for the coinciding polygons are returned.
  2. Select the unique mu identifier mu_id (this step is actually not necessary here because mu only has one variable).
  3. Because some management units contain multiple plots there will be repeat values of mu_id in ii, so apply the unique function to get rid of duplicates.
  4. Use the %in% operator to create a logical vector that identifies which polygons should be in the final map.
  5. Subset mu using the logical vector created in iv.

Now let’s do something similar using the srtm elevation raster. Say we want to map elevation along trails, winter roads, and gravel roads across the PEF. We could subset srtm using the roads SpatialLinesDataFrame; however, mapping the resulting pixel values along the road segments using ggmap requires a bit more massaging. So, to simplify things for this example, roads is first coerced into a SpatialPointsDataFrame called roads.pts that is used to extract spatially coinciding srtm pixel values which themselves are coerced from raster’s RasterLayer to sp’s SpatialPixelsDataFrame called srtm.sp so that we can use the over function. We also choose a different basemap just for fun.

hikes <- roads[roads$type %in% c("Trail", "Winter", "Gravel"),]
 
hikes.pts <- as(hikes, "SpatialPointsDataFrame")
srtm.sp <- as(srtm, "SpatialPixelsDataFrame")

hikes.pts$srtm <- over(hikes.pts, srtm.sp)

basemap <- get_map(location=mu.centroid, zoom=14, 
           maptype = "terrain", source = "google")

color.vals <- srtm@data@values[1:length(hikes.pts)]

ggmap(basemap) +
    geom_point(data=as.data.frame(hikes.pts),
               aes(x = coords.x1, y = coords.x2, color = color.vals)) +
    scale_color_gradient(low="green", high="red") +
    labs(color = "Hiking trail elevation\n(m above sea level)",
         xlab="Longitude", ylab="Latitude")
## Warning: Removed 483 rows containing missing values
## (geom_point).

In the call to geom_point above, coords.x1 coords.x2 are the default names given to longitude and latitude, respectively, when sp coerces hikes to hikes.pts. These points represent the vertices along line segments. I create the vector color.vals that contains the values from srtm that I use in the map argument color.

Overlay operations involving lines and polygons over polygons require the rgeos package which provides an interface to the Geometry Engine - Open Source (GEOS) C++ library for topology operations on geometries. We’ll leave it to you to explore these capabilities.

8.7.4 Spatial Aggregation

We have seen aggregation operations before when using dplyr’s summarize function. The summarize function is particularly powerful when combined with group_by(), which can apply a function specified in summarize() to a variable partitioned using a grouping variable. The aggregate function in sp works in a similar manner, except groups are delineated by the spatial extent of a thematic variable. In fact, the work we did to create mu.bio using dplyr functions can be accomplished with aggregate(). Using aggregate() will, however, require a slightly different approach for joining the derived average biomass_mg_ha to the fortified mu. This is because the aggregate function will apply the user specified function to all variables in the input object, which, in our case, results in an NA for the linking variable mu_id as demonstrated below.

mu.ag <- aggregate(plots[,c("mu_id","biomass_mg_ha")], by=mu, FUN=mean)

head(mu.ag@data, n=2)
##   mu_id biomass_mg_ha
## 0  <NA>         49.86
## 1  <NA>        112.17

With mu_id rendered useless, we do not have a variable that uniquely identifies each polygon for use in fortify’s region argument; hence no way to subsequently join the unfortified and fortified versions of mu.bio.ag. Here’s the work around. If the region is not specified, fortify() uses an internal unique polygon ID that is part of the sp data object and accessed via row.names()53 So, the trick is to add this unique polygon ID to the aggregate() output prior to calling fortify() as demonstrated below.

mu.ag$id <- row.names(mu.ag)

mu.ag.f <- fortify(mu.ag)
## Regions defined for each Polygons
mu.ag.f <- left_join(mu.ag.f, mu.ag@data)
## Joining, by = "id"
ggmap(basemap) +
    geom_polygon(data=mu.ag.f, aes(x = long, y = lat, 
                                group=group, fill=biomass_mg_ha), 
                 size=0.2, color="orange") +
    scale_fill_gradient(low="white", high="darkblue", 
                        na.value="transparent") +
    labs(fill="Biomass (mg/ha)")

The aggregate() function will work with all sp objects. For example let’s map the variance of pixel values in srtm.sp by management unit. Notice that aggregate() is happy to take a user-specified function for FUN.

mu.srtm <- aggregate(srtm.sp, by=mu,
                     FUN=function(x){sqrt(var(x))})

mu.srtm$id <- row.names(mu.srtm)

mu.srtm.f <- fortify(mu.srtm)
## Regions defined for each Polygons
mu.srtm.f <- left_join(mu.srtm.f, mu.srtm@data)
## Joining, by = "id"
ggmap(basemap) +
    geom_polygon(data=mu.srtm.f, aes(x = long, y = lat, group=group, 
                                     fill=srtm_23_04), 
                 size=0.2, color="orange") +
    scale_fill_gradient(low="green", high="red") +
        labs(fill = "Elevation standard deviation\n(m above sea level)", 
         xlab="Longitude", ylab="Latitude")

8.8 Where to go from here

This chapter just scratches the surface of R’s spatial data manipulation and visualization capabilities. The basic ideas we presented here should allow you to take a deeper look into sp, rgdal, rgeos, ggmap, leaflet, and a myriad of other excellent user-contributed R spatial data packages. A good place to start is with Edzer Pebesma’s excellent vignette on the use of the map overlay and spatial aggregation, available here, as well as Applied Spatial Data Analysis with R by .

8.9 Exercises

Exercise Spatial Data Learning objectives: practice loading and reprojecting spatial data; analyze spatial data; create leaflet maps to convey analysis results; interpret analysis results.

References

Bivand, Roger S., Edzer Pebesma, and Virgilio Gomez-Rubio. 2013. Applied Spatial Data Analysis with R, Second Edition. Springer, NY.

Wegmann. 2010. “Raster Vector Tikz.png.” 2010. https://commons.wikimedia.org/wiki/File:Raster_vector_tikz.png.


  1. A longer list of spatial data file formats is available at https://en.wikipedia.org/wiki/GIS_file_formats.↩︎

  2. A more complete list of the sp package’s spatial data classes and methods is detailed in the package’s vignette https://cran.r-project.org/web/packages/sp/vignettes/intro_sp.pdf.↩︎

  3. This cheatsheet http://www.maths.lancs.ac.uk/~rowlings/Teaching/UseR2012/cheatsheet.html written by Barry Rowlingson is a useful reference http://www.maths.lancs.ac.uk/~rowlings/Teaching/UseR2012/cheatsheet.html↩︎

  4. The list.files function does not read data into R; it simply prints the contents of a directory.↩︎

  5. The authors of the rgdal library decided to have some information about the version of GDAL and other software specifics be printed when the library is loaded. Don’t let it distract you.↩︎

  6. Much of the actual spatial information is hidden from you in other parts of the data structure, but is available if you ask nicely for it (see subsequent sections).↩︎

  7. See partial list of map projections at https://en.wikipedia.org/wiki/List\_of\_map\_projections. See a humorous discussion of map projections at http://brilliantmaps.com/xkcd/.↩︎

  8. If you start dealing with a lot of spatial data and reprojecting, http://spatialreference.org is an excellent resources for finding and specifying coordinate reference systems.↩︎

  9. ggmap depends on ggplot2 so ggplot2 will be automatically loaded when you call library(ggmap).↩︎

  10. Install the RColorBrewer package and run library(RColorBrewer); display.brewer.all() to get a graphical list of available palettes.↩︎

  11. With other data, there is a chance the row names differ from the unique polygon IDs. Therefore a more reliable approach to getting a unique ID is to use sapply(slot(mu.ag, 'polygons'), function(x) slot(x, 'ID')), but replace mu.ag with your SpatialPolygonsDataFrame. Also, this approach will work with other sp objects in the right column of Table 8.1.↩︎