Chapter 6 Working with Data

Importing, exporting, cleaning, reshaping, and wrangling data, along with other data manipulation tasks, are an important and often overlooked component of data science. The book Spector (2008), while a few years old, is still an excellent reference for data-related issues, and is available for free at http://catalog.lib.msu.edu/record=b7254984~S39a. In this chapter we will present a variety of tools for common data analysis tasks you will encounter in R.

6.1 Data import

Data come in a dizzying variety of formats. It might be a proprietary format used by commercial software such as Excel, SPSS, or SAS. It might be structured using a relational model, for example, the USDA Forest Service Forest Inventory and Analysis database (Burrill et al. 2018). It might be a data-interchange format such as JSON (JavaScript Object Notation; Pezoa et al. (2016)), or a markup language format such as XML (Extensible Markup Language; Bray et al. (2008)), perhaps with specialized standards for describing ecological information such as EML (Ecological Metadata Language; Jones et al. (2019)). The good news is we’ve yet to encounter a data format unreadable by R. This is thanks to the large community of R package developers who write import functions for the data formats specific to their respective disciplines. For example, the foreign and haven packages provide functions for reading and writing data in common proprietary statistical software formats. Similarly, as we’ll see in Chapter 8, the rgdal and raster packages read and write pretty much any spatial data format used in popular GIS software.

Fortunately many datasets are (or can be) saved as plain text files. Most software can read and write such files, so our initial focus is on reading and writing plain text files.

The function read.table and its offshoots, such as read.csv, are used to read 2-dimensional data, i.e., rows and columns, from a plain text file and are the core functions for reading external datasets into R. For example, the file BrainAndBody.csv contains data on the brain weight, body weight, and name of some terrestrial animals. Here are the first few lines of that file:

body,brain,name
1.35,8.1,Mountain beaver
465,423,Cow
36.33,119.5,Grey wolf
27.66,115,Goat
1.04,5.5,Guinea pig

As is evident, the first line of the file contains the names of the three variables, separated (delimited) by commas. Each subsequent line (row) contains the body weight, brain weight, and name of a specific terrestrial animal.

This file is accessible at the URL https://www.finley-lab.com/files/data/BrainAndBody.csv. The read.table function is used to read these data into an R data frame. A data frame is a common rectangular (i.e., two-dimensional) data structure in R that consists of rows and columns. We will cover data frames extensively in Chapter 4.

link.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.csv"
BrainBody <- read.table(file = link.bb, header = TRUE, sep = ",", 
                        stringsAsFactors = FALSE)
str(BrainBody)
## 'data.frame':    28 obs. of  3 variables:
##  $ body : num  1.35 465 36.33 27.66 1.04 ...
##  $ brain: num  8.1 423 119.5 115 5.5 ...
##  $ name : chr  "Mountain beaver" "Cow" "Grey wolf" "Goat" ...
head(BrainBody)
##       body brain            name
## 1     1.35   8.1 Mountain beaver
## 2   465.00 423.0             Cow
## 3    36.33 119.5       Grey wolf
## 4    27.66 115.0            Goat
## 5     1.04   5.5      Guinea pig
## 6 11700.00  50.0     Dipliodocus

The read.table arguments and associated values used above include:

  1. file = link.bb identifies the file location. In this case the string https://www.finley-lab.com/files/data/BrainAndBody.csv giving the location is rather long, so it was first assigned to the object link.bb.
  2. header = TRUE specifies the column names are held in the file’s first row, known as the “header”.
  3. sep = "," defines the comma as the character delimiter between the file’s columns.
  4. stringsAsFactors = FALSE tells R not to convert character vectors to factors. While we have yet to discuss the different types of vectors in R, prior to R 4.0.0 stringsAsFactors = TRUE was the default behavior, and so we will often explicitly include this argument in our call to read.table to avoid any differences across R versions. The new default value is FALSE, so we could have chosen not to include this argument.

When we call read.table or read.csv, the resulting object is a data frame. The str and head functions are useful for examining the structure of a data frame (or any R object). The str function displays the structure of the object BrainBody that contains the data from the BrainAndBody.csv file. Notice the object is a data frame with 28 observations (i.e., rows) and 3 variables (i.e., columns). The column names are displayed following the $. The head function displays the first six rows and all columns of the data frame. An alternative way to view a data frame is to use the View function, which opens a spreadsheet interface in R. Try this out by running View(BrainBody) in your console.

The function read.csv is the same as read.table except the default separator is a comma, whereas the default separator for read.table is whitespace.
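To see the equivalence concretely, here is a small sketch using the inline text argument of read.table and read.csv (so no external file is needed; the data are a two-row excerpt for illustration):

```r
# Inline data via the 'text' argument avoids needing an external file
csv.text <- "body,brain,name\n1.35,8.1,Mountain beaver\n465,423,Cow"

# read.csv defaults to sep = "," and header = TRUE
d1 <- read.csv(text = csv.text, stringsAsFactors = FALSE)
# read.table needs sep (and header) specified explicitly, since its
# default separator is whitespace
d2 <- read.table(text = csv.text, header = TRUE, sep = ",",
                 stringsAsFactors = FALSE)
identical(d1, d2)  # TRUE
```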

The file BrainAndBody.tsv contains the same data, except a tab is used in place of a comma to separate fields. The only change needed to read in the data in this file is in the sep argument (and of course the file argument, since the data are stored in a different file):

file.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.tsv"
BrainBody2 <- read.table(file = file.bb, header = TRUE, sep = "\t", 
                         stringsAsFactors = FALSE)
head(BrainBody2)
##       body brain            name
## 1     1.35   8.1 Mountain beaver
## 2   465.00 423.0             Cow
## 3    36.33 119.5       Grey wolf
## 4    27.66 115.0            Goat
## 5     1.04   5.5      Guinea pig
## 6 11700.00  50.0     Dipliodocus

File extensions, e.g., .csv or .tsv, are naming conventions only and are there to remind us how the columns are separated. In other words, they have no influence on R’s file read functions.

A third file, BrainAndBody.txt, contains the same data, but also contains a few lines of explanatory text above the names of the variables. It also uses whitespace rather than a comma or a tab as a separator. Here are the first several lines of the file.

This file contains data
on brain and body
weights of several terrestrial animals

"body" "brain" "name"
1.35 8.1 "Mountain beaver"
465 423 "Cow"
36.33 119.5 "Grey wolf"
27.66 115 "Goat"
1.04 5.5 "Guinea pig"
11700 50 "Dipliodocus"
2547 4603 "Asian elephant"

Notice that in this file the values of name are put inside quotation marks. This is necessary since otherwise R would (reasonably) assume the first line of data contained the values of four variables, the values being 1.35, 8.1, Mountain, and beaver, while in reality there are only three values desired, with Mountain beaver being the third.
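A small sketch (again using inline text rather than an external file) shows how quoting keeps a multi-word value together in a whitespace-delimited file:

```r
# Without the quotes, whitespace would split "Mountain beaver" into two
# fields; read.table's default quote characters ("\"'") keep it as one
txt <- '"body" "brain" "name"\n1.35 8.1 "Mountain beaver"'
d <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
ncol(d)  # 3
d$name   # "Mountain beaver"
```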

To read in this file we need to tell R to skip the first five lines and to use whitespace as the separator. The skip argument handles the first, and the sep argument the second. First let’s see what happens if we don’t use the skip argument.

file.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.txt"
BrainBody3 <- read.table(file.bb, header = TRUE, sep = " ",
                         stringsAsFactors = FALSE)
## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 5 elements

R assumed the first line of the file contained the variable names, since header = TRUE was specified, and counted four including This, file, contains, and data. So in the first line of actual data, R expected four columns containing data plus possibly a fifth column containing row names for the data set, and complained that “line 1 did not have 5 elements.” The error message is somewhat mysterious, since it starts with “Error in scan.” This happens because read.table actually uses a more basic R function called scan to do the work.
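When a read fails like this, the count.fields function is a handy diagnostic; it reports how many whitespace-delimited fields R finds on each line. Here is a sketch using a few inline lines standing in for the top of the file:

```r
lines <- c("This file contains data",
           "on brain and body",
           '"body" "brain" "name"',
           '1.35 8.1 "Mountain beaver"')
# Lines of explanatory text have a different field count than the data
# lines, which makes the offending rows easy to spot
count.fields(textConnection(lines))  # 4 4 3 3
```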

Here’s how to read in the file correctly.

file.bb <- "https://www.finley-lab.com/files/data/BrainAndBody.txt"
BrainBody3 <- read.table(file.bb, header = TRUE, sep = " ",
                         stringsAsFactors = FALSE, skip = 4)
head(BrainBody3)
##       body brain            name
## 1     1.35   8.1 Mountain beaver
## 2   465.00 423.0             Cow
## 3    36.33 119.5       Grey wolf
## 4    27.66 115.0            Goat
## 5     1.04   5.5      Guinea pig
## 6 11700.00  50.0     Dipliodocus

6.2 Importing data with missing observations

Missing data are represented in many ways. Sometimes a missing data point is just that, i.e., the place where it should be in the file is blank. Other times specific numbers such as \(-9999\) or specific symbols are used. The read.table() function has an argument na.strings that allows the user to specify how missing data are indicated in the source file.
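For instance, here is a minimal sketch (using inline text rather than an external file, with made-up site and depth values) where \(-9999\) codes a missing observation:

```r
# -9999 is a common missing-data code in environmental data files
txt <- "site,depth\nA,12.4\nB,-9999\nC,8.7"
d <- read.csv(text = txt, na.strings = "-9999")
d$depth              # 12.4   NA  8.7
sum(is.na(d$depth))  # 1
```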

The site http://www.wunderground.com/history/ makes weather data available for locations around the world from dates going back to 1945. The file WeatherKLAN2014.csv contains weather data for Lansing, Michigan for the year 2014. Here are the first few lines of that file:

EST,Max TemperatureF,Min TemperatureF, Events
1/1/14,14,9,Snow
1/2/14,13,-3,Snow
1/3/14,13,-11,Snow
1/4/14,31,13,Snow
1/5/14,29,16,Fog-Snow
1/6/14,16,-12,Fog-Snow
1/7/14,2,-13,Snow
1/8/14,17,-1,Snow
1/9/14,21,2,Snow
1/10/14,39,21,Fog-Rain-Snow
1/11/14,41,32,Fog-Rain
1/12/14,39,31,

Look at the last line, and notice that instead of an Event such as Snow or Fog-Snow there is nothing after the comma. This observation is missing, but rather than using an explicit code such as NA, the site just leaves that entry blank. To read these data into R we will supply the argument na.strings = "" which tells R the file indicates missing data by leaving the appropriate entry blank. We then print a subset of the resulting data frame to make sure the offending missing values were replaced with NA.

u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014.csv"
WeatherKLAN2014 <- read.csv(u.weather, header = TRUE,
                            stringsAsFactors = FALSE, na.strings = "")
head(WeatherKLAN2014, n = 15)
##        EST Max.TemperatureF Min.TemperatureF
## 1   1/1/14               14                9
## 2   1/2/14               13               -3
## 3   1/3/14               13              -11
## 4   1/4/14               31               13
## 5   1/5/14               29               16
## 6   1/6/14               16              -12
## 7   1/7/14                2              -13
## 8   1/8/14               17               -1
## 9   1/9/14               21                2
## 10 1/10/14               39               21
## 11 1/11/14               41               32
## 12 1/12/14               39               31
## 13 1/13/14               44               34
## 14 1/14/14               37               26
## 15 1/15/14               27               18
##           Events
## 1           Snow
## 2           Snow
## 3           Snow
## 4           Snow
## 5       Fog-Snow
## 6       Fog-Snow
## 7           Snow
## 8           Snow
## 9           Snow
## 10 Fog-Rain-Snow
## 11      Fog-Rain
## 12          <NA>
## 13          Rain
## 14     Rain-Snow
## 15          Snow

Notice the use of the argument n = 15 in the head function, which displays the first 15 rows of the WeatherKLAN2014 data frame.

6.3 Data export

Data stored in external text files are rarely in the format required for data analysis and summarization. We may need to create new variables in the data set, transform variables to different scales, rearrange the data, or manipulate the data to fix incorrect observations. The process of identifying and ultimately fixing incorrect observations in a data set is called data cleaning, while data wrangling is the process of obtaining and transforming data to answer a specific data analysis question. R has a vast suite of tools for data cleaning and data wrangling that we will cover extensively in Section 6.7. After cleaning and wrangling the data in R, we may want to create a new external text file that contains the data in the desired data analysis format for use in a separate project, submission to an agency, or archiving on a data repository. The R function write.table allows us to take a data set in R and save it as an external file.

To illustrate write.table, let’s create a new CSV file in our current working directory named WeatherKLAN2014New.csv from the WeatherKLAN2014 object we previously created. We do this with the following code.

write.table(x = WeatherKLAN2014, file = 'WeatherKLAN2014New.csv', sep = ',',
        row.names = FALSE, na = 'NA')

Let’s walk through each of the arguments and associated values used in the previous code chunk:

  • x = WeatherKLAN2014 is the R object we want to write to an external file, in this case the WeatherKLAN2014 data set.
  • file = 'WeatherKLAN2014New.csv' is the name of the file in which we want to store the data specified in argument x. The file will be created in our current working directory, since this is the directory R uses to resolve relative paths.
  • sep = ',' defines the comma as the character delimiter between the file’s columns, just as we saw with read.table.
  • row.names = FALSE tells R not to create a column in the external data set that contains the row names of the associated R object x. This argument is by default row.names = TRUE. We do not recommend storing data in the row names of R objects, and so we usually set row.names = FALSE.
  • na = 'NA' tells R to use the string NA to represent missing data in the resulting text file.

Similar changes can be made to write.table to create tab-separated files, plain text files, or other simple text files by changing the sep argument and the file extension in the file argument.
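As a sketch of those variations, the following writes a small made-up data frame as a tab-separated file in a temporary location (tempfile() is used here simply to avoid cluttering the working directory):

```r
# A small data frame with one missing value
d <- data.frame(x = 1:3, y = c("a", "b", NA))
out <- tempfile(fileext = ".tsv")
# sep = "\t" gives a tab-separated file; row.names = FALSE as before
write.table(d, file = out, sep = "\t", row.names = FALSE, na = "NA")
# Inspect the raw file: character values and column names are quoted
# by default, while the NA code is written unquoted
readLines(out)
```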

6.4 Summarizing Data Frames

Some common data tasks include variable summaries such as means or standard deviations, transforming an existing variable, and creating new variables. As with many tasks, there are several ways to accomplish each of these.

6.4.1 Column (and Row) Summaries

The file WeatherKLAN2014Full.csv contains a more complete set of weather data variables than WeatherKLAN2014.csv, from the same source, http://www.wunderground.com/history.

u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014Full.csv"
WeatherKLAN2014Full <- read.csv(u.weather, header = TRUE, 
                                stringsAsFactors = FALSE, 
                                na.strings = "")
names(WeatherKLAN2014Full)
##  [1] "EST"                      
##  [2] "Max.TemperatureF"         
##  [3] "Mean.TemperatureF"        
##  [4] "Min.TemperatureF"         
##  [5] "Max.Dew.PointF"           
##  [6] "MeanDew.PointF"           
##  [7] "Min.DewpointF"            
##  [8] "Max.Humidity"             
##  [9] "Mean.Humidity"            
## [10] "Min.Humidity"             
## [11] "Max.Sea.Level.PressureIn" 
## [12] "Mean.Sea.Level.PressureIn"
## [13] "Min.Sea.Level.PressureIn" 
## [14] "Max.VisibilityMiles"      
## [15] "Mean.VisibilityMiles"     
## [16] "Min.VisibilityMiles"      
## [17] "Max.Wind.SpeedMPH"        
## [18] "Mean.Wind.SpeedMPH"       
## [19] "Max.Gust.SpeedMPH"        
## [20] "PrecipitationIn"          
## [21] "CloudCover"               
## [22] "Events"                   
## [23] "WindDirDegrees"

How can we compute the mean for each variable? One possibility is to do this a variable at a time:

mean(WeatherKLAN2014Full$Mean.TemperatureF)
## [1] 45.78
mean(WeatherKLAN2014Full$Min.TemperatureF)
## [1] 36.25
mean(WeatherKLAN2014Full$Max.TemperatureF)
## [1] 54.84
##Et Cetera

This is pretty inefficient. Fortunately there is a colMeans() function which computes the mean of each column (or a specified number of columns) in a data frame. Some columns in the current data frame are not numeric, and obviously we don’t want to ask R to compute means for these columns. We use str() to investigate.

str(WeatherKLAN2014Full)
## 'data.frame':    365 obs. of  23 variables:
##  $ EST                      : chr  "2014-1-1" "2014-1-2" "2014-1-3" "2014-1-4" ...
##  $ Max.TemperatureF         : int  14 13 13 31 29 16 2 17 21 39 ...
##  $ Mean.TemperatureF        : int  12 5 1 22 23 2 -5 8 12 30 ...
##  $ Min.TemperatureF         : int  9 -3 -11 13 16 -12 -13 -1 2 21 ...
##  $ Max.Dew.PointF           : int  9 7 2 27 27 11 -6 7 18 37 ...
##  $ MeanDew.PointF           : int  4 4 -5 18 21 -4 -13 1 8 28 ...
##  $ Min.DewpointF            : int  0 -8 -14 3 11 -18 -18 -6 0 19 ...
##  $ Max.Humidity             : int  88 76 83 92 92 80 78 88 88 100 ...
##  $ Mean.Humidity            : int  76 70 68 73 86 73 72 78 75 92 ...
##  $ Min.Humidity             : int  63 63 53 53 80 65 65 67 62 84 ...
##  $ Max.Sea.Level.PressureIn : num  30.4 30.4 30.5 30.1 30 ...
##  $ Mean.Sea.Level.PressureIn: num  30.3 30.2 30.4 30 29.9 ...
##  $ Min.Sea.Level.PressureIn : num  30.2 30.1 30.1 29.9 29.7 ...
##  $ Max.VisibilityMiles      : int  10 9 10 10 4 10 10 10 10 9 ...
##  $ Mean.VisibilityMiles     : int  4 4 10 6 1 2 6 10 7 3 ...
##  $ Min.VisibilityMiles      : int  1 0 5 1 0 0 1 8 2 0 ...
##  $ Max.Wind.SpeedMPH        : int  17 22 23 28 22 31 25 15 14 17 ...
##  $ Mean.Wind.SpeedMPH       : int  9 13 10 15 11 18 15 7 6 10 ...
##  $ Max.Gust.SpeedMPH        : int  22 30 32 36 30 40 31 18 17 22 ...
##  $ PrecipitationIn          : chr  "0.08" "0.01" "0.00" "0.12" ...
##  $ CloudCover               : int  8 7 1 5 8 8 6 5 7 8 ...
##  $ Events                   : chr  "Snow" "Snow" "Snow" "Snow" ...
##  $ WindDirDegrees           : int  43 24 205 203 9 262 220 236 147 160 ...

It isn’t surprising that EST and Events are not numeric, but it is surprising that PrecipitationIn, which measures precipitation in inches, is not numeric but character. Let’s investigate further.

WeatherKLAN2014Full$PrecipitationIn[1:50]
##  [1] "0.08" "0.01" "0.00" "0.12" "0.78" "0.07" "T"   
##  [8] "T"    "0.01" "0.39" "0.16" "0.00" "0.00" "0.01"
## [15] "T"    "0.08" "T"    "T"    "T"    "0.01" "0.00"
## [22] "0.05" "T"    "T"    "0.07" "0.23" "0.04" "T"   
## [29] "T"    "0.03" "T"    "0.37" "T"    "0.00" "T"   
## [36] "0.27" "0.01" "T"    "0.04" "0.03" "T"    "0.00"
## [43] "0.00" "T"    "T"    "0.00" "0.02" "0.15" "0.08"
## [50] "0.01"

Now it’s clear. The original data file included T in the precipitation column to represent a “trace” of precipitation, which is precipitation greater than \(0\) but less than \(0.01\) inches. One possibility would be to set all these values to "0", and then convert the column to numeric. For now we will just leave the PrecipitationIn column out of the columns for which we request the mean.
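For reference, the recoding approach can be sketched on a small stand-in vector (a few made-up values in the same format as PrecipitationIn):

```r
# Recode trace amounts ("T") as zero, then convert to numeric
precip <- c("0.08", "T", "0.00", "T", "0.12")
precip[precip == "T"] <- "0"
precip <- as.numeric(precip)
precip  # 0.08 0.00 0.00 0.00 0.12
```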

colMeans(WeatherKLAN2014Full[,c(2:19, 21, 23)])
##          Max.TemperatureF         Mean.TemperatureF 
##                    54.838                    45.781 
##          Min.TemperatureF            Max.Dew.PointF 
##                    36.255                    41.800 
##            MeanDew.PointF             Min.DewpointF 
##                    36.395                    30.156 
##              Max.Humidity             Mean.Humidity 
##                    88.082                    70.392 
##              Min.Humidity  Max.Sea.Level.PressureIn 
##                    52.200                    30.130 
## Mean.Sea.Level.PressureIn  Min.Sea.Level.PressureIn 
##                    30.015                    29.904 
##       Max.VisibilityMiles      Mean.VisibilityMiles 
##                     9.896                     8.249 
##       Min.VisibilityMiles         Max.Wind.SpeedMPH 
##                     4.825                    19.101 
##        Mean.Wind.SpeedMPH         Max.Gust.SpeedMPH 
##                     8.679                        NA 
##                CloudCover            WindDirDegrees 
##                     4.367                   205.000

6.4.2 The apply() Function

R also has functions rowMeans(), colSums(), and rowSums(). But what if we want to compute the median or standard deviation of columns of data, or some other summary statistic? For this the apply() function can be used. This function applies a user-chosen function to either the rows or columns (or both) of a data frame. The arguments are:

  1. X: the data frame of interest
  2. MARGIN: specifying either rows (MARGIN = 1) or columns (MARGIN = 2)
  3. FUN: the function to be applied.

apply(X = WeatherKLAN2014Full[,c(2:19, 21, 23)], MARGIN = 2, FUN = sd)
##          Max.TemperatureF         Mean.TemperatureF 
##                   22.2130                   20.9729 
##          Min.TemperatureF            Max.Dew.PointF 
##                   20.2597                   19.5167 
##            MeanDew.PointF             Min.DewpointF 
##                   20.0311                   20.8511 
##              Max.Humidity             Mean.Humidity 
##                    8.1910                    9.3660 
##              Min.Humidity  Max.Sea.Level.PressureIn 
##                   13.9462                    0.2032 
## Mean.Sea.Level.PressureIn  Min.Sea.Level.PressureIn 
##                    0.2159                    0.2360 
##       Max.VisibilityMiles      Mean.VisibilityMiles 
##                    0.5790                    2.1059 
##       Min.VisibilityMiles         Max.Wind.SpeedMPH 
##                    3.8168                    6.4831 
##        Mean.Wind.SpeedMPH         Max.Gust.SpeedMPH 
##                    3.8863                        NA 
##                CloudCover            WindDirDegrees 
##                    2.7798                   90.0673

As with any R function the arguments don’t need to be named as long as they are specified in the correct order, so

apply(WeatherKLAN2014Full[,c(2:19, 21, 23)], 2, sd)

has the same result.
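To make the MARGIN argument concrete, here is a self-contained sketch on a small matrix, applying sum over rows and then over columns:

```r
# A 2 x 3 matrix filled column by column: rows (1,3,5) and (2,4,6)
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)  # MARGIN = 1: row sums, 9 12
apply(m, 2, sum)  # MARGIN = 2: column sums, 3 7 11
```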

6.4.3 Practice Problems

  1. Notice the output value of NA for the column Max.Gust.SpeedMPH. Why does this happen? Figure out a way to make the apply() function return a numeric value for this column.
  2. The apply() family of functions is extremely important in R, so it gets two Practice Problems :) Use the apply() function to compute the median values for all numeric columns in the iris data set.

6.4.4 Saving Typing Using with()

Consider calculating the mean of the maximum temperature values for those days where the cloud cover is less than 4 and when the maximum humidity is over 85. We can do this using subsetting.

mean(WeatherKLAN2014Full$Max.TemperatureF[
                             WeatherKLAN2014Full$CloudCover < 4 & 
                             WeatherKLAN2014Full$Max.Humidity > 85])
## [1] 69.39

While this works, it requires a lot of typing, since each time we refer to a variable in the data set we need to preface its name by WeatherKLAN2014Full$. The with() function tells R that we are working with a particular data frame, and we don’t need to keep typing the name of the data frame.

with(WeatherKLAN2014Full, 
     mean(Max.TemperatureF[CloudCover < 4 & Max.Humidity > 85]))
## [1] 69.39

6.5 Transforming a Data Frame

Variables are often added to, removed from, changed in, or rearranged in a data frame. The subsetting features of R make this reasonably easy. We will investigate this in the context of the gapminder data frame. If the gapminder library is not yet installed, use install.packages("gapminder") to install it locally.

library(gapminder)
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

6.5.1 Adding Variables

The data frame contains per capita GDP and population, and it might be interesting to create a variable that gives the total GDP by multiplying these two variables. (If we were interested in an accurate value for the total GDP we would probably be better off getting this information directly, since it is likely that the per capita GDP values in the data frame are rounded substantially.)

gapminder$TotalGDP <- gapminder$gdpPercap * gapminder$pop
str(gapminder)
## tibble [1,704 × 7] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
##  $ TotalGDP : num [1:1704] 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...

Analogous to the with() function, there is a function within() which can simplify the syntax. Whereas with() does not change the data frame, within() can. Note, below we first remove the altered gapminder data frame using rm() and then bring a clean copy back in by reloading the gapminder package.

rm(gapminder)
library(gapminder)
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
gapminder <- within(gapminder, TotalGDP <- gdpPercap * pop)
str(gapminder)
## tibble [1,704 × 7] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
##  $ TotalGDP : num [1:1704] 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...

A nice feature of within() is its ability to add more than one variable at a time to a data frame. In this case the two or more expressions creating new variables must be enclosed in braces.

gapminder <- within(gapminder, {TotalGDP <- gdpPercap * pop
    lifeExpMonths <- lifeExp * 12})
str(gapminder)
## tibble [1,704 × 8] (S3: tbl_df/tbl/data.frame)
##  $ country      : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent    : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year         : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp      : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop          : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap    : num [1:1704] 779 821 853 836 740 ...
##  $ TotalGDP     : num [1:1704] 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
##  $ lifeExpMonths: num [1:1704] 346 364 384 408 433 ...

6.5.2 Removing Variables

After reflection we may realize the new variables we added to the gapminder data frame are not useful, and should be removed.

str(gapminder)
## tibble [1,704 × 8] (S3: tbl_df/tbl/data.frame)
##  $ country      : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent    : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year         : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp      : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop          : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap    : num [1:1704] 779 821 853 836 740 ...
##  $ TotalGDP     : num [1:1704] 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
##  $ lifeExpMonths: num [1:1704] 346 364 384 408 433 ...
gapminder <- gapminder[1:6]
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

The same result could be obtained via gapminder <- gapminder[, 1:6]. The first method uses the fact that a data frame is also a list, and uses list subsetting methods. It is slightly preferable, since even if only one variable is retained, the object will still be a data frame, while the other method can return a vector in this case. Note this difference in the resulting x variable below (again this behavior can be frustrating at times if it is not anticipated).

a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"), 
                z = seq(from = 1, to = 2, length = 3))
a
##   x   y   z
## 1 1 dog 1.0
## 2 2 cat 1.5
## 3 3 pig 2.0
a <- a[1]
a 
##   x
## 1 1
## 2 2
## 3 3
a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"), 
                z = seq(from = 1, to = 2, length = 3))
a 
##   x   y   z
## 1 1 dog 1.0
## 2 2 cat 1.5
## 3 3 pig 2.0
a <- a[,1]
a
## [1] 1 2 3
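If a single-column data frame is what’s wanted from matrix-style subsetting, the drop argument (sketched below on a small made-up data frame) prevents the conversion to a vector:

```r
a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"))
# drop = FALSE keeps the result a data frame even with one column
b <- a[, 1, drop = FALSE]
class(b)       # "data.frame"
class(a[, 1])  # "integer" -- the default drop = TRUE returns a vector
```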

One can also use a negative sign in front of the variable number(s). For example, a[-(2:3)] would drop the last two columns of a. Some care is needed when removing variables using the negative sign.

An alternative approach is to set the variables you’d like to remove to NULL. For example, a[c("y","z")] <- NULL and a[,2:3] <- NULL produce the same result as above.
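A minimal sketch of the NULL approach, using the same kind of small data frame as above:

```r
a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"),
                z = seq(from = 1, to = 2, length = 3))
a$z <- NULL        # drop a single column by name
a[c("y")] <- NULL  # drop one or more columns via list subsetting
names(a)  # "x"
```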

6.5.3 Practice Problem

What happens if you write a[-2:3] instead of a[-(2:3)]? Why are the parentheses important here?

6.5.4 Transforming Variables

Consider the gapminder data again. Possibly we don’t want to add a new variable that gives life expectancy in months, but rather want to modify the existing variable to measure life expectancy in months. Here are two ways to accomplish this.

rm(gapminder)
library(gapminder)
gapminder$lifeExp[1:5]
## [1] 28.80 30.33 32.00 34.02 36.09
gapminder$lifeExp <- gapminder$lifeExp * 12
gapminder$lifeExp[1:5]
## [1] 345.6 364.0 384.0 408.2 433.1
rm(gapminder)
library(gapminder)
gapminder$lifeExp[1:5]
## [1] 28.80 30.33 32.00 34.02 36.09
gapminder <- within(gapminder, lifeExp <- lifeExp * 12)
gapminder$lifeExp[1:5]
## [1] 345.6 364.0 384.0 408.2 433.1

6.5.5 Rearranging Variables

Consider the full weather data set again.

u.weather <- "https://www.finley-lab.com/files/data/WeatherKLAN2014Full.csv"
WeatherKLAN2014Full <- read.csv(u.weather, header = TRUE, 
                                stringsAsFactors = FALSE,
                                na.strings = "")
names(WeatherKLAN2014Full)
##  [1] "EST"                      
##  [2] "Max.TemperatureF"         
##  [3] "Mean.TemperatureF"        
##  [4] "Min.TemperatureF"         
##  [5] "Max.Dew.PointF"           
##  [6] "MeanDew.PointF"           
##  [7] "Min.DewpointF"            
##  [8] "Max.Humidity"             
##  [9] "Mean.Humidity"            
## [10] "Min.Humidity"             
## [11] "Max.Sea.Level.PressureIn" 
## [12] "Mean.Sea.Level.PressureIn"
## [13] "Min.Sea.Level.PressureIn" 
## [14] "Max.VisibilityMiles"      
## [15] "Mean.VisibilityMiles"     
## [16] "Min.VisibilityMiles"      
## [17] "Max.Wind.SpeedMPH"        
## [18] "Mean.Wind.SpeedMPH"       
## [19] "Max.Gust.SpeedMPH"        
## [20] "PrecipitationIn"          
## [21] "CloudCover"               
## [22] "Events"                   
## [23] "WindDirDegrees"

If we want the wind speed variables to come right after the date, we can again use subsetting.

WeatherKLAN2014Full <- WeatherKLAN2014Full[c(1,17, 18, 19, 2:16, 20:23)]
names(WeatherKLAN2014Full)
##  [1] "EST"                      
##  [2] "Max.Wind.SpeedMPH"        
##  [3] "Mean.Wind.SpeedMPH"       
##  [4] "Max.Gust.SpeedMPH"        
##  [5] "Max.TemperatureF"         
##  [6] "Mean.TemperatureF"        
##  [7] "Min.TemperatureF"         
##  [8] "Max.Dew.PointF"           
##  [9] "MeanDew.PointF"           
## [10] "Min.DewpointF"            
## [11] "Max.Humidity"             
## [12] "Mean.Humidity"            
## [13] "Min.Humidity"             
## [14] "Max.Sea.Level.PressureIn" 
## [15] "Mean.Sea.Level.PressureIn"
## [16] "Min.Sea.Level.PressureIn" 
## [17] "Max.VisibilityMiles"      
## [18] "Mean.VisibilityMiles"     
## [19] "Min.VisibilityMiles"      
## [20] "PrecipitationIn"          
## [21] "CloudCover"               
## [22] "Events"                   
## [23] "WindDirDegrees"
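Hard-coding positions like c(1, 17, 18, 19, 2:16, 20:23) is fragile: it silently produces the wrong order if the file's columns ever change. A sketch of a name-based alternative, shown here on a small stand-in data frame (only the column names matter; the values are hypothetical):

```r
# stand-in for the weather data; a subset of the real column names
df <- data.frame(EST = 1, Max.TemperatureF = 2, Max.Wind.SpeedMPH = 3,
                 Mean.Wind.SpeedMPH = 4, Events = 5)

# columns to move directly after the date column
wind <- c("Max.Wind.SpeedMPH", "Mean.Wind.SpeedMPH")

# the remaining columns, in their original order
rest <- setdiff(names(df), c("EST", wind))

df <- df[c("EST", wind, rest)]
names(df)
```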

6.6 Reshaping Data

A data set can be represented in several different formats. Consider a (fictitious) data set on incomes of three people during three different years. Here is one representation of the data:

yearlyIncomeWide
##           name income1990 income2000 income2010
## 1   John Smith      29784      39210      41213
## 2     Jane Doe      56789      89321     109321
## 3 Albert Jones       2341      34567      56781

Here is another representation of the same data:

yearlyIncomeLong
##           name       year income
## 1   John Smith income1990  29784
## 2     Jane Doe income1990  56789
## 3 Albert Jones income1990   2341
## 4   John Smith income2000  39210
## 5     Jane Doe income2000  89321
## 6 Albert Jones income2000  34567
## 7   John Smith income2010  41213
## 8     Jane Doe income2010 109321
## 9 Albert Jones income2010  56781

For hopefully obvious reasons, the first representation is called a wide representation of the data, and the second is called a long representation. Each has its merits. The first representation is probably easier for people to read, while the second is often the form needed for analysis by statistical software such as R. There are of course other representations. For example the rows and columns could be interchanged to create a different wide representation, or the long representation, which currently groups data by year, could group by name instead.
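The yearlyIncome tables above are easy to reconstruct, and base R's reshape() can already move between the two forms; the code below is only an illustration of the idea (tidyr, introduced next, is usually more convenient):

```r
# reconstruct the wide representation shown above
yearlyIncomeWide <- data.frame(
  name = c("John Smith", "Jane Doe", "Albert Jones"),
  income1990 = c(29784, 56789, 2341),
  income2000 = c(39210, 89321, 34567),
  income2010 = c(41213, 109321, 56781))

# wide -> long with base R's reshape()
yearlyIncomeLong <- reshape(yearlyIncomeWide, direction = "long",
                            varying = 2:4, v.names = "income",
                            times = names(yearlyIncomeWide)[2:4],
                            timevar = "year", idvar = "name")

head(yearlyIncomeLong[, c("name", "year", "income")], 3)
```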

Whatever the relative merits of wide and long representations of data, transforming data from wide to long or long to wide is often required. As with many tasks, there are several ways to accomplish this in R. We will focus on a library called tidyr written by Hadley Wickham that performs the transformations and more.

6.6.1 tidyr

The R library tidyr has functions for converting data between formats. To illustrate its use, we examine a simple data set that explores the relationship between religion and income in the United States. The data come from a Pew survey, and are used in the tidyr documentation to illustrate transforming data from wide to long format.

u.rel <- "https://www.finley-lab.com/files/data/religion2.csv"
religion <- read.csv(u.rel, header=TRUE, stringsAsFactors = FALSE)
head(religion)
##             religion under10k btw10and20k btw20and30k
## 1           Agnostic       27          34          60
## 2            Atheist       12          27          37
## 3           Buddhist       27          21          30
## 4           Catholic      418         617         732
## 5 DoNotKnowOrRefused       15          14          15
## 6    EvangelicalProt      575         869        1064
##   btw30and40k btw40and50k btw50and75k btw75and100k
## 1          81          76         137          122
## 2          52          35          70           73
## 3          34          33          58           62
## 4         670         638        1116          949
## 5          11          10          35           21
## 6         982         881        1486          949
##   btw100and150k over150k DoNotKnowOrRefused
## 1           109       84                 96
## 2            59       74                 76
## 3            39       53                 54
## 4           792      633               1489
## 5            17       18                116
## 6           723      414               1529

As given, the columns include religion and income level, and there are counts for each of the combinations of religion and income level. For example, there are 27 people who are Agnostic and whose income is less than 10 thousand dollars, and there are 617 people who are Catholic and whose income is between 10 and 20 thousand dollars.

The pivot_longer() function can transform data from wide to long format.

library(tidyr)
religionLong <- pivot_longer(data = religion, cols = 2:11, 
                 names_to = 'IncomeLevel', values_to = 'Frequency')
head(religionLong)
## # A tibble: 6 × 3
##   religion IncomeLevel Frequency
##   <chr>    <chr>           <int>
## 1 Agnostic under10k           27
## 2 Agnostic btw10and20k        34
## 3 Agnostic btw20and30k        60
## 4 Agnostic btw30and40k        81
## 5 Agnostic btw40and50k        76
## 6 Agnostic btw50and75k       137
tail(religionLong)
## # A tibble: 6 × 3
##   religion     IncomeLevel        Frequency
##   <chr>        <chr>                  <int>
## 1 Unaffiliated btw40and50k              341
## 2 Unaffiliated btw50and75k              528
## 3 Unaffiliated btw75and100k             407
## 4 Unaffiliated btw100and150k            321
## 5 Unaffiliated over150k                 258
## 6 Unaffiliated DoNotKnowOrRefused       597

To use pivot_longer() we specified the data frame (data = religion), the columns we want to pivot into longer format (cols = 2:11), the name we want to give the column created from the income levels (names_to = 'IncomeLevel'), and the name we want to give to the column containing the frequency values (values_to = 'Frequency').

Columns to be pivoted into longer format can also be specified by name, and a negative sign in front of a name indicates a column to omit from the pivot. So the following creates an equivalent data frame:

religionLong <- pivot_longer(data = religion, cols = -religion, 
                     names_to = 'IncomeLevel', values_to = 'Frequency') 
head(religionLong)
## # A tibble: 6 × 3
##   religion IncomeLevel Frequency
##   <chr>    <chr>           <int>
## 1 Agnostic under10k           27
## 2 Agnostic btw10and20k        34
## 3 Agnostic btw20and30k        60
## 4 Agnostic btw30and40k        81
## 5 Agnostic btw40and50k        76
## 6 Agnostic btw50and75k       137

We now use the pivot_wider() function to widen the religionLong data set back into the original wide format.

religionWide <- pivot_wider(data = religionLong, 
                names_from = IncomeLevel, 
                values_from = Frequency)
head(religionWide)
## # A tibble: 6 × 11
##   religion under10k btw10and20k btw20and30k btw30and40k
##   <chr>       <int>       <int>       <int>       <int>
## 1 Agnostic       27          34          60          81
## 2 Atheist        12          27          37          52
## 3 Buddhist       27          21          30          34
## 4 Catholic      418         617         732         670
## 5 DoNotKn…       15          14          15          11
## 6 Evangel…      575         869        1064         982
## # … with 6 more variables: btw40and50k <int>,
## #   btw50and75k <int>, btw75and100k <int>,
## #   btw100and150k <int>, over150k <int>,
## #   DoNotKnowOrRefused <int>

Here we specify the data frame (data = religionLong), the column whose values become the new column names (names_from = IncomeLevel), and the column supplying the cell values (values_from = Frequency). As can be seen, this call to pivot_wider() recovers the original wide data frame.
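A quick way to convince yourself that pivot_longer() and pivot_wider() undo each other is a round trip on a small table; the data below are hypothetical, constructed only for the check:

```r
library(tidyr)

# a tiny hypothetical wide table
wide <- data.frame(religion = c("A", "B"),
                   under10k = c(27, 12),
                   over150k = c(84, 74))

long <- pivot_longer(wide, cols = -religion,
                     names_to = "IncomeLevel", values_to = "Frequency")

wide2 <- pivot_wider(long, names_from = IncomeLevel,
                     values_from = Frequency)

# same names and values as the original (wide2 is a tibble)
all.equal(as.data.frame(wide2), wide)
## [1] TRUE
```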

tidyr provides two other useful functions that separate and unite variables based on a delimiter. Consider again the yearlyIncomeLong table. Say we want to split the name variable into first and last name. This can be done using the separate() function.

firstLast <- separate(data = yearlyIncomeLong, col = name, 
                      into = c("first", "last"), sep="\\s")
print(firstLast)
##    first  last       year income
## 1   John Smith income1990  29784
## 2   Jane   Doe income1990  56789
## 3 Albert Jones income1990   2341
## 4   John Smith income2000  39210
## 5   Jane   Doe income2000  89321
## 6 Albert Jones income2000  34567
## 7   John Smith income2010  41213
## 8   Jane   Doe income2010 109321
## 9 Albert Jones income2010  56781

Now say you want to combine the first and last names again, but this time separated by an underscore. This is done using the unite() function.

unite(firstLast, col=name, first, last, sep="_")
##           name       year income
## 1   John_Smith income1990  29784
## 2     Jane_Doe income1990  56789
## 3 Albert_Jones income1990   2341
## 4   John_Smith income2000  39210
## 5     Jane_Doe income2000  89321
## 6 Albert_Jones income2000  34567
## 7   John_Smith income2010  41213
## 8     Jane_Doe income2010 109321
## 9 Albert_Jones income2010  56781

6.6.2 Practice Problem

Consider the data.birds data frame produced below.

birds <- c(10, 38, 29, 88, 42, 177, 200)
recordingInfo <- c("LA-2017-01-01", "DF-2011-03-02", "OG-2078-05-11", "YA-2000-11-18", 
                   "LA-2019-03-17", "OG-2016-10-10", "YA-2001-03-22")
data.birds <- data.frame(birds, recordingInfo)
data.birds
##   birds recordingInfo
## 1    10 LA-2017-01-01
## 2    38 DF-2011-03-02
## 3    29 OG-2078-05-11
## 4    88 YA-2000-11-18
## 5    42 LA-2019-03-17
## 6   177 OG-2016-10-10
## 7   200 YA-2001-03-22

The separate() function from the tidyr library is especially useful when working with real data, where multiple pieces of information are often combined into a single column.

The column recordingInfo contains the site where the data were collected, as well as the year, month, and day the data were recorded. The data are coded as follows: site-year-month-day. Write a line of code that extracts these data from the data.birds data frame into separate columns named site, year, month, and day.

6.7 Manipulating Data with dplyr

Much of the effort (a figure of 80% is sometimes suggested) in data analysis is spent cleaning the data and getting it ready for analysis. Having effective tools for this task can save substantial time and effort. The R package dplyr written by Hadley Wickham is designed, in Hadley’s words, to be “a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.” Casting data analysis tasks in terms of “grammar” should be familiar from our work with the ggplot2 package, which was also authored by Hadley. Functions provided by dplyr do in fact capture key data analysis actions (i.e., verbs). These functions include

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names
  • filter() picks cases based on their values
  • summarize() reduces multiple values down to a single summary
  • arrange() changes the ordering of the rows.

These all combine naturally with a group_by() function that allows you to perform any operation grouped by values of one or more variables. All the tasks done using dplyr can be accomplished using tools already covered in this text; however, dplyr’s functions provide a potentially more efficient and convenient framework to accomplish these tasks. RStudio provides a convenient data wrangling cheat sheet that covers many aspects of the tidyr and dplyr packages.

This somewhat long section on dplyr adapts the nice introduction by Jenny Bryan, available at https://stat545.com/dplyr-intro.html.

6.7.1 Improved Data Frames

The dplyr package provides a couple of functions that offer improvements on data frames. The first, tibble(), creates a tibble from a series of vectors39. A tibble has two advantages over a data frame. First, when printing, a tibble shows only the first ten rows and the columns that fit on the page, along with additional information about the table’s dimensions, the data types of its variables, and any non-printed columns. Second, recall that subsetting a data frame can sometimes return a vector rather than a data frame (if only one row or column results from the subset); a tibble does not have this behavior. The second function, as_tibble(), converts an existing data frame or list into a tibble. Here is an example using the religionWide data frame.40

library(dplyr)
head(religionWide)
## # A tibble: 6 × 11
##   religion under10k btw10and20k btw20and30k btw30and40k
##   <chr>       <int>       <int>       <int>       <int>
## 1 Agnostic       27          34          60          81
## 2 Atheist        12          27          37          52
## 3 Buddhist       27          21          30          34
## 4 Catholic      418         617         732         670
## 5 DoNotKn…       15          14          15          11
## 6 Evangel…      575         869        1064         982
## # … with 6 more variables: btw40and50k <int>,
## #   btw50and75k <int>, btw75and100k <int>,
## #   btw100and150k <int>, over150k <int>,
## #   DoNotKnowOrRefused <int>
religionWide[,1]
## # A tibble: 18 × 1
##    religion             
##    <chr>                
##  1 Agnostic             
##  2 Atheist              
##  3 Buddhist             
##  4 Catholic             
##  5 DoNotKnowOrRefused   
##  6 EvangelicalProt      
##  7 Hindu                
##  8 HistoricallyBlackProt
##  9 JehovahsWitness      
## 10 Jewish               
## 11 MainlineProt         
## 12 Mormon               
## 13 Muslim               
## 14 Orthodox             
## 15 OtherChristian       
## 16 OtherFaiths          
## 17 OtherWorldReligions  
## 18 Unaffiliated
religionWideTbl <- as_tibble(religionWide)
head(religionWideTbl)
## # A tibble: 6 × 11
##   religion under10k btw10and20k btw20and30k btw30and40k
##   <chr>       <int>       <int>       <int>       <int>
## 1 Agnostic       27          34          60          81
## 2 Atheist        12          27          37          52
## 3 Buddhist       27          21          30          34
## 4 Catholic      418         617         732         670
## 5 DoNotKn…       15          14          15          11
## 6 Evangel…      575         869        1064         982
## # … with 6 more variables: btw40and50k <int>,
## #   btw50and75k <int>, btw75and100k <int>,
## #   btw100and150k <int>, over150k <int>,
## #   DoNotKnowOrRefused <int>
religionWideTbl[,1]
## # A tibble: 18 × 1
##    religion             
##    <chr>                
##  1 Agnostic             
##  2 Atheist              
##  3 Buddhist             
##  4 Catholic             
##  5 DoNotKnowOrRefused   
##  6 EvangelicalProt      
##  7 Hindu                
##  8 HistoricallyBlackProt
##  9 JehovahsWitness      
## 10 Jewish               
## 11 MainlineProt         
## 12 Mormon               
## 13 Muslim               
## 14 Orthodox             
## 15 OtherChristian       
## 16 OtherFaiths          
## 17 OtherWorldReligions  
## 18 Unaffiliated

Note that religionWide was created by pivot_wider() and so is already a tibble, which is why religionWide[,1] above remains a one-column data frame. With a plain data frame, subsetting down to one column simplifies the result to a vector. This might not seem like a big deal; however, it can be very frustrating and can break your code when you expect an object to behave like a data frame and it doesn’t because it’s now a vector. An object converted with the as_tibble() function remains a data frame even when subset down to one dimension (there is no automatic simplification). Converting data frames using as_tibble() is not required for using dplyr, but it is convenient. Also, it is important to note that a tibble is simply a wrapper around a data frame that provides some additional behaviors. The newly formed tibble object will still behave like a data frame (because it technically still is a data frame) but with some added niceties (some of which are illustrated below).
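The simplification behavior of a plain data frame is easy to see in a minimal sketch; base R's drop = FALSE argument offers similar protection without tibbles (the data frame here is hypothetical):

```r
df <- data.frame(x = 1:3, y = 4:6)

# a plain data frame simplifies a single column to a vector
is.data.frame(df[, 1])
## [1] FALSE

# drop = FALSE suppresses the simplification
is.data.frame(df[, 1, drop = FALSE])
## [1] TRUE
```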

6.7.2 Filtering Data by Row

Recall the gapminder data. These data are available in tab-separated format in gapminder.tsv, and can be read in using read.delim() (or the related read functions described previously). The read.delim() function defaults to header = TRUE so this doesn’t need to be specified explicitly. In this section we will be working with the gapminder data often, so we will use a short name for the data frame to save typing.

u.gm <- "https://www.finley-lab.com/files/data/gapminder.tsv"
gm <- read.delim(u.gm)
gm <- as_tibble(gm)
str(gm)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
head(gm)
## # A tibble: 6 × 6
##   country      year     pop continent lifeExp gdpPercap
##   <chr>       <int>   <dbl> <chr>       <dbl>     <dbl>
## 1 Afghanistan  1952  8.43e6 Asia         28.8      779.
## 2 Afghanistan  1957  9.24e6 Asia         30.3      821.
## 3 Afghanistan  1962  1.03e7 Asia         32.0      853.
## 4 Afghanistan  1967  1.15e7 Asia         34.0      836.
## 5 Afghanistan  1972  1.31e7 Asia         36.1      740.
## 6 Afghanistan  1977  1.49e7 Asia         38.4      786.

Filtering helps us to examine subsets of the data such as data from a particular country, from several specified countries, from certain years, from countries with certain populations, etc. Some examples:

filter(gm, country == "Brazil")
## # A tibble: 12 × 6
##    country  year       pop continent lifeExp gdpPercap
##    <chr>   <int>     <dbl> <chr>       <dbl>     <dbl>
##  1 Brazil   1952  56602560 Americas     50.9     2109.
##  2 Brazil   1957  65551171 Americas     53.3     2487.
##  3 Brazil   1962  76039390 Americas     55.7     3337.
##  4 Brazil   1967  88049823 Americas     57.6     3430.
##  5 Brazil   1972 100840058 Americas     59.5     4986.
##  6 Brazil   1977 114313951 Americas     61.5     6660.
##  7 Brazil   1982 128962939 Americas     63.3     7031.
##  8 Brazil   1987 142938076 Americas     65.2     7807.
##  9 Brazil   1992 155975974 Americas     67.1     6950.
## 10 Brazil   1997 168546719 Americas     69.4     7958.
## 11 Brazil   2002 179914212 Americas     71.0     8131.
## 12 Brazil   2007 190010647 Americas     72.4     9066.
filter(gm, country %in% c("Brazil", "Mexico"))
## # A tibble: 24 × 6
##    country  year       pop continent lifeExp gdpPercap
##    <chr>   <int>     <dbl> <chr>       <dbl>     <dbl>
##  1 Brazil   1952  56602560 Americas     50.9     2109.
##  2 Brazil   1957  65551171 Americas     53.3     2487.
##  3 Brazil   1962  76039390 Americas     55.7     3337.
##  4 Brazil   1967  88049823 Americas     57.6     3430.
##  5 Brazil   1972 100840058 Americas     59.5     4986.
##  6 Brazil   1977 114313951 Americas     61.5     6660.
##  7 Brazil   1982 128962939 Americas     63.3     7031.
##  8 Brazil   1987 142938076 Americas     65.2     7807.
##  9 Brazil   1992 155975974 Americas     67.1     6950.
## 10 Brazil   1997 168546719 Americas     69.4     7958.
## # … with 14 more rows
filter(gm, country %in% c("Brazil", "Mexico") & year %in% c(1952, 1972))
## # A tibble: 4 × 6
##   country  year       pop continent lifeExp gdpPercap
##   <chr>   <int>     <dbl> <chr>       <dbl>     <dbl>
## 1 Brazil   1952  56602560 Americas     50.9     2109.
## 2 Brazil   1972 100840058 Americas     59.5     4986.
## 3 Mexico   1952  30144317 Americas     50.8     3478.
## 4 Mexico   1972  55984294 Americas     62.4     6809.
filter(gm, pop > 300000000)
## # A tibble: 25 × 6
##    country  year        pop continent lifeExp gdpPercap
##    <chr>   <int>      <dbl> <chr>       <dbl>     <dbl>
##  1 China    1952     5.56e8 Asia         44        400.
##  2 China    1957     6.37e8 Asia         50.5      576.
##  3 China    1962     6.66e8 Asia         44.5      488.
##  4 China    1967     7.55e8 Asia         58.4      613.
##  5 China    1972     8.62e8 Asia         63.1      677.
##  6 China    1977     9.43e8 Asia         64.0      741.
##  7 China    1982     1.00e9 Asia         65.5      962.
##  8 China    1987     1.08e9 Asia         67.3     1379.
##  9 China    1992     1.16e9 Asia         68.7     1656.
## 10 China    1997     1.23e9 Asia         70.4     2289.
## # … with 15 more rows
filter(gm, pop > 300000000 & year == 2007)
## # A tibble: 3 × 6
##   country       year    pop continent lifeExp gdpPercap
##   <chr>        <int>  <dbl> <chr>       <dbl>     <dbl>
## 1 China         2007 1.32e9 Asia         73.0     4959.
## 2 India         2007 1.11e9 Asia         64.7     2452.
## 3 United Stat…  2007 3.01e8 Americas     78.2    42952.

Notice the full results are not printed. For example, when we asked for the data for Brazil and Mexico, only the first ten rows were printed. This is an effect of the tibble print method (recall we converted gm with as_tibble()). Of course, if we want to analyze the results (as we will below), the full set of data is available.
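When you do want more rows at the console, the tibble print method accepts an n argument; a small self-contained illustration (the tibble here is hypothetical):

```r
library(dplyr)

tb <- as_tibble(data.frame(id = 1:15, value = (1:15)^2))

tb                  # default print shows only the first ten rows
print(tb, n = 15)   # show 15 rows
print(tb, n = Inf)  # n = Inf prints every row
```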

6.7.3 Selecting Variables by Column

Continuing with the gapminder data, another common task is to restrict attention to some subset of variables in the data set. The select() function does this.

select(gm, country, year, lifeExp)
## # A tibble: 1,704 × 3
##    country      year lifeExp
##    <chr>       <int>   <dbl>
##  1 Afghanistan  1952    28.8
##  2 Afghanistan  1957    30.3
##  3 Afghanistan  1962    32.0
##  4 Afghanistan  1967    34.0
##  5 Afghanistan  1972    36.1
##  6 Afghanistan  1977    38.4
##  7 Afghanistan  1982    39.9
##  8 Afghanistan  1987    40.8
##  9 Afghanistan  1992    41.7
## 10 Afghanistan  1997    41.8
## # … with 1,694 more rows
select(gm, 2:4)
## # A tibble: 1,704 × 3
##     year      pop continent
##    <int>    <dbl> <chr>    
##  1  1952  8425333 Asia     
##  2  1957  9240934 Asia     
##  3  1962 10267083 Asia     
##  4  1967 11537966 Asia     
##  5  1972 13079460 Asia     
##  6  1977 14880372 Asia     
##  7  1982 12881816 Asia     
##  8  1987 13867957 Asia     
##  9  1992 16317921 Asia     
## 10  1997 22227415 Asia     
## # … with 1,694 more rows
select(gm, -c(2,3,4))
## # A tibble: 1,704 × 3
##    country     lifeExp gdpPercap
##    <chr>         <dbl>     <dbl>
##  1 Afghanistan    28.8      779.
##  2 Afghanistan    30.3      821.
##  3 Afghanistan    32.0      853.
##  4 Afghanistan    34.0      836.
##  5 Afghanistan    36.1      740.
##  6 Afghanistan    38.4      786.
##  7 Afghanistan    39.9      978.
##  8 Afghanistan    40.8      852.
##  9 Afghanistan    41.7      649.
## 10 Afghanistan    41.8      635.
## # … with 1,694 more rows
select(gm, starts_with("c"))
## # A tibble: 1,704 × 2
##    country     continent
##    <chr>       <chr>    
##  1 Afghanistan Asia     
##  2 Afghanistan Asia     
##  3 Afghanistan Asia     
##  4 Afghanistan Asia     
##  5 Afghanistan Asia     
##  6 Afghanistan Asia     
##  7 Afghanistan Asia     
##  8 Afghanistan Asia     
##  9 Afghanistan Asia     
## 10 Afghanistan Asia     
## # … with 1,694 more rows

Notice a few things. Variables can be selected by name or column number. As usual, a negative sign tells R to leave something out. And there are special functions such as starts_with that provide ways to match part of a variable’s name.
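Other helpers work the same way; for example, ends_with() matches the end of a name. A sketch on a small hypothetical data frame:

```r
library(dplyr)

df <- data.frame(lifeExp = 1, gdpPercap = 2, pop = 3)

# select every column whose name ends with "Exp"
names(select(df, ends_with("Exp")))
## [1] "lifeExp"
```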

6.7.4 Practice Problem

Use the contains() function to select only the columns that contain a c in the gapminder data set.

6.7.5 Pipes

Consider selecting the country, year, and population for countries in Asia or Europe. One possibility is to nest a filter() function inside a select() function.

select(filter(gm, continent %in% c("Asia", "Europe")), country, year, pop)
## # A tibble: 756 × 3
##    country      year      pop
##    <chr>       <int>    <dbl>
##  1 Afghanistan  1952  8425333
##  2 Afghanistan  1957  9240934
##  3 Afghanistan  1962 10267083
##  4 Afghanistan  1967 11537966
##  5 Afghanistan  1972 13079460
##  6 Afghanistan  1977 14880372
##  7 Afghanistan  1982 12881816
##  8 Afghanistan  1987 13867957
##  9 Afghanistan  1992 16317921
## 10 Afghanistan  1997 22227415
## # … with 746 more rows

Even a two-step process like this becomes hard to follow in this nested form, and often we will want to perform more than two operations. There is a nice feature in dplyr that allows us to “feed” results of one function into the first argument of a subsequent function. Another way of saying this is that we are “piping” the results into another function. The %>% operator does the piping. Here we again restrict attention to country, year, and population for countries in Asia or Europe41.

gm %>% 
  filter(continent %in% c("Asia", "Europe")) %>% 
  select(country, year, pop)
## # A tibble: 756 × 3
##    country      year      pop
##    <chr>       <int>    <dbl>
##  1 Afghanistan  1952  8425333
##  2 Afghanistan  1957  9240934
##  3 Afghanistan  1962 10267083
##  4 Afghanistan  1967 11537966
##  5 Afghanistan  1972 13079460
##  6 Afghanistan  1977 14880372
##  7 Afghanistan  1982 12881816
##  8 Afghanistan  1987 13867957
##  9 Afghanistan  1992 16317921
## 10 Afghanistan  1997 22227415
## # … with 746 more rows

It can help to think of %>% as representing the word “then”. The above can be read as, “Start with the data frame gm then filter it to select data from the continents Asia and Europe then select the variables country, year, and population from these data”.

The pipe operator %>% is not restricted to functions in dplyr. In fact the pipe operator itself was introduced in another package called magrittr, but is included in dplyr as a convenience.
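Because the pipe simply feeds its left-hand side into the next function, it works with base functions too; a minimal example (loading dplyr to make %>% available):

```r
library(dplyr)

x <- c(1, 4, 9)

# sum(sqrt(x)) written as a pipeline
x %>% sqrt() %>% sum()
## [1] 6
```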

6.7.6 Arranging Data by Row

By default the gapminder data are arranged by country and then by year.

head(gm, 15)
## # A tibble: 15 × 6
##    country      year    pop continent lifeExp gdpPercap
##    <chr>       <int>  <dbl> <chr>       <dbl>     <dbl>
##  1 Afghanistan  1952 8.43e6 Asia         28.8      779.
##  2 Afghanistan  1957 9.24e6 Asia         30.3      821.
##  3 Afghanistan  1962 1.03e7 Asia         32.0      853.
##  4 Afghanistan  1967 1.15e7 Asia         34.0      836.
##  5 Afghanistan  1972 1.31e7 Asia         36.1      740.
##  6 Afghanistan  1977 1.49e7 Asia         38.4      786.
##  7 Afghanistan  1982 1.29e7 Asia         39.9      978.
##  8 Afghanistan  1987 1.39e7 Asia         40.8      852.
##  9 Afghanistan  1992 1.63e7 Asia         41.7      649.
## 10 Afghanistan  1997 2.22e7 Asia         41.8      635.
## 11 Afghanistan  2002 2.53e7 Asia         42.1      727.
## 12 Afghanistan  2007 3.19e7 Asia         43.8      975.
## 13 Albania      1952 1.28e6 Europe       55.2     1601.
## 14 Albania      1957 1.48e6 Europe       59.3     1942.
## 15 Albania      1962 1.73e6 Europe       64.8     2313.

Possibly arranging the data by year and then country would be desired. The arrange() function makes this easy. We will again use pipes.

gm %>% 
  arrange(year, country)
## # A tibble: 1,704 × 6
##    country      year    pop continent lifeExp gdpPercap
##    <chr>       <int>  <dbl> <chr>       <dbl>     <dbl>
##  1 Afghanistan  1952 8.43e6 Asia         28.8      779.
##  2 Albania      1952 1.28e6 Europe       55.2     1601.
##  3 Algeria      1952 9.28e6 Africa       43.1     2449.
##  4 Angola       1952 4.23e6 Africa       30.0     3521.
##  5 Argentina    1952 1.79e7 Americas     62.5     5911.
##  6 Australia    1952 8.69e6 Oceania      69.1    10040.
##  7 Austria      1952 6.93e6 Europe       66.8     6137.
##  8 Bahrain      1952 1.20e5 Asia         50.9     9867.
##  9 Bangladesh   1952 4.69e7 Asia         37.5      684.
## 10 Belgium      1952 8.73e6 Europe       68       8343.
## # … with 1,694 more rows

How about the data for Rwanda, arranged in order of life expectancy.

gm %>% 
  filter(country == "Rwanda") %>% 
  arrange(lifeExp)
## # A tibble: 12 × 6
##    country  year     pop continent lifeExp gdpPercap
##    <chr>   <int>   <dbl> <chr>       <dbl>     <dbl>
##  1 Rwanda   1992 7290203 Africa       23.6      737.
##  2 Rwanda   1997 7212583 Africa       36.1      590.
##  3 Rwanda   1952 2534927 Africa       40        493.
##  4 Rwanda   1957 2822082 Africa       41.5      540.
##  5 Rwanda   1962 3051242 Africa       43        597.
##  6 Rwanda   2002 7852401 Africa       43.4      786.
##  7 Rwanda   1987 6349365 Africa       44.0      848.
##  8 Rwanda   1967 3451079 Africa       44.1      511.
##  9 Rwanda   1972 3992121 Africa       44.6      591.
## 10 Rwanda   1977 4657072 Africa       45        670.
## 11 Rwanda   1982 5507565 Africa       46.2      882.
## 12 Rwanda   2007 8860588 Africa       46.2      863.

Possibly we want these data to be in decreasing (descending) order. Here, desc() is one of many dplyr helper functions.

gm %>% 
  filter(country == "Rwanda") %>% 
  arrange(desc(lifeExp))
## # A tibble: 12 × 6
##    country  year     pop continent lifeExp gdpPercap
##    <chr>   <int>   <dbl> <chr>       <dbl>     <dbl>
##  1 Rwanda   2007 8860588 Africa       46.2      863.
##  2 Rwanda   1982 5507565 Africa       46.2      882.
##  3 Rwanda   1977 4657072 Africa       45        670.
##  4 Rwanda   1972 3992121 Africa       44.6      591.
##  5 Rwanda   1967 3451079 Africa       44.1      511.
##  6 Rwanda   1987 6349365 Africa       44.0      848.
##  7 Rwanda   2002 7852401 Africa       43.4      786.
##  8 Rwanda   1962 3051242 Africa       43        597.
##  9 Rwanda   1957 2822082 Africa       41.5      540.
## 10 Rwanda   1952 2534927 Africa       40        493.
## 11 Rwanda   1997 7212583 Africa       36.1      590.
## 12 Rwanda   1992 7290203 Africa       23.6      737.

Possibly we want to include only the year and life expectancy, to make the message more stark.

gm %>% 
  filter(country == "Rwanda") %>% 
  select(year, lifeExp) %>% 
  arrange(desc(lifeExp)) 
## # A tibble: 12 × 2
##     year lifeExp
##    <int>   <dbl>
##  1  2007    46.2
##  2  1982    46.2
##  3  1977    45  
##  4  1972    44.6
##  5  1967    44.1
##  6  1987    44.0
##  7  2002    43.4
##  8  1962    43  
##  9  1957    41.5
## 10  1952    40  
## 11  1997    36.1
## 12  1992    23.6

For analyzing data in R, the order shouldn’t matter. But for presentation to human eyes, the order is important.

6.7.7 Practice Problem

It is worth your while to get comfortable with using pipes. Here is some hard-to-read code. Convert it into easier-to-read code using pipes.

arrange(select(filter(gm, country == "Afghanistan"), 
               c("year", "lifeExp")), desc(lifeExp))

6.7.8 Renaming Variables

The dplyr package has a rename function that makes renaming variables in a data frame quite easy.

gm <- rename(gm, population = pop)
head(gm)
## # A tibble: 6 × 6
##   country   year population continent lifeExp gdpPercap
##   <chr>    <int>      <dbl> <chr>       <dbl>     <dbl>
## 1 Afghani…  1952    8425333 Asia         28.8      779.
## 2 Afghani…  1957    9240934 Asia         30.3      821.
## 3 Afghani…  1962   10267083 Asia         32.0      853.
## 4 Afghani…  1967   11537966 Asia         34.0      836.
## 5 Afghani…  1972   13079460 Asia         36.1      740.
## 6 Afghani…  1977   14880372 Asia         38.4      786.
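
Note that rename() accepts any number of new_name = old_name pairs, so several variables can be renamed in a single call. Here is a quick sketch (the new names are just for illustration, and we do not save the result):

gm %>% 
  rename(life_exp = lifeExp, gdp_per_cap = gdpPercap) %>% 
  names()
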

6.7.9 Data Summaries and Grouping

The summarize() function computes summary statistics, using user-provided functions, for one or more columns of a data frame.

summarize(gm, meanpop = mean(population), medpop = median(population))
## # A tibble: 1 × 2
##     meanpop   medpop
##       <dbl>    <dbl>
## 1 29601212. 7023596.
# or
gm %>% 
  summarize(meanpop = mean(population), medpop = median(population))
## # A tibble: 1 × 2
##     meanpop   medpop
##       <dbl>    <dbl>
## 1 29601212. 7023596.
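
If your version of dplyr is 1.0.0 or later, the across() helper lets you apply the same summary function to several columns at once without typing each one out (a sketch; older versions provided summarize_at() for this purpose):

gm %>% 
  summarize(across(c(population, gdpPercap), median))
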

Often we want summaries for specific components of the data. For example, we might want the median life expectancy for each continent separately. One option is subsetting:

median(gm$lifeExp[gm$continent == "Africa"])
## [1] 47.79
median(gm$lifeExp[gm$continent == "Asia"])
## [1] 61.79
median(gm$lifeExp[gm$continent == "Europe"])
## [1] 72.24
median(gm$lifeExp[gm$continent == "Americas"])
## [1] 67.05
median(gm$lifeExp[gm$continent == "Oceania"])
## [1] 73.66

The group_by() function makes this easier, and makes the output more useful.

gm %>% 
  group_by(continent) %>% 
  summarize(medLifeExp = median(lifeExp)) 
## # A tibble: 5 × 2
##   continent medLifeExp
##   <chr>          <dbl>
## 1 Africa          47.8
## 2 Americas        67.0
## 3 Asia            61.8
## 4 Europe          72.2
## 5 Oceania         73.7

Or if we want the results ordered by the median life expectancy:

gm %>% 
  group_by(continent) %>% 
  summarize(medLifeExp = median(lifeExp)) %>% 
  arrange(medLifeExp)
## # A tibble: 5 × 2
##   continent medLifeExp
##   <chr>          <dbl>
## 1 Africa          47.8
## 2 Asia            61.8
## 3 Americas        67.0
## 4 Europe          72.2
## 5 Oceania         73.7

As another example, we calculate the number of observations per continent (using the n() helper function), and then, for each continent, how many distinct countries are represented (using n_distinct()).

gm %>% 
  group_by(continent) %>% 
  summarize(numObs = n())
## # A tibble: 5 × 2
##   continent numObs
##   <chr>      <int>
## 1 Africa       624
## 2 Americas     300
## 3 Asia         396
## 4 Europe       360
## 5 Oceania       24
gm %>%
  group_by(continent) %>%
  summarize(n_obs = n(), n_countries = n_distinct(country))
## # A tibble: 5 × 3
##   continent n_obs n_countries
##   <chr>     <int>       <int>
## 1 Africa      624          52
## 2 Americas    300          25
## 3 Asia        396          33
## 4 Europe      360          30
## 5 Oceania      24           2
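
The group_by() function also accepts more than one grouping variable. For example, to get the median life expectancy for each continent in each year (the .groups = "drop" argument, available in dplyr 1.0.0 and later, removes the grouping from the result):

gm %>% 
  group_by(continent, year) %>% 
  summarize(medLifeExp = median(lifeExp), .groups = "drop")
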

Here is a somewhat more involved example that calculates the minimum and maximum life expectancies for countries in Africa by year.

gm %>%
  filter(continent == "Africa") %>%
  group_by(year) %>%
  summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp))
## # A tibble: 12 × 3
##     year min_lifeExp max_lifeExp
##    <int>       <dbl>       <dbl>
##  1  1952        30          52.7
##  2  1957        31.6        58.1
##  3  1962        32.8        60.2
##  4  1967        34.1        61.6
##  5  1972        35.4        64.3
##  6  1977        36.8        67.1
##  7  1982        38.4        69.9
##  8  1987        39.9        71.9
##  9  1992        23.6        73.6
## 10  1997        36.1        74.8
## 11  2002        39.2        75.7
## 12  2007        39.6        76.4

This is interesting, but the results don’t include the countries that achieved the minimum and maximum life expectancies. Here is one way to add them. We start with the minimum life expectancy; note that within each group the rank of the minimum value is 1.

gm %>% 
  select(country, continent, year, lifeExp) %>% 
  group_by(year) %>% 
  arrange(year) %>% 
  filter(rank(lifeExp) == 1)
## # A tibble: 12 × 4
## # Groups:   year [12]
##    country      continent  year lifeExp
##    <chr>        <chr>     <int>   <dbl>
##  1 Afghanistan  Asia       1952    28.8
##  2 Afghanistan  Asia       1957    30.3
##  3 Afghanistan  Asia       1962    32.0
##  4 Afghanistan  Asia       1967    34.0
##  5 Sierra Leone Africa     1972    35.4
##  6 Cambodia     Asia       1977    31.2
##  7 Sierra Leone Africa     1982    38.4
##  8 Angola       Africa     1987    39.9
##  9 Rwanda       Africa     1992    23.6
## 10 Rwanda       Africa     1997    36.1
## 11 Zambia       Africa     2002    39.2
## 12 Swaziland    Africa     2007    39.6

Next we add the maximum life expectancy. To do this we need to understand the desc() function, which transforms a vector into a numeric vector that, when sorted in ascending order, puts the original values in descending order. Here are some examples.

desc(1:5)
## [1] -1 -2 -3 -4 -5
desc(c(2,3,1,5,6,-4))
## [1] -2 -3 -1 -5 -6  4
desc(c("a", "c", "b", "w", "e"))
## [1] -1 -3 -2 -5 -4

We now use this to extract both the minimum and maximum life expectancies. Recall that | represents “or”. Also, by default only the first few rows of a tibble are printed, so we pipe the output to print(n = 24) to ask for all 24 rows.

gm %>% 
  select(country, continent, year, lifeExp) %>% 
  group_by(year) %>% 
  arrange(year) %>% 
  filter(rank(lifeExp) == 1 | rank(desc(lifeExp)) == 1) %>% 
  print(n=24)
## # A tibble: 24 × 4
## # Groups:   year [12]
##    country      continent  year lifeExp
##    <chr>        <chr>     <int>   <dbl>
##  1 Afghanistan  Asia       1952    28.8
##  2 Norway       Europe     1952    72.7
##  3 Afghanistan  Asia       1957    30.3
##  4 Iceland      Europe     1957    73.5
##  5 Afghanistan  Asia       1962    32.0
##  6 Iceland      Europe     1962    73.7
##  7 Afghanistan  Asia       1967    34.0
##  8 Sweden       Europe     1967    74.2
##  9 Sierra Leone Africa     1972    35.4
## 10 Sweden       Europe     1972    74.7
## 11 Cambodia     Asia       1977    31.2
## 12 Iceland      Europe     1977    76.1
## 13 Japan        Asia       1982    77.1
## 14 Sierra Leone Africa     1982    38.4
## 15 Angola       Africa     1987    39.9
## 16 Japan        Asia       1987    78.7
## 17 Japan        Asia       1992    79.4
## 18 Rwanda       Africa     1992    23.6
## 19 Japan        Asia       1997    80.7
## 20 Rwanda       Africa     1997    36.1
## 21 Japan        Asia       2002    82  
## 22 Zambia       Africa     2002    39.2
## 23 Japan        Asia       2007    82.6
## 24 Swaziland    Africa     2007    39.6
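
If you have dplyr 1.0.0 or later, the slice_min() helper offers a more direct route to the rows achieving each year’s minimum (a sketch; slice_max() works analogously for the maximum):

gm %>% 
  select(country, continent, year, lifeExp) %>% 
  group_by(year) %>% 
  slice_min(lifeExp, n = 1)
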

6.7.10 Creating New Variables

The $ notation provides a simple way to create new variables in a data frame. The mutate() function provides another, sometimes cleaner way to do this. We will use mutate() along with the lag() function to investigate changes in life expectancy over five years for the gapminder data. We’ll do this in a few steps. First, we create a variable that measures the change in life expectancy and remove the population and GDP variables that are not of interest. We have to be careful to first group by country, since we want to calculate the change in life expectancy by country.

gm %>% 
  group_by(country) %>% 
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>% 
  select(-c(population, gdpPercap))
## # A tibble: 1,704 × 5
## # Groups:   country [142]
##    country      year continent lifeExp changeLifeExp
##    <chr>       <int> <chr>       <dbl>         <dbl>
##  1 Afghanistan  1952 Asia         28.8       NA     
##  2 Afghanistan  1957 Asia         30.3        1.53  
##  3 Afghanistan  1962 Asia         32.0        1.66  
##  4 Afghanistan  1967 Asia         34.0        2.02  
##  5 Afghanistan  1972 Asia         36.1        2.07  
##  6 Afghanistan  1977 Asia         38.4        2.35  
##  7 Afghanistan  1982 Asia         39.9        1.42  
##  8 Afghanistan  1987 Asia         40.8        0.968 
##  9 Afghanistan  1992 Asia         41.7        0.852 
## 10 Afghanistan  1997 Asia         41.8        0.0890
## # … with 1,694 more rows
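
To see exactly what lag() computes, here is a small standalone example: lag() shifts a vector forward by one position, padding the front with NA, so subtracting the lagged vector from the original gives successive differences.

x <- c(10, 12, 15, 14)
lag(x)
## [1] NA 10 12 15
x - lag(x)
## [1] NA  2  3 -1
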

Next, summarize by computing the largest drop in life expectancy.

gm %>% 
  group_by(country) %>% 
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>% 
  select(-c(population, gdpPercap)) %>% 
  summarize(largestDropLifeExp = min(changeLifeExp))
## # A tibble: 142 × 2
##    country     largestDropLifeExp
##    <chr>                    <dbl>
##  1 Afghanistan                 NA
##  2 Albania                     NA
##  3 Algeria                     NA
##  4 Angola                      NA
##  5 Argentina                   NA
##  6 Australia                   NA
##  7 Austria                     NA
##  8 Bahrain                     NA
##  9 Bangladesh                  NA
## 10 Belgium                     NA
## # … with 132 more rows

Oops. We forgot that since we don’t have data from before 1952, each country’s first change in life expectancy is NA, and min() returns NA when any value is NA. Let’s try again.

gm %>% 
  group_by(country) %>% 
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>% 
  select(-c(population, gdpPercap)) %>% 
  summarize(largestDropLifeExp = min(changeLifeExp, na.rm = TRUE))
## # A tibble: 142 × 2
##    country     largestDropLifeExp
##    <chr>                    <dbl>
##  1 Afghanistan             0.0890
##  2 Albania                -0.419 
##  3 Algeria                 1.31  
##  4 Angola                 -0.0360
##  5 Argentina               0.492 
##  6 Australia               0.170 
##  7 Austria                 0.490 
##  8 Bahrain                 0.840 
##  9 Bangladesh              1.67  
## 10 Belgium                 0.5   
## # … with 132 more rows

That’s not quite what we wanted. We could arrange these results by the size of the drop, but it would be good to have the continent and year printed as well. So we’ll take a slightly different approach, arranging the individual rows in increasing order of the change in life expectancy.

gm %>% 
  group_by(country) %>% 
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>% 
  select(-c(population, gdpPercap)) %>% 
  arrange(changeLifeExp)
## # A tibble: 1,704 × 5
## # Groups:   country [142]
##    country       year continent lifeExp changeLifeExp
##    <chr>        <int> <chr>       <dbl>         <dbl>
##  1 Rwanda        1992 Africa       23.6        -20.4 
##  2 Zimbabwe      1997 Africa       46.8        -13.6 
##  3 Lesotho       2002 Africa       44.6        -11.0 
##  4 Swaziland     2002 Africa       43.9        -10.4 
##  5 Botswana      1997 Africa       52.6        -10.2 
##  6 Cambodia      1977 Asia         31.2         -9.10
##  7 Namibia       2002 Africa       51.5         -7.43
##  8 South Africa  2002 Africa       53.4         -6.87
##  9 Zimbabwe      2002 Africa       40.0         -6.82
## 10 China         1962 Asia         44.5         -6.05
## # … with 1,694 more rows

That’s still not quite right. Because the data are grouped by country, R did the ordering within group. If we want to see the largest drops overall, we need to remove the grouping.

gm %>% 
  group_by(country) %>% 
  mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>% 
  select(-c(population, gdpPercap)) %>% 
  ungroup() %>% 
  arrange(changeLifeExp) %>% 
  print(n=20)
## # A tibble: 1,704 × 5
##    country       year continent lifeExp changeLifeExp
##    <chr>        <int> <chr>       <dbl>         <dbl>
##  1 Rwanda        1992 Africa       23.6        -20.4 
##  2 Zimbabwe      1997 Africa       46.8        -13.6 
##  3 Lesotho       2002 Africa       44.6        -11.0 
##  4 Swaziland     2002 Africa       43.9        -10.4 
##  5 Botswana      1997 Africa       52.6        -10.2 
##  6 Cambodia      1977 Asia         31.2         -9.10
##  7 Namibia       2002 Africa       51.5         -7.43
##  8 South Africa  2002 Africa       53.4         -6.87
##  9 Zimbabwe      2002 Africa       40.0         -6.82
## 10 China         1962 Asia         44.5         -6.05
## 11 Botswana      2002 Africa       46.6         -5.92
## 12 Zambia        1997 Africa       40.2         -5.86
## 13 Iraq          1992 Asia         59.5         -5.58
## 14 Liberia       1992 Africa       40.8         -5.23
## 15 Cambodia      1972 Asia         40.3         -5.10
## 16 Kenya         1997 Africa       54.4         -4.88
## 17 Somalia       1992 Africa       39.7         -4.84
## 18 Zambia        1992 Africa       46.1         -4.72
## 19 Swaziland     2007 Africa       39.6         -4.26
## 20 Uganda        1997 Africa       44.6         -4.25
## # … with 1,684 more rows

6.7.11 Practice Problem

As you progress in your data science career, we are sure you will find dplyr to be one of the most useful packages for initial data exploration. Here is one more practice problem to help you get comfortable with the syntax.

Recall the iris data set we have worked with multiple times. We want to look at the ratio of Sepal.Length to Petal.Length in the three species. Write a series of dplyr statements that groups the data by Species, creates a new column called s.p.ratio that is Sepal.Length divided by Petal.Length, and then computes the mean of this column for each species in a column called mean.ratio. Display the data in descending order of mean.ratio.

6.8 Exercises

Exercise 7 Learning objectives: introduce with(), tapply(), and cut() functions; summarize data using the table() function with logical subsetting; practice using factor data types.

Exercise 8 Learning objectives: work with messy data; import data from an external spreadsheet; practice using functions in tidyr and graphing functions.

Exercise 9 Learning objectives: work with several key dplyr functions; manipulate data frames (actually tibbles); summarize and visualize data from large data files.

References

Bray, Tim, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau. 2008. “Extensible Markup Language (XML) 1.0 (Fifth Edition).”

Burrill, Elizabeth A., Andrea M. Wilson, Jeffery A. Turner, Scott A. Pugh, James Menlove, Glenn Christiansen, Barbara L. Conkling, and Winnie David. 2018. “The Forest Inventory and Analysis Database: Database Description and User Guide Version 8.0 for Phase 2.” U.S. Department of Agriculture, Forest Service. 946 P. http://www.fia.fs.fed.us/library/database-documentation/.

Jones, Matthew, Margaret O’Brien, Bryce Mecum, Carl Boettiger, Mark Schildhauer, Mitchell Maier, Timothy Whiteaker, Stevan Earl, and Steven Chong. 2019. “Ecological Metadata Language Version 2.2.0.” https://doi.org/10.5063/f11834t2.

Pezoa, Felipe, Juan L Reutter, Fernando Suarez, Martı́n Ugarte, and Domagoj Vrgoč. 2016. “Foundations of JSON Schema.” In Proceedings of the 25th International Conference on World Wide Web, 263–73. International World Wide Web Conferences Steering Committee.

Spector, Phil. 2008. Data Manipulation with R. Use R! New York: Springer.


  1. These data come from the MASS R library.↩︎

  2. The use of the $ will become clear in Chapter 4.↩︎

  3. Recall NA stands for “Not Available”.↩︎

  4. Look at the help page for head. The default value is n = 6.↩︎

  5. Reminds me of The Trouble with Tribbles↩︎

  6. The text printed immediately after library(dplyr) means the stats and base packages, which are automatically loaded when you start R, have functions with the same names as functions in dplyr. So, for example, if you call the filter() or lag() functions, R will use dplyr’s versions. Use the :: operator to explicitly identify which package’s function you want, e.g., if you want stats’s lag() then call stats::lag().↩︎

  7. Notice the indentation used in the code. This is not necessary, as the code could be all on one line, but I often find it easier to read in this more organized format.↩︎