Chapter 4 Data Structures

A data structure is a format for organizing and storing data. The structure is designed so that data can be accessed and worked with in specific ways. Statistical software and programming languages have methods (or functions) designed to operate on different kinds of data structures.

This chapter’s focus is on data structures. To help initial understanding, the data in this chapter will be relatively modest in size and complexity. The ideas and methods, however, generalize to larger and more complex data sets.

The base data structures in R are vectors, matrices, arrays, data frames, and lists. The first three, vectors, matrices, and arrays, require all elements to be of the same type or homogeneous, e.g., all numeric or all character. Data frames and lists allow elements to be of different types or heterogeneous, e.g., some elements of a data frame may be numeric while other elements may be character. These base structures can also be organized by their dimensionality, i.e., 1-dimensional, 2-dimensional, or N-dimensional, as shown in Table 4.1.

TABLE 4.1: Dimension and type content of base data structures in R

Dimension   Homogeneous     Heterogeneous
1           Atomic vector   List
2           Matrix          Data frame
N           Array

R has no scalar types, i.e., 0-dimensional. Individual numbers or strings are actually vectors of length one.

An efficient way to understand what comprises a given object is to use the str() function. str() is short for structure and prints a compact, human-readable description of any R data structure. For example, in the code below, we prove to ourselves that what we might think of as a scalar value is actually a vector of length one.

a <- 1
str(a)
##  num 1
is.vector(a)
## [1] TRUE
length(a)
## [1] 1

Here we assigned a the scalar value one. The call str(a) prints num 1, which says a is numeric of length one. Then, just to be sure, we used the function is.vector() to test if a is in fact a vector. Then, just for fun, we asked the length of a, which again returns one. There is a set of similar logical tests for the other base data structures, e.g., is.matrix(), is.array(), is.data.frame(), and is.list(). These will all come in handy as we encounter different R objects.
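
As a brief sketch of these tests, the code below builds one small object of each base type and checks it. The object names (v, m, arr, df, lst) are made up for illustration and are not used elsewhere in the chapter.

```r
# Hypothetical objects, one per base data structure
v   <- c(1, 2, 3)                              # atomic vector
m   <- matrix(1:6, nrow = 2)                   # 2 x 3 matrix
arr <- array(1:24, dim = c(2, 3, 4))           # 3-dimensional array
df  <- data.frame(x = 1:2, y = c("a", "b"))    # data frame
lst <- list(1, "a", TRUE)                      # list

is.vector(v)
## [1] TRUE
is.matrix(m)
## [1] TRUE
is.array(arr)
## [1] TRUE
is.data.frame(df)
## [1] TRUE
is.list(lst)
## [1] TRUE
# A data frame is built on top of a list, so is.list() is TRUE for it too
is.list(df)
## [1] TRUE
# A matrix is a 2-dimensional array, so is.array() is TRUE for it too
is.array(m)
## [1] TRUE
```

The last two calls suggest that these tests are not mutually exclusive: an object can pass more than one of them.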

4.1 Vectors

Think of a vector as a structure to represent one variable in a data set. For example, a vector might hold the weights, in pounds, of 7 people in a data set. Or another vector might hold the genders of those 7 people. The c() function in R is useful for creating (small) vectors and for modifying existing vectors. Think of c as standing for “combine”.

weight <- c(123, 157, 205, 199, 223, 140, 105)
weight
## [1] 123 157 205 199 223 140 105
gender <- c("female", "female", "male", "female", "male",
            "male", "female")
gender
## [1] "female" "female" "male"   "female" "male"  
## [6] "male"   "female"

Notice that elements of a vector are separated by commas when using the c() function to create a vector. Also notice that character values are placed inside quotation marks.

The c() function also can be used to add to an existing vector. For example, if an eighth male person was included in the data set, and his weight was 194 pounds, the existing vectors could be modified as follows.

weight <- c(weight, 194)
gender <- c(gender, "male")
weight
## [1] 123 157 205 199 223 140 105 194
gender
## [1] "female" "female" "male"   "female" "male"  
## [6] "male"   "female" "male"

4.1.1 Types, Conversion, Coercion

Clearly it is important to distinguish between different types of vectors. For example, it makes sense to ask R to calculate the mean of the weights stored in weight, but does not make sense to ask R to compute the mean of the genders stored in gender. Vectors in R may have one of six different “types”: character, double, integer, logical, complex, and raw. We will not encounter the complex and raw types in everyday data analysis, and so we focus on the first four data types.

  1. character: consists of letters or words. Our vector gender is a character vector because it consists of the genders for each person in our dataset.
typeof(gender)
## [1] "character"
  2. double: a numeric object that can be an integer or non-integer value (e.g., 10, 4.2). Our vector weight is a double vector.
typeof(weight)
## [1] "double"
  3. integer: a numeric object that can only be an integer. It may be surprising to see that the variable weight is of type double, even though its values are all integers. By default, R creates a double type vector when numeric values are given via the c function. We can create an integer vector of the weight values by appending the letter L to each number as we place it in the vector:
weight.int <- c(123L, 157L, 205L, 199L, 223L, 140L, 105L, 194L)
typeof(weight.int)
## [1] "integer"
  4. logical: used to represent variables that can take values TRUE or FALSE. To illustrate logical vectors, imagine that each of the eight people in the data set was asked whether they were taking blood pressure medication, and the responses were coded as TRUE if the person answered yes, and FALSE if the person answered no.
bp <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
bp
## [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
typeof(bp)
## [1] "logical"

When it makes sense, it is possible to convert vectors to a different type. Consider the following examples.

weight.int <- as.integer(weight)
weight.int
## [1] 123 157 205 199 223 140 105 194
typeof(weight.int)
## [1] "integer"
weight.char <- as.character(weight)
weight.char
## [1] "123" "157" "205" "199" "223" "140" "105" "194"
bp.double <- as.double(bp)
bp.double
## [1] 1 1 0 1 0 0 1 1
gender.oops <- as.double(gender)
## Warning: NAs introduced by coercion
gender.oops
## [1] NA NA NA NA NA NA NA NA
sum(bp)
## [1] 5

The integer version of weight doesn’t look any different, but it is stored differently, which can be important both for computational efficiency and for interfacing with other languages such as C++. For everyday work, however, we will not worry about the distinction between integer and double types. Converting weight to character goes as expected: the character representations of the numbers replace the numbers themselves. Converting the logical vector bp to double is pretty straightforward too: FALSE is converted to zero, and TRUE is converted to one. Now think about converting the character vector gender to a numeric double vector. It’s not at all clear how to represent “female” and “male” as numbers. In fact, in this case R issues a warning and creates a double vector with each element set to NA, which is the representation of missing data. Finally consider the code sum(bp). Now bp is a logical vector, but when R sees that we are asking to sum this logical vector, it automatically converts it to a numerical vector and then adds the zeros and ones representing FALSE and TRUE.
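
A few further conversions may be worth trying. The sketch below uses made-up values and object names (trunc.int, chr.dbl, num.lgl) purely for illustration.

```r
# as.integer() drops the fractional part (it truncates toward zero)
trunc.int <- as.integer(c(2.9, -2.9))
trunc.int
## [1]  2 -2
# Character strings that look like numbers convert cleanly to double
chr.dbl <- as.double(c("3.5", "10"))
chr.dbl
## [1]  3.5 10.0
# Zero becomes FALSE; any other number becomes TRUE
num.lgl <- as.logical(c(0, 1, 5))
num.lgl
## [1] FALSE  TRUE  TRUE
```

Note that as.integer() does not round, so 2.9 becomes 2 rather than 3.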

R also has functions to test whether a vector is of a particular type.

is.double(weight)
## [1] TRUE
is.character(weight)
## [1] FALSE
is.integer(weight.int)
## [1] TRUE
is.logical(bp)
## [1] TRUE

4.1.1.1 Coercion

Consider the following examples.

xx <- c(1, 2, 3, TRUE)
xx
## [1] 1 2 3 1
yy <- c(1, 2, 3, "dog")
yy
## [1] "1"   "2"   "3"   "dog"
zz <- c(TRUE, FALSE, "cat")
zz
## [1] "TRUE"  "FALSE" "cat"
weight+bp
## [1] 124 158 205 200 223 140 106 195

Vectors in R can only contain elements of one type. If more than one type is included in a c() function, R silently coerces the vector to be of one type. The examples illustrate the hierarchy—if any element is a character, then the whole vector is character. If some elements are numeric (either integer or double) and other elements are logical, the whole vector is numeric. Note what happened when R was asked to add the numeric vector weight to the logical vector bp. The logical vector was silently coerced to be numeric, so that FALSE became zero and TRUE became one, and then the two numeric vectors were added.
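
One way to see the hierarchy for ourselves is to ask typeof() about small mixed vectors. This is only a sketch; the object names (t1 through t4) are invented for the example.

```r
t1 <- typeof(c(TRUE, 2L))               # logical and integer
t1
## [1] "integer"
t2 <- typeof(c(2L, 3.5))                # integer and double
t2
## [1] "double"
t3 <- typeof(c(3.5, "dog"))             # double and character
t3
## [1] "character"
t4 <- typeof(c(TRUE, 2L, 3.5, "dog"))   # all four at once
t4
## [1] "character"
```

In each case the result takes the type furthest along the sequence logical, integer, double, character.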

4.1.2 Accessing Specific Elements of Vectors

To access and possibly change specific elements of vectors, refer to the position of the element in square brackets. For example, weight[4] refers to the fourth element of the vector weight. Note that R starts the numbering of elements at 1, i.e., the first element of a vector x is x[1].

weight
## [1] 123 157 205 199 223 140 105 194
weight[5]
## [1] 223
weight[1:3]
## [1] 123 157 205
length(weight)
## [1] 8
weight[length(weight)]
## [1] 194
weight[]
## [1] 123 157 205 199 223 140 105 194
weight[3] <- 202
weight
## [1] 123 157 202 199 223 140 105 194

Note that including nothing in the square brackets results in the whole vector being returned.

Negative numbers in the square brackets tell R to omit the corresponding value. And a zero as a subscript returns nothing (more precisely, it returns a length zero vector of the appropriate type).

weight[-3]
## [1] 123 157 199 223 140 105 194
weight[-length(weight)]
## [1] 123 157 202 199 223 140 105
lessWeight <- weight[-c(1,3,5)]
lessWeight
## [1] 157 199 140 105 194
weight[0]
## numeric(0)
weight[c(0,2,1)]
## [1] 157 123
weight[c(-1, 2)]
## Error in weight[c(-1, 2)]: only 0's may be mixed with negative subscripts

Note that mixing zero and other nonzero subscripts is allowed, but mixing negative and positive subscripts is not allowed.

What about the (usual) case where we don’t know the positions of the elements we want? For example possibly we want the weights of all females in the data. Later we will learn how to subset using logical indices, which is a very powerful way to access desired elements of a vector.

4.1.3 Practice Problem

A bad programming technique that often plagues beginners is a technique called hardcoding. Consider the following simple vector containing data on the number of tree species found at ten different sites.

tree.sp <- c(10, 13, 15, 8, 2, 9, 10, 20, 9, 11)

Suppose we are interested in the second to last value of the data set. Since we know there are ten values in the data set, we do this as follows.

tree.sp[10 - 1]
## [1] 9

This is an example of hardcoding. But what if we attempt to use the same code on a second vector of tree species data that only has six sites?

tree.sp <- c(8, 4, 3, 2, 19, 3)
tree.sp[10 - 1]
## [1] NA

That’s clearly not what we want. Fix this code so we can always extract the second to last value in the vector, regardless of the length of the vector.

4.2 Factors

Categorical variables can be represented as character vectors. In many cases this simple representation is sufficient. Consider, however, two other categorical variables, one representing age via categories youth, young adult, middle age, senior, and another representing income via categories lower, middle, and upper. Suppose that for the small health data set, all the people are either middle aged or senior citizens. If we just represented the variable via a character vector, there would be no way to know that there are two other categories, representing youth and young adults, which happen not to be present in the data set. And for the income variable, the character vector representation does not explicitly indicate that there is an ordering of the levels.

Factors in R provide a more sophisticated way to represent categorical variables. Factors explicitly contain all possible levels, and allow ordering of levels.

age <- c("middle age", "senior", "middle age", "senior",
         "senior", "senior", "senior", "middle age")
income <- c("lower", "lower", "upper", "middle", "upper",
            "lower", "lower", "middle")
age
## [1] "middle age" "senior"     "middle age" "senior"    
## [5] "senior"     "senior"     "senior"     "middle age"
income
## [1] "lower"  "lower"  "upper"  "middle" "upper" 
## [6] "lower"  "lower"  "middle"
age <- factor(age, levels=c("youth", "young adult", "middle age",
                            "senior"))
age
## [1] middle age senior     middle age senior    
## [5] senior     senior     senior     middle age
## Levels: youth young adult middle age senior
income <- factor(income, levels=c("lower", "middle", "upper"),
                 ordered = TRUE)
income
## [1] lower  lower  upper  middle upper  lower  lower 
## [8] middle
## Levels: lower < middle < upper

In the factor version of age the levels are explicitly listed, so it is clear that the two included levels are not all the possible levels. And in the factor version of income, the ordering is explicit.
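
The sketch below, which uses shortened made-up vectors (age.fct and income.fct are hypothetical names), shows two practical consequences: counts are reported for every level, including unobserved ones, and an ordered factor supports comparisons such as >=.

```r
age.fct <- factor(c("middle age", "senior", "middle age", "senior"),
                  levels = c("youth", "young adult", "middle age", "senior"))
nlevels(age.fct)
## [1] 4
# Counts per level, in level order; unobserved levels get a count of zero
age.counts <- as.vector(table(age.fct))
age.counts
## [1] 0 0 2 2

income.fct <- factor(c("lower", "upper", "middle"),
                     levels = c("lower", "middle", "upper"),
                     ordered = TRUE)
# Which incomes are at least "middle"?
at.least.middle <- income.fct >= "middle"
at.least.middle
## [1] FALSE  TRUE  TRUE
```

The same comparison on a plain character vector would be alphabetical, which does not reflect the intended ordering of the income categories.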

In many cases the character vector representation of a categorical variable is sufficient and easier to work with. In this book, factors will not be used extensively. It is important to note that versions of R before 4.0.0 by default create a factor when character data are read in or placed in a data frame, and in those versions it is sometimes necessary to use the argument stringsAsFactors = FALSE to explicitly tell R not to do this. This is shown later in the chapter when data frames are introduced.

4.3 Missing Data, Infinity, etc.

Most real-world data sets have variables where some observations are missing. In a longitudinal study participants may drop out. In a survey, participants may decide not to respond to certain questions. Statistical software should be able to represent missing data and to analyze data sets in which some data are missing.

In R, the value NA is used for a missing data value. Since missing values may occur in numeric, character, and other types of data, and since R requires that a vector contain only elements of one type, there are different types of NA values. Usually R determines the appropriate type of NA value automatically. It is worth noting that the default type for NA is logical, and that NA is NOT the same as the character string "NA".

missingCharacter <- c("dog", "cat", NA, "pig", NA, "horse")
missingCharacter
## [1] "dog"   "cat"   NA      "pig"   NA      "horse"
is.na(missingCharacter)
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE
missingCharacter <- c(missingCharacter, "NA")
missingCharacter
## [1] "dog"   "cat"   NA      "pig"   NA      "horse"
## [7] "NA"
is.na(missingCharacter)
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
allMissing <- c(NA, NA, NA)
typeof(allMissing)
## [1] "logical"
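
The different types of NA can also be written explicitly. The short sketch below (object names are invented for the example) uses the typed constants NA_integer_, NA_real_, and NA_character_, and shows that c() picks the matching type automatically.

```r
typeof(NA)
## [1] "logical"
typeof(NA_integer_)
## [1] "integer"
typeof(NA_real_)
## [1] "double"
typeof(NA_character_)
## [1] "character"
# Inside c(), the logical NA is coerced to match the other elements
na.with.numbers <- c(1.5, NA)
typeof(na.with.numbers)
## [1] "double"
na.with.strings <- c("dog", NA)
typeof(na.with.strings)
## [1] "character"
```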

How should missing data be treated in computations, such as finding the mean or standard deviation of a variable? One possibility is to return NA. Another is to remove the missing value(s) and then perform the computation.

mean(c(1,2,3,NA,5))
## [1] NA
mean(c(1,2,3,NA,5), na.rm=TRUE)
## [1] 2.75

As this example shows, the default behavior for the mean() function is to return NA. If removal of the missing values and then computing the mean is desired, the argument na.rm is set to TRUE. Different R functions have different default behaviors, and there are other possible actions. Consulting the help for a function provides the details.
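
Several other summary functions accept the same na.rm argument. The following is a small sketch using a made-up vector x; the names total.na, total.ok, largest, and n.missing are invented for the example.

```r
x <- c(1, 2, 3, NA, 5)
total.na <- sum(x)                 # NA propagates by default
total.na
## [1] NA
total.ok <- sum(x, na.rm = TRUE)   # drop the NA, then sum
total.ok
## [1] 11
largest <- max(x, na.rm = TRUE)
largest
## [1] 5
# is.na() combined with sum() counts the missing values
n.missing <- sum(is.na(x))
n.missing
## [1] 1
```

Counting the missing values first is often a useful check before deciding whether removing them is reasonable.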

4.3.1 Practice Problem

Collecting data is often a messy process resulting in multiple errors in the data. Consider the following small vector representing the weights of 10 adults in pounds.

my.weights <- c(150, 138, 289, 239, 12, 103, 310, 200, 218, 178)

As far as I know, it’s not possible for an adult to weigh 12 pounds, so that is most likely an error. Change this value to NA, and then find the standard deviation of the weights after removing the NA value.

4.3.2 Infinity and NaN

What happens if R code requests division by zero, or results in a number that is too large to be represented? Here are some examples.

x <- 0:4
x
## [1] 0 1 2 3 4
1/x
## [1]    Inf 1.0000 0.5000 0.3333 0.2500
x/x
## [1] NaN   1   1   1   1
y <- c(10, 1000, 10000)
2^y
## [1]  1.024e+03 1.072e+301        Inf

Inf and -Inf represent infinity and negative infinity (and numbers which are too large in magnitude to be represented as floating point numbers). NaN represents the result of a calculation where the result is undefined, such as dividing zero by zero. All of these are common to a variety of programming languages, including R.
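
R provides tests for these special values. The sketch below applies is.finite(), is.infinite(), is.nan(), and is.na() to a made-up vector z that contains each kind of value.

```r
z <- c(1/0, -1/0, 0/0, NA, 42)   # Inf, -Inf, NaN, NA, and an ordinary number
finite.z <- is.finite(z)
finite.z
## [1] FALSE FALSE FALSE FALSE  TRUE
infinite.z <- is.infinite(z)
infinite.z
## [1]  TRUE  TRUE FALSE FALSE FALSE
nan.z <- is.nan(z)
nan.z
## [1] FALSE FALSE  TRUE FALSE FALSE
na.z <- is.na(z)                 # TRUE for both NaN and NA
na.z
## [1] FALSE FALSE  TRUE  TRUE FALSE
```

Note the asymmetry: is.na() treats NaN as missing, but is.nan() does not treat NA as NaN.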

4.4 Data Frames

Commonly, data is rectangular in form, with variables as columns and cases as rows. Continuing with the (contrived) data on weight, gender, and blood pressure medication, each of those variables would be a column of the data set, and each person’s measurements would be a row. In R, such data are represented as a data frame.

healthData <- data.frame(Weight = weight, Gender = gender,
                         bp.meds = bp,
                         stringsAsFactors = FALSE)
healthData
##   Weight Gender bp.meds
## 1    123 female    TRUE
## 2    157 female    TRUE
## 3    202   male   FALSE
## 4    199 female    TRUE
## 5    223   male   FALSE
## 6    140   male   FALSE
## 7    105 female    TRUE
## 8    194   male    TRUE
names(healthData)
## [1] "Weight"  "Gender"  "bp.meds"
colnames(healthData)
## [1] "Weight"  "Gender"  "bp.meds"
names(healthData) <- c("Wt", "Gdr", "bp")
healthData
##    Wt    Gdr    bp
## 1 123 female  TRUE
## 2 157 female  TRUE
## 3 202   male FALSE
## 4 199 female  TRUE
## 5 223   male FALSE
## 6 140   male FALSE
## 7 105 female  TRUE
## 8 194   male  TRUE
rownames(healthData)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"
names(healthData) <- c("Weight", "Gender", "bp.meds")

The data.frame function can be used to create a data frame (although it’s more common to read a data frame into R from an external file, something that will be introduced later). The names of the variables in the data frame are given as arguments, as are the vectors of data that make up the variable’s values. The argument stringsAsFactors=FALSE asks R not to convert character vectors into factors. As of version R 4.0.0, R does not automatically convert character vectors into factors. However, up until this recent version, R would automatically convert strings to factors (i.e., stringsAsFactors = TRUE), and so to avoid confusion we will typically display stringsAsFactors=FALSE throughout most of the book. Names of the columns (variables) can be extracted and set via either names or colnames. In the example, the variable names are changed to Wt, Gdr, bp and then changed back to the original Weight, Gender, bp.meds in this way. Rows can be named also. In this case since specific row names were not provided, the default row names of "1", "2" etc. are used.
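
After creating a data frame it is often worth checking its size and column types. The sketch below does this for a small made-up data frame called pets (the data frame and its columns are hypothetical), using dim(), nrow(), ncol(), and the str() function introduced at the start of the chapter.

```r
pets <- data.frame(name = c("Rex", "Tom", "Babe", "Nemo"),
                   legs = c(4, 4, 4, 0),
                   indoor = c(TRUE, TRUE, FALSE, TRUE),
                   stringsAsFactors = FALSE)
dim(pets)      # number of rows, then number of columns
## [1] 4 3
nrow(pets)
## [1] 4
ncol(pets)
## [1] 3
# str() lists each column with its type and first few values
str(pets)
```

Here str() should report name as a character column, legs as numeric, and indoor as logical, illustrating that the columns of a data frame can have different types.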

In the next example a built-in dataset called mtcars is made available by the data function, and then the first and last six rows are displayed using head and tail.

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02
## Valiant           18.1   6  225 105 2.76 3.460 20.22
##                   vs am gear carb
## Mazda RX4          0  1    4    4
## Mazda RX4 Wag      0  1    4    4
## Datsun 710         1  1    4    1
## Hornet 4 Drive     1  0    3    1
## Hornet Sportabout  0  0    3    2
## Valiant            1  0    3    1
tail(mtcars)
##                 mpg cyl  disp  hp drat    wt qsec vs
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1
##                am gear carb
## Porsche 914-2   1    5    2
## Lotus Europa    1    5    2
## Ford Pantera L  1    5    4
## Ferrari Dino    1    5    6
## Maserati Bora   1    5    8
## Volvo 142E      1    4    2

Note that the mtcars data frame does have non-default row names which give the make and model of the cars.

4.4.1 Accessing Specific Elements of Data Frames

Data frames are two-dimensional, so to access a specific element (or elements) we need to specify both the row and column.

mtcars[1,4]
## [1] 110
mtcars[1:3, 3]
## [1] 160 160 108
mtcars[1:3, 2:3]
##               cyl disp
## Mazda RX4       6  160
## Mazda RX4 Wag   6  160
## Datsun 710      4  108
mtcars[,1]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
## [11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
## [21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

Note that mtcars[,1] returns ALL elements in the first column. This agrees with the behavior for vectors, where leaving a subscript out of the square brackets tells R to return all values. In this case we are telling R to return all rows, and the first column.

For a data frame there is another way to access specific columns, using the $ notation.

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
## [11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
## [21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
mtcars$cyl
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8
## [26] 4 4 4 8 6 8 4
mpg
## Error in eval(expr, envir, enclos): object 'mpg' not found
cyl
## Error in eval(expr, envir, enclos): object 'cyl' not found
weight
## [1] 123 157 202 199 223 140 105 194

Notice that typing the variable name, such as mpg, without the name of the data frame (and a dollar sign) as a prefix, does not work. This is sensible. There may be several data frames that have variables named mpg, and just typing mpg doesn’t provide enough information to know which is desired. But if there is a vector named mpg that is created outside a data frame, it will be retrieved when mpg is typed, which is why typing weight does work, since weight was created outside of a data frame, although ultimately it was incorporated into the healthData data frame.
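
Columns and rows can also be selected by name inside square brackets. The sketch below compares a few equivalent ways of extracting the mpg column of mtcars; the object names by.dollar, by.double, and by.name are invented for the example.

```r
data(mtcars)
# Three ways to extract the mpg column as a numeric vector
by.dollar <- mtcars$mpg
by.double <- mtcars[["mpg"]]
by.name   <- mtcars[, "mpg"]
identical(by.dollar, by.double)
## [1] TRUE
identical(by.dollar, by.name)
## [1] TRUE
# Rows can be selected by row name as well
valiant.mpg <- mtcars["Valiant", "mpg"]
valiant.mpg
## [1] 18.1
# Asking for several columns by name returns a data frame
two.cols <- mtcars[1:2, c("mpg", "hp")]
is.data.frame(two.cols)
## [1] TRUE
```

Selecting by name tends to be more robust than selecting by position, since the code keeps working if the column order changes.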

4.5 Lists

The third main data structure we will work with is a list. Technically a list is a vector, but one in which elements can be of different types. For example a list may have one element that is a vector, one element that is a data frame, and another element that is a function. Consider designing a function that fits a simple linear regression model to two quantitative variables. We might want that function to compute and return several things such as

  • The fitted slope and intercept (a numeric vector with two components)
  • The residuals (a numeric vector with \(n\) components, where \(n\) is the number of data points)
  • Fitted values for the data (a numeric vector with \(n\) components, where \(n\) is the number of data points)
  • The names of the dependent and independent variables (a character vector with two components)

In fact R has a function, lm, which does this (and much more).

mpgHpLinMod <- lm(mpg ~ hp, data=mtcars)
mode(mpgHpLinMod)
## [1] "list"
names(mpgHpLinMod)
##  [1] "coefficients"  "residuals"     "effects"      
##  [4] "rank"          "fitted.values" "assign"       
##  [7] "qr"            "df.residual"   "xlevels"      
## [10] "call"          "terms"         "model"
mpgHpLinMod$coefficients
## (Intercept)          hp 
##    30.09886    -0.06823
mpgHpLinMod$residuals
##           Mazda RX4       Mazda RX4 Wag 
##            -1.59375            -1.59375 
##          Datsun 710      Hornet 4 Drive 
##            -0.95363            -1.19375 
##   Hornet Sportabout             Valiant 
##             0.54109            -4.83489 
##          Duster 360           Merc 240D 
##             0.91707            -1.46871 
##            Merc 230            Merc 280 
##            -0.81717            -2.50678 
##           Merc 280C          Merc 450SE 
##            -3.90678            -1.41777 
##          Merc 450SL         Merc 450SLC 
##            -0.51777            -2.61777 
##  Cadillac Fleetwood Lincoln Continental 
##            -5.71206            -5.02978 
##   Chrysler Imperial            Fiat 128 
##             0.29364             6.80421 
##         Honda Civic      Toyota Corolla 
##             3.84901             8.23598 
##       Toyota Corona    Dodge Challenger 
##            -1.98072            -4.36462 
##         AMC Javelin          Camaro Z28 
##            -4.66462            -0.08293 
##    Pontiac Firebird           Fiat X1-9 
##             1.04109             1.70421 
##       Porsche 914-2        Lotus Europa 
##             2.10991             8.01093 
##      Ford Pantera L        Ferrari Dino 
##             3.71340             1.54109 
##       Maserati Bora          Volvo 142E 
##             7.75761            -1.26198

The lm function returns a list (which in the code above has been assigned to the object mpgHpLinMod). One component of the list is the length 2 vector of coefficients, while another component is the length 32 vector of residuals. The code also illustrates that named components of a list can be accessed using the dollar sign notation, as with data frames.

The list function is used to create lists.

temporaryList <- list(first=weight, second=healthData,
                      pickle=list(a = 1:10, b=healthData))
temporaryList
## $first
## [1] 123 157 202 199 223 140 105 194
## 
## $second
##   Weight Gender bp.meds
## 1    123 female    TRUE
## 2    157 female    TRUE
## 3    202   male   FALSE
## 4    199 female    TRUE
## 5    223   male   FALSE
## 6    140   male   FALSE
## 7    105 female    TRUE
## 8    194   male    TRUE
## 
## $pickle
## $pickle$a
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $pickle$b
##   Weight Gender bp.meds
## 1    123 female    TRUE
## 2    157 female    TRUE
## 3    202   male   FALSE
## 4    199 female    TRUE
## 5    223   male   FALSE
## 6    140   male   FALSE
## 7    105 female    TRUE
## 8    194   male    TRUE

Here, for illustration, I assembled a list to hold some of the R data structures we have been working with in this chapter. The first list element, named first, holds the weight vector we created in Section 4.1, the second list element, named second, holds the healthData data frame, and the third list element, named pickle, holds a list with elements named a and b that hold a vector of values 1 through 10 and another copy of the healthData data frame, respectively. As this example shows, a list can contain another list.

4.5.1 Accessing Specific Elements of Lists

We already have seen the dollar sign notation works for lists. In addition, the square bracket subsetting notation can be used. There is an added, somewhat subtle wrinkle—using either single or double square brackets.

temporaryList$first
## [1] 123 157 202 199 223 140 105 194
mode(temporaryList$first)
## [1] "numeric"
temporaryList[[1]]
## [1] 123 157 202 199 223 140 105 194
mode(temporaryList[[1]])
## [1] "numeric"
temporaryList[1]
## $first
## [1] 123 157 202 199 223 140 105 194
mode(temporaryList[1])
## [1] "list"

Note the dollar sign and double bracket notation return a numeric vector, while the single bracket notation returns a list. Notice also the difference in results below.

temporaryList[c(1,2)]
## $first
## [1] 123 157 202 199 223 140 105 194
## 
## $second
##   Weight Gender bp.meds
## 1    123 female    TRUE
## 2    157 female    TRUE
## 3    202   male   FALSE
## 4    199 female    TRUE
## 5    223   male   FALSE
## 6    140   male   FALSE
## 7    105 female    TRUE
## 8    194   male    TRUE
temporaryList[[c(1,2)]]
## [1] 157

The single bracket form returns the first and second elements of the list, while the double bracket form returns the second element in the first element of the list. Generally, do not put a vector of indices or names in a double bracket; you will likely get unexpected results. See, for example, the results below.

temporaryList[[c(1,2,3)]]
## Error in temporaryList[[c(1, 2, 3)]]: recursive indexing failed at level 2

So, in summary, there are two main differences between using the single bracket [] and double bracket [[]]. First, the single bracket returns a list that holds the object(s) found at the indices or names placed in the bracket, whereas the double bracket returns the actual object held at the index or name placed in the innermost bracket. Second, a single bracket can access a range of list elements, whereas a double bracket can access only a single element.
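
The practical consequence can be sketched with a small made-up list called toy (a hypothetical name): the object returned by double brackets can be used directly in computations, while the one returned by single brackets is still a list.

```r
toy <- list(nums = c(10, 20, 30), word = "hello")
mode(toy["nums"])       # single bracket: a list of length one
## [1] "list"
mode(toy[["nums"]])     # double bracket: the numeric vector itself
## [1] "numeric"
# Arithmetic works on the extracted vector
nums.mean <- mean(toy[["nums"]])
nums.mean
## [1] 20
# A single bracket with several names returns a list of that length
both <- toy[c("nums", "word")]
length(both)
## [1] 2
```

Calling mean(toy["nums"]) instead would return NA with a warning, because the argument is a list rather than a numeric vector.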

4.6 Comparison and logical operators

Comparison operators are binary operators that test a comparative condition between the operands and return a logical value to indicate the test result. We often use comparison operators to gain access to only part of an R object that passes some logical test. You’re likely already familiar with many comparison operators.

The basic idea of comparison operators is quite simple. We have a logical test (e.g., what weights are greater than 200) and want to determine what values in a vector (or some other R object) pass the test. When we apply a comparison operator, the results are logical values that indicate whether or not the specific element in the vector passes the test (TRUE) or not (FALSE).

Let’s walk through the comparison operators available in R. We’ll present the operator and its definition, followed by an example using the weight and gender vectors created in Section 4.1. First, let’s recall the values held in these vectors.

weight
## [1] 123 157 202 199 223 140 105 194
gender
## [1] "female" "female" "male"   "female" "male"  
## [6] "male"   "female" "male"
  1. == the equality operator: The “double equals sign” tests if operands are equal. Below we perform a logical test to determine which gender vector elements equal male.
gender == "male"
## [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE

Not surprisingly, the third, fifth, sixth, and eighth elements return TRUE and all other elements return FALSE. Notice we’re using the == sign, not the = sign. Mixing up the comparison operator == and assignment operator = is a common error.

  2. != the inequality operator: Tests if operands are not equal, and is thus the inverse of ==. We see this by testing which gender vector elements do not equal male.
gender != "male"
## [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
  3. <, <=, >, >= less than, less than or equal to, greater than, and greater than or equal to operators, respectively. Using the weight vector, determine which elements are greater than 194 and then greater than or equal to 194.
weight > 194
## [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
weight >= 194
## [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE

Suppose we want to know which weight vector elements are greater than 194 and less than 210. Answering this question requires use of two comparison operators, i.e., \(<\) and \(>\). In such cases, logical operators are used to combine multiple comparison operations into a single logical statement. We consider the following logical operators: “and”, “or”, “xor”, and “negation”.

Importantly, in order of operation, comparison operators precede logical operators. The Syntax manual page (i.e., run ?Syntax on the Console) lists R operators’ order of operation, where you’ll notice the comparison operators are listed before the logical operators in the precedence groups under the Details Section.
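
A quick way to convince ourselves of this ordering is to compare expressions written with and without parentheses. The sketch below uses a made-up vector w and invented object names; it also shows that, among the logical operators themselves, & is evaluated before |.

```r
w <- c(123, 157, 202, 199)
no.parens   <- w > 150 & w < 200
with.parens <- (w > 150) & (w < 200)
no.parens
## [1] FALSE  TRUE FALSE  TRUE
identical(no.parens, with.parens)
## [1] TRUE
# & is evaluated before |, so this is TRUE | (FALSE & FALSE)
and.first <- TRUE | FALSE & FALSE
and.first
## [1] TRUE
# Parentheses change the result
or.first <- (TRUE | FALSE) & FALSE
or.first
## [1] FALSE
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit to the reader.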

Let’s walk through each of the logical operators:

  1. & the “and” operator: A comparison using the & operator returns TRUE when both operands are TRUE and FALSE otherwise. The & operator works elementwise for operand vectors. Consider the following example.
weight < 210
## [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
weight > 194
## [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
weight < 210 & weight > 194
## [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE

First we show the results of weight < 210 and weight > 194 separately. When combining the comparison operations using the & operator, R first performs weight < 210 and weight > 194, then applies & elementwise on the logical vector operands. The elementwise & returns TRUE when the element in the weight < 210 vector is TRUE and the element in the weight > 194 vector is TRUE. The key point to remember is that & returns TRUE only if both operands are TRUE.

  1. | the “or” operator: A comparison using the | operator returns TRUE if at least one operand is TRUE and FALSE otherwise. Similar to the & operator, the | operator works element by element. Let’s use the same example as before, but now we’ll return individuals with a weight less than 210 or a weight greater than 194.
weight < 210 | weight > 194
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Not surprisingly, this operation returns TRUE for all elements, because all elements in weight are either greater than 194 or less than 210.

  1. xor the “exclusive or” operator: A comparison using the xor operator returns TRUE if exactly one of the operands is TRUE and FALSE otherwise, i.e., it returns FALSE when both operands are TRUE or both are FALSE.
xor(weight < 210, weight > 194)
## [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

While we can imagine cases where this operator would be handy, we’ve never found the occasion to use it in our own code.
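
One way to see what xor does is to build it from the operators already introduced: “a or b, but not both”. The sketch below assumes the same weight vector as above; the names a, b, and by.hand are ours.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)
a <- weight < 210
b <- weight > 194

# TRUE when at least one operand is TRUE, but not when both are
by.hand <- (a | b) & !(a & b)
identical(by.hand, xor(a, b))
## [1] TRUE
```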

  1. ! the “negation” or “not” operator: The exclamation point ! (called “bang” in programmer’s slang) reverses a logical value, i.e. !TRUE is FALSE and !FALSE is TRUE. The code below returns TRUE for weight values not greater than 194 (while not required, the parentheses emphasize the order of operation).
!(weight > 194)
## [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

There are “&&” and “||” variants of “&” and “|”, respectively. These “double” operators are intended for single TRUE or FALSE values rather than element-by-element comparison. Older versions of R silently examined only the first element of each operand vector; since R 4.3.0, an operand of length greater than one produces an error. The && and || operators are useful when writing conditional statements in functions (see, e.g., Chapter 7); however, we’ll generally not use them in this book.
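
As a small sketch of where the double operators fit, the code below uses && inside an if statement, where each side is a single TRUE or FALSE. The variable status and its messages are hypothetical, chosen only for this illustration.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)

# Each side of && is a single logical value
if (length(weight) > 0 && all(weight > 0)) {
  status <- "non-empty, all positive"
} else {
  status <- "empty or has non-positive values"
}
status
## [1] "non-empty, all positive"

# && stops as soon as the answer is known: the right side is never
# evaluated here, so the call to stop() does not raise an error
FALSE && stop("never reached")
## [1] FALSE
```

The second example shows the “short-circuit” behavior that distinguishes && from &, which always evaluates both operands.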

4.6.1 The %in% operator

Suppose we want to identify the weight vector elements equal to 123, 199, or 140. We can do this using the equality operator == and the | operator as follows.

weight
## [1] 123 157 202 199 223 140 105 194
weight == 123 | weight == 199 | weight == 140
## [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE

However, this is a little clunky, involves a lot of typing, and generally makes code hard to read. Lucky for us, R has the “in” operator, %in%, to accomplish this task in a more intuitive and easy-to-read manner.

weight %in% c(123, 199, 140)
## [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE

In the spirit of coding techniques to promote efficient and reproducible code, we’ll use the %in% operator throughout the book.
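
Two further points about %in% can be sketched with the same data. The vector name targets is ours, and the value 500 is an arbitrary number that does not occur in weight.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)
targets <- c(123, 199, 140)  # hypothetical name for the values of interest

# Combine with ! to find elements of weight that are NOT among the targets
!(weight %in% targets)
## [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE

# The result always has the length of the left operand
c(123, 500) %in% weight
## [1]  TRUE FALSE
```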

Comparison and logical operators are invaluable for identifying subsets of data that meet specified conditions. The next section explores how comparison and logical operators facilitate subsetting vectors, data frames, and lists.

4.7 Subsetting with Logical Vectors

Consider the healthData data frame. How can we access only those weights which are more than 200? How can we access the genders of those whose weights are more than 200? How can we compute the mean weight of males and the mean weight of females? Or consider the mtcars data frame. How can we obtain the miles per gallon for all six cylinder cars? Both of these data sets are small enough that it would not be too onerous to extract the values by hand. But for larger or more complex data sets, this would be very difficult or impossible to do in a reasonable amount of time, and would likely result in errors.

R has a powerful method for solving these sorts of problems using a variant of the subsetting methods that we already have learned. When given a logical vector in square brackets, R will return the values corresponding to TRUE. To begin, focus on the weight and gender vectors created in Section 4.1.

The R code weight > 200 returns a TRUE for each value of weight which is more than 200, and a FALSE for each value of weight which is less than or equal to 200. Similarly gender == "female" returns TRUE or FALSE depending on whether an element of gender is equal to female.

weight
## [1] 123 157 202 199 223 140 105 194
weight > 200
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
gender[weight > 200]
## [1] "male" "male"
weight[weight > 200]
## [1] 202 223
gender == "female"
## [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE
weight[gender == "female"]
## [1] 123 157 199 105

Consider the lines of R code one by one.

  • weight instructs R to display the values in the vector weight.
  • weight > 200 instructs R to check whether each value in weight is greater than 200, and to return TRUE if so, and FALSE otherwise.
  • The next line, gender[weight > 200], does two things. First, inside the square brackets, it does the same thing as the second line, namely, returning TRUE or FALSE depending on whether a value of weight is or is not greater than 200. Second, each element of gender is matched with the corresponding TRUE or FALSE value, and is returned if and only if the corresponding value is TRUE. For example the first value of gender is gender[1]. Since the first TRUE or FALSE value is FALSE, the first value of gender is not returned. Only the third and fifth values of gender, both of which happen to be male, are returned. Briefly, this line returns the genders of those people whose weight is over 200 pounds.
  • The fourth line of code, weight[weight > 200], again begins by returning TRUE or FALSE depending on whether elements of weight are larger than 200. Then those elements of weight corresponding to TRUE values are returned. So this line returns the weights of those people whose weights are more than 200 pounds.
  • The fifth line returns TRUE or FALSE depending on whether elements of gender are equal to female or not.
  • The sixth line returns the weights of those whose gender is female.
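
The same idea answers the questions posed at the start of this section. The sketch below recreates weight and gender from the values shown in the output above (an assumption about how they were defined in Section 4.1) and uses the mtcars data frame that ships with R.

```r
weight <- c(123, 157, 202, 199, 223, 140, 105, 194)
gender <- c("female", "female", "male", "female",
            "male", "male", "female", "male")

# Mean weight within each gender
mean(weight[gender == "male"])
## [1] 189.75
mean(weight[gender == "female"])
## [1] 146

# Miles per gallon for the six cylinder cars in mtcars
mtcars$mpg[mtcars$cyl == 6]
## [1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7
```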

4.7.1 Modifying or Creating Objects via Subsetting

The results of subsetting can be assigned to a new (or existing) R object, and subsetting on the left side of an assignment is a common way to modify an existing R object.

weight
## [1] 123 157 202 199 223 140 105 194
light.weight <- weight[weight < 200]
light.weight
## [1] 123 157 199 140 105 194
x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x[x < 5] <- 0
x
##  [1]  0  0  0  0  5  6  7  8  9 10
y <- -3:9
y
##  [1] -3 -2 -1  0  1  2  3  4  5  6  7  8  9
y[y < 0] <- NA
y
##  [1] NA NA NA  0  1  2  3  4  5  6  7  8  9
rm(x)
rm(y)
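
One consequence of placing NA values in a vector is worth a brief sketch: a comparison involving NA is itself NA, and an NA inside the square brackets yields an NA in the result. Wrapping the comparison in which() is one common way (not the only one) to keep just the positions that are TRUE.

```r
y <- -3:9
y[y < 0] <- NA

# NA > 5 is NA, so the three missing values show up in the result
y[y > 5]
## [1] NA NA NA  6  7  8  9

# which() returns only the positions that are TRUE, dropping the NAs
y[which(y > 5)]
## [1] 6 7 8 9
rm(y)
```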

4.7.2 Logical Subsetting and Data Frames

First consider the small and simple healthData data frame.

healthData
##   Weight Gender bp.meds
## 1    123 female    TRUE
## 2    157 female    TRUE
## 3    202   male   FALSE
## 4    199 female    TRUE
## 5    223   male   FALSE
## 6    140   male   FALSE
## 7    105 female    TRUE
## 8    194   male    TRUE
healthData$Weight[healthData$Gender == "male"]
## [1] 202 223 140 194
healthData[healthData$Gender == "female", ]
##   Weight Gender bp.meds
## 1    123 female    TRUE
## 2    157 female    TRUE
## 4    199 female    TRUE
## 7    105 female    TRUE
healthData[healthData$Weight > 190, 2:3]
##   Gender bp.meds
## 3   male   FALSE
## 4 female    TRUE
## 5   male   FALSE
## 8   male    TRUE

The first example is really just subsetting a vector, since the $ notation extracts a column as a vector. The next two examples return subsets of the whole data frame. Note that the logical vector subsets the rows of the data frame, choosing those rows where the gender is female (second example) or the weight is more than 190 (third example). Note also that the specification for the columns (after the comma) is left blank in the second example, telling R to return all the columns. In the third example the second and third columns are requested explicitly.
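
Row conditions can also be combined with the logical operators from Section 4.6. As a sketch, the code below rebuilds healthData from the printed values above (an assumption about how it was originally constructed) and selects the males who weigh more than 190. The name heavy.males is ours.

```r
healthData <- data.frame(
  Weight  = c(123, 157, 202, 199, 223, 140, 105, 194),
  Gender  = c("female", "female", "male", "female",
              "male", "male", "female", "male"),
  bp.meds = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Rows where both conditions hold; the blank after the comma keeps all columns
heavy.males <- healthData[healthData$Gender == "male" & healthData$Weight > 190, ]
heavy.males
##   Weight Gender bp.meds
## 3    202   male   FALSE
## 5    223   male   FALSE
## 8    194   male    TRUE
```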

Next consider the much larger and more complex WorldBank data frame. Recall, the str function displays the “structure” of an R object. Here is a look at the structure of several R objects.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
str(temporaryList)
## List of 3
##  $ first : num [1:8] 123 157 202 199 223 140 105 194
##  $ second:'data.frame':  8 obs. of  3 variables:
##   ..$ Weight : num [1:8] 123 157 202 199 223 140 105 194
##   ..$ Gender : chr [1:8] "female" "female" "male" "female" ...
##   ..$ bp.meds: logi [1:8] TRUE TRUE FALSE TRUE FALSE FALSE ...
##  $ pickle:List of 2
##   ..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
##   ..$ b:'data.frame':    8 obs. of  3 variables:
##   .. ..$ Weight : num [1:8] 123 157 202 199 223 140 105 194
##   .. ..$ Gender : chr [1:8] "female" "female" "male" "female" ...
##   .. ..$ bp.meds: logi [1:8] TRUE TRUE FALSE TRUE FALSE FALSE ...
str(WorldBank)
## 'data.frame':    11880 obs. of  15 variables:
##  $ iso2c                       : chr  "AD" "AD" "AD" "AD" ...
##  $ country                     : chr  "Andorra" "Andorra" "Andorra" "Andorra" ...
##  $ year                        : int  1978 1979 1977 2007 1976 2011 2012 2008 1980 1972 ...
##  $ fertility.rate              : num  NA NA NA 1.18 NA NA NA 1.25 NA NA ...
##  $ life.expectancy             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ population                  : num  33746 34819 32769 81292 31781 ...
##  $ GDP.per.capita.Current.USD  : num  9128 11820 7751 39923 7152 ...
##  $ X15.to.25.yr.female.literacy: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ iso3c                       : chr  "AND" "AND" "AND" "AND" ...
##  $ region                      : chr  "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" "Europe & Central Asia (all income levels)" ...
##  $ capital                     : chr  "Andorra la Vella" "Andorra la Vella" "Andorra la Vella" "Andorra la Vella" ...
##  $ longitude                   : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ latitude                    : num  42.5 42.5 42.5 42.5 42.5 ...
##  $ income                      : chr  "High income: nonOECD" "High income: nonOECD" "High income: nonOECD" "High income: nonOECD" ...
##  $ lending                     : chr  "Not classified" "Not classified" "Not classified" "Not classified" ...

First we see that mtcars is a data frame which has 32 observations (rows) on each of 11 variables (columns). The names of the variables are given, along with their type (in this case, all numeric), and the first few values of each variable are given.

Second we see that temporaryList is a list with three components. Each of the components is described separately, with the first few values again given.

Third we examine the structure of WorldBank. It is a data frame with 11880 observations on each of 15 variables. Some of these are character variables, some are numeric, and one (year) is integer. Looking at the first few values we see that some variables have missing values.

Consider creating a data frame which only has the observations from one year, say 1971. That’s relatively easy. Just choose rows for which year is equal to 1971.

WorldBank1971 <- WorldBank[WorldBank$year == 1971, ]
dim(WorldBank1971)
## [1] 216  15

The dim function returns the dimensions of a data frame, i.e., the number of rows and the number of columns. From dim we see that there are 216 cases from 1971.

Next, how can we create a data frame which only contains data from 1971, and also only contains cases for which there are no missing values in the fertility rate variable? R has a built-in function is.na which returns TRUE if the observation is missing and returns FALSE otherwise. And !is.na returns the negation, i.e., it returns FALSE if the observation is missing and TRUE if the observation is not missing.

WorldBank1971$fertility.rate[1:25]
##  [1]    NA 6.512 7.671 3.517 4.933 3.118 7.264 3.104
##  [9]    NA 2.200 2.961 2.788 4.479 2.260 2.775 2.949
## [17] 6.942 2.210 6.657 2.100 6.293 7.329 6.786    NA
## [25] 5.771
!is.na(WorldBank1971$fertility.rate[1:25])
##  [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [9] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [17]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [25]  TRUE
WorldBank1971 <- WorldBank1971[!is.na(WorldBank1971$fertility.rate),]
dim(WorldBank1971)
## [1] 193  15

From dim we see that there are 193 cases from 1971 with non-missing fertility rate data.
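
The two steps (choosing a year, then dropping missing fertility rates) can also be done at once by combining the conditions with &. Because the WorldBank data are not bundled with R, the sketch below uses a small stand-in data frame, wb.toy, whose values are made up purely for illustration.

```r
# Toy stand-in for WorldBank; countries and values are invented
wb.toy <- data.frame(
  country        = c("A", "B", "C", "D", "E"),
  year           = c(1971, 1971, 1972, 1971, 1972),
  fertility.rate = c(6.5, NA, 3.1, 2.2, NA)
)

# TRUE only for rows from 1971 with a non-missing fertility rate
keep <- wb.toy$year == 1971 & !is.na(wb.toy$fertility.rate)
nrow(wb.toy[keep, ])
## [1] 2
wb.toy$country[keep]
## [1] "A" "D"
```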

Return attention now to the original WorldBank data frame with data not only from 1971. How can we extract only those cases (rows) which have NO missing data? Consider the following simple example:

temporaryDataFrame <- data.frame(V1 = c(1, 2, 3, 4, NA),
                                 V2 = c(NA, 1, 4, 5, NA),
                                 V3 = c(1, 2, 3, 5, 7))
temporaryDataFrame
##   V1 V2 V3
## 1  1 NA  1
## 2  2  1  2
## 3  3  4  3
## 4  4  5  5
## 5 NA NA  7
is.na(temporaryDataFrame)
##         V1    V2    V3
## [1,] FALSE  TRUE FALSE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE
## [5,]  TRUE  TRUE FALSE
rowSums(is.na(temporaryDataFrame))
## [1] 1 0 0 0 2

First notice that is.na will test each element of a data frame for missingness. Also recall that if R is asked to sum a logical vector, it will first convert the logical vector to numeric and then compute the sum, which effectively counts the number of elements in the logical vector which are TRUE. The rowSums function computes the sum of each row. So rowSums(is.na(temporaryDataFrame)) returns a vector with as many elements as there are rows in the data frame. If an element is zero, the corresponding row has no missing values. If an element is greater than zero, the value is the number of variables which are missing in that row. This gives a simple method to return all the cases which have no missing data.

dim(WorldBank)
## [1] 11880    15
WorldBankComplete <- WorldBank[rowSums(is.na(WorldBank)) == 0,]
dim(WorldBankComplete)
## [1] 564  15

Out of the 11880 rows in the original data frame, only 564 have no missing observations!
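
R also provides the function complete.cases(), which returns TRUE for each row with no missing values. As a sketch, the code below checks on the small example data frame that it selects the same rows as the rowSums() approach.

```r
temporaryDataFrame <- data.frame(V1 = c(1, 2, 3, 4, NA),
                                 V2 = c(NA, 1, 4, 5, NA),
                                 V3 = c(1, 2, 3, 5, 7))

complete.cases(temporaryDataFrame)
## [1] FALSE  TRUE  TRUE  TRUE FALSE

# Both approaches keep rows 2, 3, and 4
identical(temporaryDataFrame[complete.cases(temporaryDataFrame), ],
          temporaryDataFrame[rowSums(is.na(temporaryDataFrame)) == 0, ])
## [1] TRUE
```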

4.8 Patterned Data

Sometimes it is useful to generate all the integers from 1 through 20, to generate a sequence of 100 points equally spaced between 0 and 1, etc. The R functions seq() and rep() as well as the “colon operator” : help to generate such sequences.

The colon operator generates a sequence of values with increments of \(1\) or \(-1\).

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
-5:3
## [1] -5 -4 -3 -2 -1  0  1  2  3
10:4
## [1] 10  9  8  7  6  5  4
pi:7
## [1] 3.142 4.142 5.142 6.142
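
Two details of the colon operator are easy to trip over, and a short sketch may help. First, : is evaluated before arithmetic operators such as -. Second, because : can count down, 1:n does not give an empty sequence when n is zero; seq_len() is one alternative that does. The variable n is ours.

```r
n <- 5
1:n - 1      # parsed as (1:n) - 1
## [1] 0 1 2 3 4
1:(n - 1)
## [1] 1 2 3 4

# With n = 0 the colon operator counts down instead of returning nothing
n <- 0
1:n
## [1] 1 0
seq_len(n)   # an empty integer vector
## integer(0)
```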

The seq() function generates either a sequence of pre-specified length or a sequence with pre-specified increments.

seq(from = 0, to = 1, length = 11)
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(from = 1, to = 5, by = 1/3)
##  [1] 1.000 1.333 1.667 2.000 2.333 2.667 3.000 3.333
##  [9] 3.667 4.000 4.333 4.667 5.000
seq(from = 3, to = -1, length = 10)
##  [1]  3.0000  2.5556  2.1111  1.6667  1.2222  0.7778
##  [7]  0.3333 -0.1111 -0.5556 -1.0000
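
The full name of the length argument is length.out; writing length = 11 works because R allows argument names to be abbreviated. The sketch below shows that the same sequence can be requested either by length or by increment, and that with by the sequence stops at the last value that does not pass to. The names s1 and s2 are ours.

```r
s1 <- seq(from = 0, to = 1, length.out = 11)
s2 <- seq(from = 0, to = 1, by = 0.1)
length(s1)
## [1] 11
length(s2)
## [1] 11
all.equal(s1, s2)   # equal up to floating point tolerance
## [1] TRUE

# The end point is not included when the increment steps over it
seq(from = 1, to = 10, by = 4)
## [1] 1 5 9
```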

The rep() function replicates the values in a given vector.

rep(c(1,2,4), length = 9)
## [1] 1 2 4 1 2 4 1 2 4
rep(c(1,2,4), times = 3)
## [1] 1 2 4 1 2 4 1 2 4
rep(c("a", "b", "c"), times = c(3, 2, 7))
##  [1] "a" "a" "a" "b" "b" "c" "c" "c" "c" "c" "c" "c"
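
The rep() function also has an each argument, which repeats every element a fixed number of times before moving to the next. A brief sketch, including how each combines with times:

```r
rep(c(1, 2, 4), each = 2)
## [1] 1 1 2 2 4 4

# each is applied first, then the whole result is repeated
rep(c(1, 2, 4), each = 2, times = 2)
##  [1] 1 1 2 2 4 4 1 1 2 2 4 4
```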

4.8.1 Practice Problem

Often when using R you will want to simulate data from a specific probability distribution (e.g., normal/Gaussian, binomial, Poisson). R has a vast suite of functions for working with statistical distributions. The function that generates values from a statistical distribution has a name beginning with an “r” followed by an abbreviation of the distribution’s name. For example, to simulate from the three distributions mentioned above, we can use the functions rnorm(), rbinom(), and rpois().
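
As a sketch of the naming pattern, the code below draws a few values from a binomial and a Poisson distribution. The seed, sample sizes, and parameter values are arbitrary choices of ours; set.seed() is used only so that the draws can be reproduced.

```r
set.seed(42)  # arbitrary seed, chosen only for reproducibility

# Five draws from a binomial distribution: 10 trials, success probability 0.5
b <- rbinom(n = 5, size = 10, prob = 0.5)
# Five draws from a Poisson distribution with mean 3
p <- rpois(n = 5, lambda = 3)

length(b)
## [1] 5
# Binomial draws count successes, so they lie between 0 and size
all(b >= 0 & b <= 10)
## [1] TRUE
```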

Use the rnorm() function to generate 10,000 values from the standard normal distribution (the normal distribution with mean = 0 and variance = 1). Consult the help page for rnorm() if you need to. Save these values to a vector named sim.vals. Then use the hist() function to draw a histogram of the simulated data. Do the data look like they follow a normal distribution?

4.9 Exercises

Exercise 3 Learning objectives: create, subset, and manipulate vector contents and attributes; summarize vector data using R table() and other functions; generate basic graphics using vector data.

Exercise 4 Learning objectives: use functions to describe data frame characteristics; summarize and generate basic graphics for variables held in data frames; apply the subset function with logical operators; illustrate NA, NaN, Inf, and other special values; recognize the implications of using floating point arithmetic with logical operators.

Exercise 5 Learning objectives: practice with lists, data frames, and associated functions; summarize variables held in lists and data frames; work with R’s linear regression lm() function output; review logical subsetting of vectors for partitioning and assigning of new values; generate and visualize data from mathematical functions.


  1. Technically the objects described in this section are “atomic” vectors (all elements of the same type), since lists, to be described below, also are actually vectors. This will not be an important issue, and the shorter term vector will be used for atomic vectors below.↩︎

  2. Missing data will be discussed in more detail later in the chapter.↩︎

  3. The mode function returns the type or storage mode of an object.↩︎

  4. Try this example using only single brackets\(\ldots\) it will return a list holding elements first, second, and pickle.↩︎