5 Graphics in R Part 1: ggplot2

R can be used to create a vast array of graphical representations of data. Creating “standard” graphical displays is straightforward, but a main strength of R is the ability to customize graphical displays to create either non-standard graphics or to modify more standard graphical displays to create publication-ready versions.

There are several packages available in R for creating graphics. The two leading packages are the graphics package, which comes with your base installation of R, and the ggplot2 package, which must be installed and made available by the user. For beginners ggplot2 has somewhat simpler syntax, and also produces excellent graphics without much tinkering. However, the graphics package seemingly provides more control over different graphical parameters, and can sometimes be more intuitive than ggplot2.

Knowing how to use both the graphics and ggplot2 packages is worthwhile, so we devote one chapter to ggplot2 and a second chapter to graphics. We use ggplot2 throughout the text, and thus require you to read this chapter on ggplot2, while Chapter ?? on the graphics package is optional. You’ll notice that the chapters are essentially the same, producing the same graphs (with a few exceptions) to let you decide which graphing style you prefer. We simply want to present you with both sets of tools so that in the future when you have graphs to produce you can use whichever package floats your boat! For now we’ll start off with ggplot2 and get to graphics in Chapter ??.

The gg in ggplot2 stands for Grammar of Graphics. The package provides a unified and logical way to describe graphical displays such as scatter plots, histograms, bar charts, and many other types of graphics. The grammar describes the mapping from data to the graphical display’s aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars). As will become obvious, once this grammar is mastered for a particular type of plot, such as a scatter plot, it is easy to transfer this knowledge to other types of graphics.
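As a preview of that transferability, here is a minimal sketch. It assumes ggplot2 is already installed (installation is covered in the next section) and uses the built-in iris data. The object names base, hist_plot, and dens_plot are ours, chosen only for illustration. The data and aesthetic mapping are written once, and only the geometric object changes.

```r
library(ggplot2)

# Data and aesthetic mapping, written once
base <- ggplot(data = iris, aes(x = Sepal.Length))

# Same data and mapping, two different geometric objects
hist_plot <- base + geom_histogram(bins = 20)  # bars: a histogram
dens_plot <- base + geom_density()             # a curve: a density estimate

# Printing an object draws the corresponding plot
hist_plot
dens_plot
```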

Once you work through this chapter, the best place to learn more about ggplot2 is the package’s official book by Hadley Wickham (Wickham and Sievert 2016). It is available online in digital format from MSU’s library. The book goes into much more depth on the theory underlying the grammar and syntax, and has many examples of solving practical graphical problems. In addition to the free online version available through MSU, the book’s source code is available at https://github.com/hadley/ggplot2-book.

Another useful resource is the ggplot2 extensions guide http://www.ggplot2-exts.org. This site lists packages that extend ggplot2. It’s a good place to start if you’re trying to do something that seems hard with ggplot2. We’ll explore a few of these extension packages toward the end of this chapter.

5.1 Scatter Plots

Scatter plots are a workhorse of data visualization and provide a good entry point to the ggplot2 system. Begin by considering a simple and classic data set sometimes called Fisher’s Iris Data. These data are available in R.

> data(iris)
> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
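A few base R calls complement str when getting oriented with a data frame. The sketch below uses only base R functions, so nothing here depends on ggplot2.

```r
data(iris)

# Number of rows (plants) and columns (variables)
nrow(iris)   # 150
ncol(iris)   # 5

# Number of plants in each species
table(iris$Species)

# Summary statistics for the two sepal measurements
summary(iris[, c("Sepal.Length", "Sepal.Width")])
```

The species counts show whether the three groups are equally represented, which is worth knowing before judging any classification method.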

The data contain measurements on petal and sepal length and width for 150 iris plants. The plants are from one of three species, and the species information is also included in the data frame. The data are commonly used to test classification methods, where the goal would be to correctly determine the species based on the four length and width measurements. To get a preliminary sense of how this might work, we can draw some scatter plots of length versus width. Recall that ggplot2 is not available by default, so we first have to download and install the package.

> install.packages("ggplot2")

Once this is done the package is installed on the local hard drive, and we can use the library function to make the package available during the current R session.
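Installation is needed only once per machine, while library must be called in every new session. For scripts that may run where ggplot2 is already installed, one common pattern is to install the package only when it is missing. This is a sketch, assuming a CRAN mirror is configured for install.packages.

```r
# Install ggplot2 only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) {
    install.packages("ggplot2")
}

# Attach the package for the current R session
library(ggplot2)
```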

Next a basic scatter plot is drawn. We’ll keep the focus on sepal length and width, but of course similar plots could be drawn using petal length and width. The prompt is not displayed below, since the continuation prompt + can cause confusion.

library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + 
    geom_point()

In this case the first argument to the ggplot function is the name of the data frame. Second, the aes (short for aesthetics) function specifies the mapping to the x and y axes. By itself the ggplot function as written doesn’t tell R what sort of graphical display is desired. That is done by adding a geom (short for geometry) specification, in this case geom_point.
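To see that the geom really is a separate layer, the plot can be built in two steps. This is a minimal sketch: the names p and p_points are ours, and inspecting the layers component is only for illustration, not something needed in routine use.

```r
library(ggplot2)

# Data and mapping only: no geom has been specified yet
p <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
length(p$layers)          # 0 layers so far

# Adding a geom creates the first layer
p_points <- p + geom_point()
length(p_points$layers)   # 1 layer

# Printing the object draws the scatter plot
p_points
```

Storing the base plot in an object like this also makes it easy to try several geoms without retyping the data and mapping.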

Looking at the scatter plot with the goal of classifying the species in mind, two thoughts arise. First, the plot might be improved by increasing the size of the points. Second, using different colors for the points corresponding to the three species would help.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + 
    geom_point(size = 4, aes(color = Species))
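Note the difference between the two arguments to geom_point. Because size = 4 sits outside aes, it is a fixed setting applied to every point. Because color sits inside aes, it is a mapping from the Species variable in the data. The following sketch goes slightly beyond the example above by also mapping species to point shape. That is one option when color alone may not distinguish the groups, not a required step. The name p_species is ours.

```r
library(ggplot2)

p_species <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    # Inside aes(): color and shape vary with the Species variable
    # Outside aes(): size is fixed at 4 for every point
    geom_point(aes(color = Species, shape = Species), size = 4)

# Printing the object draws the plot
p_species
```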