Chapter 13 Databases and R

As we previously saw in Chapter 6, dplyr is a fantastic tool for manipulating data inside the R environment. Up to this point, we have worked with data sets that easily fit within your computer’s memory. Is R capable of working with data sets that are too large to store on your computer? Fortunately R is more than capable of dealing with large data sets (or not so large data sets that are stored remotely in a database). Recently, developers at RStudio have been working on building tools for data scientists to use inside of R to work with databases. Currently there are numerous different ways to work with data from databases in R. In this chapter we will explore some of the many different packages used to query external databases from within R, and we will focus on using dplyr to perform our queries solely in R.

13.1 SQL and Database Structure

In today’s world, there is an extremely large amount of electronic data. The actual amount of existing data is not quite known, as it is pretty much impossible to determine an exact amount. One estimate by IBM stated that 2.5 exabytes (2.5 billion gigabytes) of data were generated every day in 2012. Clearly, we need to have efficient ways to store such incredibly large amounts of data inside computers. In many cases, these data are stored in relational databases.

In simple terms, a database is a collection of similar files. In a relational database, there are numerous data-containing tables that are related to each other by some common field, known as a key. Each table consists of numerous rows and columns, very similar to the way a tibble or any other type of data frame is stored inside of R. One can imagine a database consisting of student records with two tables, one containing the address information of students (called address), and one containing email information of the students (called email). These two tables could perhaps be linked (or joined) together by a common field such as a student identification number. Thus one could join together the two tables to obtain email information and address information for any given student at the same time.

Data analysts and data scientists require a method to do tasks like that described above in databases. Fortunately, almost every relational database is manipulated and stored using SQL, which makes such queries possible.

13.1.1 SQL

SQL62, or Structured Query Language, is an incredibly useful tool for querying and managing relational databases. SQL is common to all major database management systems, such as Oracle, MariaDB, PostgreSQL, MySQL, SQLite, and SQL Server. Each system has slight differences in syntax for certain commands, but the overall basic structure is the same across all database management systems. In order to work with databases, at least a basic understanding of the SQL language is desirable. Here we will give a brief overview of the main structure of a SQL query. If you have minimal SQL background and desire to learn more about databases, we encourage you to use numerous free online resources to help gain experience in using SQL. Some of my favorites include tutorials from W3Schools, tutorialspoint, and SQL Tutorial.

13.1.1.1 A Basic Query

Let’s first explore a little bit of SQL syntax. A basic SQL query consists of three different parts:

1. SELECT: used to select data from a database
2. FROM: tells SQL what table to look at to get the data
3. WHERE: used to filter out different records

Using these three statements you are able to perform queries on the database to extract data from different tables. For example, consider the database mentioned previously consisting of student records. To obtain all the email addresses from the email table, our query would look like this

SELECT email_address
FROM email

If we only desired the email addresses of the student with id = 001, then we would add a condition using the WHERE clause:

SELECT email_address
FROM email
WHERE student_id = '001'

Queries of this basic form are used by database programmers numerous times a day in order to obtain needed information from large databases. There are many more features that can be added to this basic query form, including:

1. AND: added to the WHERE clause, allows for multiple conditions at once
2. JOIN: connect two tables based on a common feature
3. GROUP BY: group data together according to a certain field to obtain statistics, counts, etc.

Does this remind you of anything we previously studied? Maybe something in Chapter 6? If so, you are correct! The manipulation of data sets in R using dplyr provides many of the same sort of manipulation tools used by SQL, and we will soon see how we can use dplyr directly to query external databases.

SQL also has the ability to update the database, insert new records, and delete records from the database. This is done using the UPDATE, INSERT, and DELETE statements. Here are some simple examples for each of these statements:

UPDATE email
WHERE student_id = '001'

DELETE FROM email

INSERT INTO email
VALUES ('001', 'doserjef@msu.edu')

Again, this is nowhere near a full introduction to the SQL language, but it will provide you with enough understanding to work with databases within the R framework. If you are interested in databases, we encourage you to seek out some of the additional resources mentioned at the beginning of this section.

13.2 Difficulties of Working with Large Datasets

The first step in using R with databases is to connect R to the database. This leads to many difficulties. As described by Edgar Ruiz, a solutions engineer at RStudio, when connecting to a database in R there is only a “small conduit” between R and the database that often prevents us from being able to work with the data as fast as we can work with local data frames. In order to work around this there have historically been two different options. The first option is to take data from the database in small chunks, save it in R memory, and then analyze it later. This is not desirable because you aren’t dealing with the data directly, so if the data in the database changes this will not be reflected in the data previously loaded into your R memory. A different option is to load all of the data into R. This allows you to see all of the data, but it takes forever to download and it essentially takes over the R session (and it has the same problems as option one regarding changes in the database).

The main problem with these methods is that they are trying to perform the computations locally in R as opposed to performing computations directly on the data in the database using the SQL Engine. Ideally, we want to be able to analyze the data in place. In other words, we want R to somehow send a SQL query over to the database, perform the query using the powerful SQL engine, and then have the database send back an R data frame that we can manipulate in R. This would allow us to avoid performing computations directly in R, which would improve program speed and reduce the amount of storage needed. In addition, we do not want to go back and forth between using R and SQL, as this can cause a lot of unnecessary confusion. As R programmers, we of course want to write solely R code. Fortunately, researchers at RStudio have increased the capabilities of the dplyr package to allow just that.

13.3 Using dplyr to Query the Database

The dplyr package has recently been updated to allow for better communications with external databases. Specifically, you, as the R programmer, can write code in dplyr acting on a database in the same manner in which you use dplyr to work with a data frame in the R environment. dplyr will then behind the scenes convert this R code into a SQL query, will send this query to the database, use the SQL engine to run the query on the data, and will return these data as a data frame in R.

This is the most efficient method for querying databases in R for three reasons:

1. We are pushing the computation to the database, allowing us to avoid bringing large amounts of data in the database into the R environment unless we specifically ask R to do it.
2. We can use piped code, which greatly enhances the readability of the code.
3. All the code is in R. This reduces the cognitive costs that are often associated when using two or more programming languages at the same time.

Next we will go through an example using SQLite to detail exactly how queries on a sample database are performed.

13.3.1 Example with RSQLite

SQLite is arguably the most widely used relational database management system throughout the world. We will first provide a demonstration using a SQLite database as it is relatively easy to install on your system compared to other database management systems. SQLite is different from most database management systems because you don’t have to set up a separate database server. This makes SQLite great for a demo, and surprisingly it can also be very powerful when working with many gigabytes of data. We will use a sample database provided by SQLite called chinook that contains data on store employees, customers, and the music they purchased.

First, go to http://www.sqlitetutorial.net/sqlite-sample-database/ and download the chinook database into your R working directory so you can follow along with this example yourself.

Upon successful download, the first step we need to do is install/load the necessary packages for connecting to the external database. To help dplyr communicate with the SQL engine, install the dbplyr package using install.packages("dbplyr"). Next we need the RSQLite package for interfacing with the SQLite database. Install this by running install.packages("RSQLite") in the R console. Then load the packages, along with dplyr:

library(dplyr)
library(dbplyr)
library(RSQLite)

We next use the DBI package to connect directly to the database. DBI is a backend package that provides a common interface for R to work with many different database management systems using the same code. This package does much of the communication from R to the database that occurs behind the scenes, and is an essential part of using dplyr to work with databases. If you have not yet installed the DBI package, you can do so using install.packages("DBI").

library(DBI)
chinook <- dbConnect(SQLite(), "chinook.db")

This command creates a reference to the database, and tells R to connect to this database in a specific location (your location could change depending on where you save the chinook.db file). A similar approach could be used to connect to databases of other database management systems like Oracle, MySQL, PostgreSQL, and others. Most databases do not live in a file, but instead live on another server. This causes the above connection to be much more complex, but for now we will focus on this simple case.

Now lets look closer at the chinook database.

src_dbi(chinook)
## src:  sqlite 3.38.5 [/home/jeffdoser/Dropbox/teaching/for875/textbook/chinook.db]
## tbls: albums, artists, customers, employees, genres,
##   invoice_items, invoices, media_types,
##   playlist_track, playlists, sqlite_sequence,
##   sqlite_stat1, tracks

The above function displays the location of the database, as well as the tables contained in the database. You can see in the chinook database there are numerous tables regarding customers, employees, and the music that customers purchased. In total, there are 13 tables contained in this database.

Now that we know the structure of the database, we can perform some simple queries on the data using dplyr syntax. For now, let’s focus on the employees table.

employees <- tbl(chinook, "employees")
employees
## # Source:   table<employees> [?? x 15]
## # Database: sqlite 3.38.5
## #   [/home/jeffdoser/Dropbox/teaching/for875/textbook/chinook.db]
##   EmployeeId LastName FirstName Title         ReportsTo
##        <int> <chr>    <chr>     <chr>             <int>
## 1          1 Adams    Andrew    General Mana…        NA
## 2          2 Edwards  Nancy     Sales Manager         1
## 3          3 Peacock  Jane      Sales Suppor…         2
## 4          4 Park     Margaret  Sales Suppor…         2
## 5          5 Johnson  Steve     Sales Suppor…         2
## 6          6 Mitchell Michael   IT Manager            1
## 7          7 King     Robert    IT Staff              6
## 8          8 Callahan Laura     IT Staff              6
## # … with 10 more variables: BirthDate <chr>,
## #   HireDate <chr>, Address <chr>, City <chr>,
## #   State <chr>, Country <chr>, PostalCode <chr>,
## #   Phone <chr>, Fax <chr>, Email <chr>

Notice how the employees table looks mostly like a regular tibble, but has a couple added lines detailing its location as a remote table in a SQLite database.

Now lets use familiar dplyr commands to perform queries on the database63.

employees %>%
select(LastName, FirstName, Phone, Email) %>%
arrange(LastName)
## # Source:     lazy query [?? x 4]
## # Database:   sqlite 3.38.5
## #   [/home/jeffdoser/Dropbox/teaching/for875/textbook/chinook.db]
## # Ordered by: LastName
##   LastName FirstName Phone             Email
##   <chr>    <chr>     <chr>             <chr>
## 1 Adams    Andrew    +1 (780) 428-9482 andrew@chinookc…
## 2 Callahan Laura     +1 (403) 467-3351 laura@chinookco…
## 3 Edwards  Nancy     +1 (403) 262-3443 nancy@chinookco…
## 4 Johnson  Steve     1 (780) 836-9987  steve@chinookco…
## 5 King     Robert    +1 (403) 456-9986 robert@chinookc…
## 6 Mitchell Michael   +1 (403) 246-9887 michael@chinook…
## 7 Park     Margaret  +1 (403) 263-4423 margaret@chinoo…
## 8 Peacock  Jane      +1 (403) 262-3443 jane@chinookcor…
employees %>%
filter(Title == "Sales Support Agent") %>%
arrange(LastName)
## # Source:     lazy query [?? x 3]
## # Database:   sqlite 3.38.5
## #   [/home/jeffdoser/Dropbox/teaching/for875/textbook/chinook.db]
## # Ordered by: LastName
##   <chr>    <chr>     <chr>
## 1 Johnson  Steve     7727B 41 Ave
## 2 Park     Margaret  683 10 Street SW
## 3 Peacock  Jane      1111 6 Ave SW
employees %>%
group_by(ReportsTo) %>%
summarize(numberAtLocation = n())
## # Source:   lazy query [?? x 2]
## # Database: sqlite 3.38.5
## #   [/home/jeffdoser/Dropbox/teaching/for875/textbook/chinook.db]
##   ReportsTo numberAtLocation
##       <int>            <int>
## 1        NA                1
## 2         1                2
## 3         2                3
## 4         6                2

You can see by using dplyr you are able to use the same syntax to query the database as you used to analyze data frames in Chapter 6. The most important difference to again note is that when working with remote databases the R code is translated into SQL and executed in the database using the SQL engine, not in R. When doing this, dplyr is as “lazy” as possible as it never pulls data in R unless explicity asked. It collects everything you ask it to do and then sends it to the database all in one step. This is often a very useful feature when desiring to look at a certain group of records in a database. To understand this further, take a look at the following code, which at first you might think will output a tibble or data frame:

salesSupportAgents <- employees %>%
filter(Title == "Sales Support Agent") %>%
arrange(LastName)
salesSupportAgents
## # Source:     lazy query [?? x 3]
## # Database:   sqlite 3.38.5
## #   [/home/jeffdoser/Dropbox/teaching/for875/textbook/chinook.db]
## # Ordered by: LastName
##   <chr>    <chr>     <chr>
## 1 Johnson  Steve     7727B 41 Ave
## 2 Park     Margaret  683 10 Street SW
## 3 Peacock  Jane      1111 6 Ave SW

Notice the first two lines in the output. The source is described as a lazy query and the salesSupportAgents is still a database connection and not a local data frame. Because you did not specifically tell R that you wanted to bring the data directly into R, it did not do so. In order to bring the data directly into R in a local tibble you need to use the collect() function. Note that if you are asking the database to send back a lot of data, this could take a while.

salesSupportAgents <- employees %>%
filter(Title == "Sales Support Agent") %>%
arrange(LastName) %>%
collect()
salesSupportAgents
## # A tibble: 3 × 3
##   <chr>    <chr>     <chr>
## 1 Johnson  Steve     7727B 41 Ave
## 2 Park     Margaret  683 10 Street SW
## 3 Peacock  Jane      1111 6 Ave SW

As we’ve seen, for simple tasks, and even many complex tasks, dplyr syntax can be used to query external databases.

13.3.2dbplot

If we can use dplyr to analyze the data in a database, you may be wondering whether or not we can use ggplot2 to graph the data in the database. Of course we can! In fact, the package dbplot is designed to process the calculations of a plot inside a database and output a ggplot2 object. If not already installed on your system, make sure to install the dbplot package before continuing.

We can use the same chinook database from SQLite we were using above. Suppose we desire to see how many types of each employee there are in the database. We can produce a barplot to show this.

library(dbplot)
employees %>%
dbplot_bar(Title)

We first load the dbplot package. Next we produce the bar plot. Notice that we can continue to use the convenient %>% character while producing graphs using dbplot, making the code easy to read. Since dbplot outputs a ggplot2 object, we can further customize the graph using familiar functions from ggplot2 (so long as the package is loaded).

library(ggplot2)
employees %>%
dbplot_bar(Title) +
labs(title = "Employee Types") +
ylab("Count") +
theme_classic()

For a second example, we will utilize a database from Google BigQuery, Google’s fully managed low cost analytics data warehouse. The package bigrquery provides an R interface to Google BigQuery. If not already installed, install the package with our usual method.

We will use data from a great set of sample tables that BigQuery provides to its users. First we will look at the shakespeare table that contains a word index of the different works of Shakespeare. Thus, given the data in the table, let’s use bigrquery, DBI, and dplyr to determine the ten words that appear most often in Shakespeare’s works.

library(bigrquery)
library(DBI)
library(dplyr)
billing <- "for875-databases"
con <- dbConnect(
bigquery(),
project = "publicdata",
dataset = "samples",
billing = billing
)
shakespeare <- con %>%
tbl("shakespeare")
shakespeare %>%
group_by(word) %>%
summarise(n = sum(word_count, na.rm = TRUE)) %>%
arrange(desc(n)) %>%
head(10)
## # Source:     lazy query [?? x 2]
## # Database:   BigQueryConnection
## # Ordered by: desc(n)
##    word      n
##    <chr> <int>
##  1 the   25568
##  2 I     21028
##  3 and   19649
##  4 to    17361
##  5 of    16438
##  6 a     13409
##  7 you   12527
##  8 my    11291
##  9 in    10589
## 10 is     8735

First we load the neceessary packages. We then provide the name of our project to the billing variable. In this case, I created a project called for875-databases for this course. The con variable establishes the connection to the database and the data that we want to look at. Notice the difference in connecting to the Google BigQuery database, which is stored remotely on Google’s servers, as compared to the previous example when connecting to a SQLite database stored on your computer’s hard drive. This example is much more realistic of the connections you would make when accessing external databases. The bigquery() statement tells DBI we will be working with a database from Google BigQuery. The project and dataset options tell DBI what database to look for inside of the Google BigQuery framework. We then supply the name of our project to the billing option to tell BigQuery what account to “bill” (this only costs money if you surpass the monthly limit given by Google).

If you run this code by yourself, you’ll notice that R will ask you to sign in to a google account (your MSU account), and then after you allow the bigrquery package to connect to your account, you will be able to run all the code without a problem.

From here we can establish a connection to a specific table by using the tbl() function as shown above. We can then use dplyr like we did previously to group the data by words, determine the count of each word, and then order it in decreasing order to obtain the top ten words in Shakespeare’s works. Notice that when you run this code you may obtain a message in red describing the total amount of bytes billed. This is how Google manages how much its users are querying and using its data. We’re just tinkering around here, so we won’t surpass the monthly limit on free queries.

This code again displays the benefits of the “lazy evaluation” that R employs. The con variable and shakespeare variable do not store the database or the specific table in R itself, instead they serve as references to the database where the specific data they are referencing is contained. Only when the dplyr query is written and called using the collect() function is any data from the database actually brought into R. R waits till the last possible second (i.e. is lazy) to perform any computations.

A second table in the publicdata project on Google BigQuery contains weather data from NOAA ranging from the years 1929 to 2010. Let’s first make a connection to this table and explore its structure.

weather <- con %>%
tbl("gsod")
dbListFields(con, "gsod")
##  [1] "station_number"
##  [2] "wban_number"
##  [3] "year"
##  [4] "month"
##  [5] "day"
##  [6] "mean_temp"
##  [7] "num_mean_temp_samples"
##  [8] "mean_dew_point"
##  [9] "num_mean_dew_point_samples"
## [10] "mean_sealevel_pressure"
## [11] "num_mean_sealevel_pressure_samples"
## [12] "mean_station_pressure"
## [13] "num_mean_station_pressure_samples"
## [14] "mean_visibility"
## [15] "num_mean_visibility_samples"
## [16] "mean_wind_speed"
## [17] "num_mean_wind_speed_samples"
## [18] "max_sustained_wind_speed"
## [19] "max_gust_wind_speed"
## [20] "max_temperature"
## [21] "max_temperature_explicit"
## [22] "min_temperature"
## [23] "min_temperature_explicit"
## [24] "total_precipitation"
## [25] "snow_depth"
## [26] "fog"
## [27] "rain"
## [28] "snow"
## [29] "hail"
## [30] "thunder"
## [31] "tornado"
weather %>%
select(thunder, mean_wind_speed) %>%
head(10)
## # Source:   lazy query [?? x 2]
## # Database: BigQueryConnection
##    thunder mean_wind_speed
##    <lgl>             <dbl>
##  1 FALSE              8
##  2 FALSE             17
##  3 FALSE              2.20
##  4 FALSE             11.3
##  5 TRUE               7
##  6 FALSE             13
##  7 TRUE              19.7
##  8 TRUE               0.5
##  9 FALSE              4.30
## 10 FALSE              5

First we establish a connection to the weather table. We then use the dbListFields() function from the DBI package to display the fields in the table. We then view the first 10 records in the table. We see that there are 31 different variables on information for the weather on a given day from 1929 to 2010. Let’s say we are interested in determining the total number of days where the mean_temp > 60 and mean_wind_speed > 10.

weather %>%
filter(mean_temp > 60 & mean_wind_speed > 10) %>%
summarize(count = n())
## # Source:   lazy query [?? x 1]
## # Database: BigQueryConnection
##     count
##     <int>
## 1 6581146

Further, suppose we are interested in determining how the average wind speed has changed from 1929 to 2010. We can use dbplot to plot the data.

weather %>%
dbplot_line(year, average_wind_speed = mean(mean_wind_speed, na.rm = TRUE))

Upon first glance at this plot, a naive student might think Wow! Wind speeds have decreased dramatically since the 1930s’’, and just accept this as true since that is what the graph shows. But since we are all data analysts at heart (otherwise you wouldn’t be taking this course!), we want to explore this further. We might ask why the wind was so high in the late 1920s and early 1930s? An average wind speed of above 12 miles per hour seems pretty high. Let’s explore the entries for all years in the 1930s and 1920s by looking at how many records there are with a mean wind speed for each year before 1940.

weather %>%
filter(year < 1940 & !is.na(mean_wind_speed)) %>%
group_by(year) %>%
summarize(count = n())
## # Source:   lazy query [?? x 2]
## # Database: BigQueryConnection
##     year count
##    <int> <int>
##  1  1929  2037
##  2  1930  7101
##  3  1933 17708
##  4  1938 51770
##  5  1931  9726
##  6  1934 20334
##  7  1935 26829
##  8  1939 65623
##  9  1932 10751
## 10  1936 50514
## # … with more rows

Interesting! We see that there are only 2037 records from 1929, while there are 65623 records from 1939. This suggests we could be experiencing a phenomenon known as sampling bias in which the 2037 records from 1929 are not a valid random representation of all the wind speeds occurring in 1929. Or potentially the wind was just much higher during that time due to other factors we aren’t exploring here. Determining the true cause of this pattern requires further investigation, which we leave to you if you so desire.

13.4 Changing Records in the Database

dplyr is fantastic for retrieving data from databases and analyzing it using familiar functions. As detailed in this video, using dplyr is considered a “best practice” when using databases in R because you only have to use one language, you don’t need to know any SQL, and you can use the lazy evaluation of R, among many other reasons. As a data analyst, a majority of work can be accomplished by using dplyr and other forms of queries against a database. But what if you need to do more than query the data in the database and you have to change some of the records? dplyr is not capable of inserting or deleting records in a database, and although updating recordings is possible with dplyr, Hadley Wickham strongly recommends against it, as dplyr is designed specifically for retrieving data. Thus, we need to use other options for updating/inserting/deleting records in a database. Unlike using dplyr this requires a more in depth knowledge of SQL. We will not discuss updating/inserting/deleting records in a database here, but you will briefly learn about methods to do this in this chapter’s exercise.

13.5 R Studio Connections Pane

If by this point in the book you haven’t already been convinced that RStudio is a fantastic IDE for running R then perhaps this will convince you. R Studio has a Connections Pane by default in the top right corner of your screen that makes it easy to connect to a variety of data sources, and better yet, explore the objects and data inside the connection. You can use it with a variety of different packages for working with databases in R, and it will also work with a variety of databases and other datasources (i.e. Spark). This Connections Pane is a great way of exploring the data in a data source once you are connected. If interested, the page here which provides an overview of the connections pane and some other useful links of potential interest.

13.6 Further Resources

There are numerous other methods besides dplyr to work with databases in R that involve using more standard SQL queries. If you come from a SQL background and want to use these other methods, then explore the RStudio Databases using R page. This website will be an essential resource for you if you get more involved with using databases within the R environment.

13.7 Exercises

Exercise Databases Learning objectives: connect to an external database; perform simple queries using dplyr; use data from a database in a Shiny app; learn how to perform changes (update/delete/insert) on a database

1. The pronunciation of SQL is highly debated. I prefer to pronounce it like the word “sequel”, but many others prefer to say it as “ess-que-ell”. See https://softwareengineering.stackexchange.com/questions/8588/whats-the-history-of-the-non-official-pronunciation-of-sql for a good debate↩︎

2. You may see a warning here. You can safely ignore it.↩︎