This lesson is in the early stages of development (Alpha version)

R for Plotting

Overview

Teaching: 90 min
Exercises: 30 min
Questions
  • How do I read data into R?

  • What are geometries and aesthetics?

  • How can I use R to create and save professional data visualizations?

Objectives
  • To be able to read in data from csv files.

  • To create plots with both discrete and continuous variables.

  • To understand mapping and layering using ggplot2.

  • To be able to modify a plot’s color, theme, and axis labels.

  • To be able to save plots to a local directory.

Contents

  1. The “goal” of the workshop
  2. Overview of the lesson
  3. Directory structure
  4. Loading and reviewing data
  5. Our first plot
  6. Plotting for data exploration
  7. Applying it to your own data
  8. Glossary of terms

The “goal” of the workshop

Our goal is to write a report to the United Nations on the relationship between lung cancer, smoking, and air pollution. In other words, we are going to analyze how countries’ smoking rates and air pollution may be related to the percent of people with lung cancer.

To get to that point, we’ll need to learn how to manage data, make plots, and generate reports. The next section discusses in more detail exactly what we will cover.

Overview of the lesson

In this lesson, we will go over how to read tabular data into R (e.g. from a csv file) and plot it for exploratory data analysis.

Exercise: Create a new R Script file

We would like to create a file where we can keep track of our R code. On your own, create a file called plotting.R in the project directory.

Solution

Navigate to the “File” menu in RStudio. You’ll see the first option is “New File.” Selecting “New File” opens another menu to the right, and the first option is “R Script.” Select “R script.” Alternatively, you can click on the white square button with a green plus sign in the upper left corner and select “R Script.”

Now you have an untitled R Script in your Editor tab. Save this file as plotting.R in our project directory.

Directory structure

Exercise: File organization

  1. When you’re working on a project, how do you organize your files?
  2. Take a look at your un-report directory. You should be able to see it in the bottom right side of your screen under the “Files” tab. What folders are there, and why do you think they’re there?

Solution

There are lots of different ways to organize files, but you should have some consistent method of organizing them so that it’s easy to find what you want. If you have all of your files in one folder, it can get kind of confusing to find what you need. In the un-report directory there are three “sub-directories”: data, figures, and reports.

  • data contains all of the data that we will need for the workshop.
  • figures is where we will save the figures we generate during the workshop.
  • reports is where we will save our final report.

Loading and reviewing data

The tidyverse vs Base R

If you’ve used R before, you may have learned commands that are different than the ones we will be using during this workshop. We will be focusing on functions from the tidyverse. The “tidyverse” is a collection of R packages that have been designed to work well together and offer many convenient features that do not come with a fresh install of R (aka “base R”). These packages are very popular and have a lot of developer support including many staff members from RStudio. These functions generally help you to write code that is easier to read and maintain. We believe learning these tools will help you become more productive more quickly.

First, we’re going to load the tidyverse package. Packages are useful because they contain pre-made functions to do specific tasks. Tidyverse contains a set of functions that makes it easier for us to do complex analyses and create professional visualizations in R. The way we access all of these useful functions is by running the following command:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

When you loaded the tidyverse package, you probably got a message like the one we got above. This isn’t an error! These messages are just giving you more information about what happened when you loaded tidyverse. For now, we don’t have to worry about the messages. You can read the bonus section below for more details.

Bonus: What’s with all those messages???

The tidyverse messages give you more information about what happened when you loaded tidyverse. The tidyverse is actually a collection of several different packages, so the first section of the message tells us what packages were installed when we loaded tidyverse (these include ggplot2, which we’ll be using a lot in this lesson, and dyplr, which you’ll be introduced to tomorrow in the R for Data Analysis lesson).

The second section of messages gives a list of “conflicts.” Sometimes, the same function name will be used in two different packages, and R has to decide which function to use. For example, our message says that:

dplyr::filter() masks stats::filter()

This means that two different packages (dyplr from tidyverse and stats from base R) have a function named filter(). By default, R uses the function that was most recently loaded, so if we try using the filter() function after loading tidyverse, we will be using the filter() function from dplyr().

Okay, now let’s read in our data, smoking_cancer_1990.csv. To do this, we need to know the file path, which tells R where to find the file on your computer. When you have a project open in R, it starts looking from your main project folder, in our case un-report. Inside un-report, we have a folder called data, and in that folder is the smoking_cancer_1990.csv file. This is the file that contains the data that we want to plot. So the file path from our main project directory is: data/smoking_cancer_1990.csv. The / tells R that the file is in the data directory.

We’re going to use the read_csv() function that we loaded in with the tidyverse, and save it to smoking_1990, which will act as a placeholder for our data. This function takes a file path and returns a tibble, which is basically a table (that we sometimes call a data frame…).

smoking_1990 <- read_csv("data/smoking_cancer_1990.csv")
Rows: 191 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, pop, smoke_pct, lung_cancer_pct

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A few things printed out to the screen: it tells us how many rows and columns are in our data, and information about each of the columns. Each row contains the continent (“continent”), the total population (“pop”), the age-standardized percent of people who smoke (“smoke_pct”), and the age-standardized percent of people who have lung cancer (“lung_cancer_pct”) for a given country (“country”). We can see that two of the columns are characters (categorical variables), and three are doubles (numbers).

Bonus: Characters vs. factors

Note: In anything before R 4.0, categorical variables used to be read in as factors, which are special data objects that are used to store categorical data and have limited numbers of unique values. The unique values of a factor are tracked via the “levels” of a factor. A factor will always remember all of its levels even if the values don’t actually appear in your data. The factor will also remember the order of the levels and will always print values out in the same order (by default this order is alphabetical).

If your columns are stored as character values but you need factors for plotting, ggplot will convert them to factors for you as needed.

Now let’s look at the data a bit more. In the Environment tab in the upper right corner of RStudio, you will now see smoking_1990 listed. If you click on it, it will pop up in a tab next to your script.

After we’ve reviewed the data, you’ll want to make sure to click the tab in the upper left to return to your plotting.R file so we can start writing some code.

Another way to look at the data is to print it out to the console:

smoking_1990
# A tibble: 191 × 6
    year country             continent          pop smoke_pct lung_cancer_pct
   <dbl> <chr>               <chr>            <dbl>     <dbl>           <dbl>
 1  1990 Afghanistan         Asia          12412311      3.12          0.0127
 2  1990 Albania             Europe         3286542     24.2           0.0327
 3  1990 Algeria             Africa        25758872     18.9           0.0118
 4  1990 Andorra             Europe           54508     36.6           0.0609
 5  1990 Angola              Africa        11848385     12.5           0.0139
 6  1990 Antigua and Barbuda North America    62533      6.80          0.0105
 7  1990 Argentina           South America 32618648     30.4           0.0344
 8  1990 Armenia             Europe         3538164     30.5           0.0441
 9  1990 Australia           Oceania       17065100     29.3           0.0599
10  1990 Austria             Europe         7677850     35.4           0.0439
# ℹ 181 more rows

The read_csv() function took the file path we provided, did who-knows-what behind the scenes, and then outputted an R object with the data stored in that csv file. All that, with one short line of code!

Data objects

There are many different ways to store data in R. Most objects have a table-like structure with rows and columns. We will refer to these objects generally as “data objects”. If you’ve used R before, you many be used to calling them “data frames”. Functions from the tidyverse such as read_csv() work with objects called “tibbles”, which are a specialized kind of “data frame.” Another common way to store data is a “data table”. All of these types of data objects (tibbles, data frames, and data tables) can be used with the commands we will learn in this lesson to make plots. We may sometimes use these terms interchangeably.

Bonus Exercise: Reading in an excel file

Say you have an excel file and not a csv - how would you read that in? Hint: Use the Internet to help you figure it out!

Solution

One way is using the read_excel function in the readxl package. Hint: you may need to use install.packages() to install the readxl package. There are other ways to read in excel files, but this is our preferred method because the output will be the same as the output of read_csv.

Our first plot

Creating our first plot

Back to top

We will be using the ggplot2 package, which is part of the tidyverse, to make our plots. This is a very powerful package that creates professional looking plots and is one of the reasons people like using R so much.

When making a plot, you first have to come up with a question you wish to answer related to your data. Here, we are interested in whether there is a relationship between the percent of people who smoke and the percent of people with lung cancer.

What do we want to plot?

Given that we are interested in whether there is a relationship between the percent of people who smoke and the percent of people with lung cancer:

  1. What variables would you want to put on the x and y axes?
  2. What columns do those variables correspond to in our smoking_1990 dataset?
  3. What type of plot would you want to make?

Hint: take a look at this blog post if you need ideas about which plot type might be good.

Solution

A scatter plot with percent of people who smoke (smoke_pct column in our data) on the x axis and percent of people with lung cancer (lung_cancer_pct column in our data) on the y axis will allow you to visualize the correlation between these two variables.

Now that we’ve figured out what we want to plot and what columns in our dataset we need to use, let’s get started! All plots made using the ggplot2 package start by calling the ggplot() function. In the tab you created for the plotting.R file, type the following:

ggplot(data=smoking_1990)

To run code that you’ve typed in the editor, you have a few options. Remember that the quickest way to run the code is by pressing Ctrl+Enter on your keyboard. This will run the line of code that currently contains your cursor or any highlighted code.

When we run this code, the Plots tab will pop to the front in the lower right corner of the RStudio screen. Right now, we just see a big grey rectangle.

What we’ve done is created a ggplot object and told it we will be using the data from the smoking_1990 object that we’ve loaded into R. We’ve done this by calling the ggplot() function with smoking_1990 as the data argument.

So we’ve made a plot object, now we need to start telling it what we actually want to draw on this plot.

The elements of a plot have a bunch of properties like an x and y position, a size, a color, etc. When creating a data visualization, we can map variables in our dataset to these properties, called aesthetics, in our plot. In ggplot, we can do this by creating an “aesthetic mapping”, which we do with the aes() function.

To create our plot, we need to map variables from our smoking_1990 object to ggplot aesthetics using the aes() function. Since we have already told ggplot that we are using the data in the smoking_1990 object, we can access the columns of smoking_1990 using the object’s column names. (Remember, R is case-sensitive, so we have to be careful to match the column names exactly!)

Let’s start by telling our plot object that we want to map our smoking values to the x axis of our plot.
We do this by adding (+) information to our plot object. Add this new line to your code and run both lines by highlighting them and pressing Ctrl+Enter on your keyboard:

ggplot(data = smoking_1990) +
  aes(x = smoke_pct)

Note that we’ve added this new function call to a second line just to make it easier to read. To do this we make sure that the + is at the end of the first line otherwise R will assume your command ends when it starts the next row. The + sign indicates not only that we are adding information, but to continue on to the next line of code.

Observe that our Plot window is no longer a grey square. We now see that we’ve mapped the smoke_pct column to the x axis of our plot. Note that that column name isn’t very pretty as an x-axis label, so let’s add the labs() function to make a nicer label for the x axis:

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke")

Quotes vs. No Quotes (refresher)

Notice that when we added the label value we did so by placing the values inside quotes. This is because we are not using a value from inside our data object - we are providing the name directly. When you need to include actual text values in R, they will be placed inside quotes to tell them apart from other object or variable names.

The general rule is that if you want to use values from the columns of your data object, then you supply the name of the column without quotes, but if you want to specify a value that does not come from your data, then use quotes.

Mapping lung cancer rates to the y axis

Map our lung_cancer_pct values to the y axis and give them a nice label.

Solution

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer")

Excellent. We’ve now told our plot object where the x and y values are coming from and what they stand for. But we haven’t told our object how we want it to draw the data. There are many different plot types (bar charts, scatter plots, histograms, etc). We tell our plot object what to draw by adding a geometry (“geom” for short) to our object. We will talk about many different geometries today, but for our first plot, let’s draw our data using the “points” geometry for each value in the data set. To do this, we add geom_point() to our plot object:

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point()

Now we’re really getting somewhere. It finally looks like a proper plot! We can now see a trend in the data. It looks like countries with greater smoking rates tend to have higher lung cancer rates, though it’s important to remember that we can’t infer causality from this plot alone.

Let’s add a title to our plot to make that clearer. Again, we will use the labs() function, but this time we will use the title = argument.

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?")

No one can deny we’ve made a very handsome plot! We can immediately see that there is a positive association between lung cancer rates and smoking rates. But now looking at the data, we might be curious about learning more about the points that are the extremes of the data. We know that we have two more pieces of data in the smoking_1990 object that we haven’t used yet. Maybe we are curious if the different continents show different patterns in smoking rates and lung cancer rates. One thing we could do is use a different color for each of the continents. To map the continent of each point to a color, we will again use the aes() function.

Color the points by continent

To color the points by continent, you will need to add that to your aesthetic function. Fill in the blank below with the correct column from your data to make this happen.

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(_____)

What information can you learn from the plot?

Solution

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
 aes(color = continent)

Here we can see that in 1990 the African countries tended to have much lower smoking and lung cancer rates than many other continents.

Notice that when we add a mapping for color, ggplot automatically provided a legend for us. It took care of assigning different colors to each of our unique values of the continent variable. (Note that when we mapped the x and y values, those drew the actual axis labels, so in a way the axes are like the legends for the x and y values).

The colors that ggplot uses are determined by the color “scale”. Each aesthetic value we can supply (x, y, color, etc) has a corresponding scale. Let’s change the colors to make them a bit prettier. While we’re at it, let’s capitalize the legend title too.

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(color = continent) +
  scale_color_brewer(palette = "Set2") +
  labs(color = "Continent")

The scale_color_brewer() function is just one of many you can use to change colors. There are bunch of “palettes” that are build in. You can view them all by running RColorBrewer::display.brewer.all() or check out the Color Brewer website for more info about choosing plot colors. Check out the bonus exercise below for even more options.

Bonus Exercise: Changing colors

There are lots of ways to change colors when using ggplot. The scale_color_brewer() function is one of many you can use to change colors. There are bunch of “palettes” that are build in. You can view them all by running RColorBrewer::display.brewer.all() or check out the Color Brewer website for more info about choosing plot colors. There are also lots of other fun options:

Play around with different color palettes. Feel free to install another package and choose one of those if you want. Pick your favorite!

Solution

You can use RColorBrewer::display.brewer.all() to pick a color palette. As a bonus, you can also use one of the packages listed above. Here’s an example:

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(color = continent) +
  labs(color = "Continent") +
  scale_color_viridis_d(option = "turbo")

Since we have the data for the population of each country, we might be curious about the relationship between population, smoking rates, and lung cancer rates. Do you think larger countries will have a greater or lower lung cancer rate? Let’s find out by mapping the population of each country to the size of our points.

Changing point sizes

Map the population of each country to the size of our points. HINT: Is size an aesthetic or a geometry? If you’re stuck, feel free to Google it, or look at the help menu.

Solution

ggplot(data = smoking_1990) + aes(x = smoke_pct) + labs(x = “Percent of people who smoke”) + aes(y = lung_cancer_pct) + labs(y = “Percent of people with lung cancer”) + geom_point() + labs(title = “Are lung cancer rates associated with smoking rates?”) + aes(color = continent) + scale_color_brewer(palette = “Set2”) + labs(color = “Continent”) ```

There doesn’t seem to be a very strong association with population size. We also got another legend here for size, which is nice, but the values look a bit ugly in scientific notation. Let’s divide all the values by 1,000,000 and label our legend “Population (in millions)”

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(color = continent) +
  scale_color_brewer(palette = "Set2") +
  labs(color = "Continent") +
  aes(size = pop/1000000) +
  labs(size = "Population (in millions)")

This works because you can treat the columns in the aesthetic mappings just like any other variables and can use functions to transform or change them at plot time rather than having to transform your data first.

Good work! Take a moment to appreciate what a cool plot you made with a few lines of code. To fully view its beauty you can click the “Zoom” button in the Plots tab - it will break free from the lower right corner and open the plot in its own window.

Bonus Exercise: Changing shapes

Instead of (or in addition to) color, change the shape of the points so each continent has a different shape. (I’m not saying this is a great thing to do - it’s just for practice!) HINT: Is shape an aesthetic or a geometry? If you’re stuck, feel free to Google it, or look at the help menu.

Solution

You’ll want to use the aes aesthetic function to change the shape:

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(color = continent) +
  scale_color_brewer(palette = "Set2") +
  labs(color = "Continent") +
  aes(size = pop) +
  aes(size = pop/1000000) +
  labs(size = "Population (in millions)") +
  aes(shape = continent)

For our first plot we added each line of code one at a time so you could see the exact affect it had on the output. But when you start to make a bunch of plots, we can actually combine many of these steps so you don’t have to type as much. For example, you can collect all the aes() statements and all the labs() together. A more condensed version of the exact same plot would look like this:

ggplot(data = smoking_1990) +
  aes(x = smoke_pct, y = lung_cancer_pct, color = continent, size = pop/1000000) +
  geom_point() +
  scale_color_brewer(palette = "Set2") +
  labs(x = "Percent of people who smoke", y = "Percent of people with lung cancer",
    title = "Are lung cancer rates associated with smoking rates?", color = "Continent", size = "Population (in millions)")

Storing our plot

We learned about how to save things to object names in the previous lesson. We can do the same thing with plots! Store our final plot in an object called cancer_v_smoke.

Solution

 cancer_v_smoke <- ggplot(data = smoking_1990) +
   aes(x = smoke_pct, y = lung_cancer_pct, color = continent, size = pop/1000000) +
   geom_point() +
   scale_color_brewer(palette = "Set2") +
   labs(x = "Percent of people who smoke", y = "Percent of people with lung cancer",
     title = "Are lung cancer rates associated with smoking rates?", color = "Continent", size = "Population (in millions)")

Saving our first plot

Back to top

Let’s say you want to share your plot with friends or co-workers who aren’t running R. It’s wise to keep all the code you used to draw the plot, but sometimes you need to make a PNG or PDF version of the plot so you can share it with your PI or post it to your Instagram story.

To save your plot, you can use the ggsave() function. A few things about ggsave() (the good and the bad):

  1. [The bad] By default, ggsave() will save the last plot you made, but this can get confusing so it’s best to create a plot object and then save the specific plot you’re interested in instead.
  2. [The neutral] The default width and height are sometimes not great options, but you can supply width= and height= arguments to change them (the default values are in inches).
  3. [The good] It will determine the file type based on the name you provide.

Let’s save our plot (with an informative name) as a 4x6 inch png:

ggsave(filename = "figures/cancer_v_smoke.png", plot = cancer_v_smoke, width = 6, height = 4)

Debugging code

Debugging is the process of finding and fixing errors or unexpected outputs in your code. Even well seasoned coders run into bugs all the time.

Here are some strategies of how programmers try to deal with coding errors:

  1. Don’t panic. Bugs are a normal part of the coding process.
  2. If you are getting an error message, read the error message carefully. Unfortunately, not all error messages are well written and it may not be obvious at first what is wrong.
  3. Check for typos.
    1. Check that your parentheses and quotes are balanced and check that you haven’t misspelled a variable or function name, or used the wrong one.
    2. It’s difficult to identify the exact location where an error starts so you may have to look at lines before the line where the error was reported.
    3. In RStudio, look at the code coloring to find anything that looks off. RStudio will also put a red x or an yellow exclamation point to the left of lines where there is a syntax error.
  4. Try running each command on its own.
    1. Before each command, check that you are passing the values you expect.
    2. After each command, verify that the results seem sensible.
  5. If you’re getting an error, search online for the error message along with the function that is not working.

Consider checking out the following resources to learn more about it.

Understanding common bugs

Sometimes you accidentally type things wrong and get unexpected results or errors. We call these mis-types “bugs”. Let’s go through some common ones. The most important things to remember are:

  1. The order of parentheses, quotes, commas, and plusses matters.
  2. Sometimes you accidentally forget a plus where you need one or include one where you don’t.

For each of the examples below, figure out what the bug is and how to fix it. Feel free to copy/paste into RStudio to help you figure it out.

# Bug 1

 ggplot(data = smoking_1990) +
 aes(x = "smoke_pct", y = lung_cancer_pct, color = continent, size = pop/1000000) +
   geom_point() +
   labs(x = "Percent of people who smoke", y = "Percent of people with lung cancer",
        title = "Are lung cancer rates associated with smoking rates?",
       color = "Continent", size = "Population (in millions)") 

# Bug 2

ggplot(data = smoking_1990) +
   aes(x = smoke_pct, y = lung_cancer_pct, color = continent, size = pop/1000000) +
   geom_point() +
   labs(x = "Percent of people who smoke", y = "Percent of people with lung cancer",
        title = "Are lung cancer rates associated with smoking rates?",
       color = "Continent", size = "Population (in millions)"))

# Bug 3
ggplot(data = smoking_1990) +
   aes(x = smoke_pct, y = lung_cancer_pct, color = continent, size = pop/1000000) +
   geom_point() +
   labs(x = "Percent of people who smoke", y = "Percent of people with lung cancer",
        title = "Are lung cancer rates associated with smoking rates?",
       color = "Continent" size = "Population (in millions)")

Solution

Bug 1: We generated a plot, but it doesn’t look like what we expect. The bug is in our mapping of aesthetics: geom_point(aes(x = "smoke_pct", y = lung_cancer_pct, color = continent, size = pop/1000000)). Because "smoke_pct" is in quotation marks, ggplot understands that as a single value, rather than an aesthetic mapped to the smoke_pct variable in the smoking_1990 dataset. To correct this bug, remove the quotes from "smoke_pct" so that ggplot looks for the smoke_pct column in our dataset.

Bug 2: This code generates the following error:

Error: unexpected ')' in:
"      scale_color_brewer(palette = "Set2") +
     labs(x = "Percent of people who smoke", y = "Percent of people  with lung cancer", title = "Are lung cancer rates associated with smoking rates?", size = "Population (in millions)"))"

Although it might be alarming to get this error, it’s actually quite helpful! You can see that the error points out that we have an unexpected closed parentheses “)” in the last two lines of our code. Look closely, and you’ll see that we accidentally put an extra “)” on the labs() layer in the last line of code.

Bug 3: This code generates the following error:

Error: unexpected symbol in:
"                       title = "Are lung cancer rates associated with smoking rates?",
                      color = "Continent" size"

This error message tells us that there was something unexpected in the labs() funtion on either the line where we specified the title or the color. These errors can be some of the hardest to figure out, because the message is not very specific. However, if you look closely, yu will see that we are missing a comma between color = "Continent" and size = "Population (in millions)" inside the label function.

Recap of what we’ve learned so far

Back to top

Now that we’ve made our first plot, let’s review the most important things to remember when plotting with ggplot. Making plots using ggplot is all about layering on information.

  1. First, you have to give the ggplot() function your data.
    1. It looks in this data for the information in the columns you tell it to use for your plot.
  2. Then, you have to tell ggplot what specific information from your data you want to plot and how you want that data to show up.
    1. You use the aes() function for this.
    2. Inside this function you tell it how you want the data to show up (on the x axis, on the y axis, as a color, etc.) and where that data is coming from (the column name in your datset).
  3. Finally, you have to tell ggplot what type of plot you want to make.
    1. All ggplot plot types start with the word geom (e.g. geom_point()).
  4. You can customize the labels and colors on your plot to make them nicer and more informative.
    1. There’s a lot more you can customize as well. We will go into some of this later on in the lesson.

This is a lot to remember!

Pro-tip

Those of us that use R on a daily basis use cheat sheets to help us remember how to use various R functions. You can find the cheat sheets in RStudio by going to the “Help” menu and selecting “Cheat Sheets”. The ones that will be most helpful in this workshop are “Data visualization with ggplot2”, “Data Transformation with dplyr”, “R Markdown Cheat Sheet”, and “R Markdown Reference Guide”.

For things that aren’t on the cheat sheets, Google is your best friend. Even expert coders use Google when they’re stuck or trying something new!

Let’s take a moment to orient ourselves to our “Data visualization with ggplot2” cheat sheet. What we just went over is summarized in the “Basics” section in the upper left hand side of the front page of your cheat sheet. The other sections contain more information about different geometries and aesthetics you can use in your plots. We will go over some of these in the next section.

Bonus Exercise: Make your own scatter plot

Now create your own scatter plot comparing population and percent of people who smoke. Looking at your plot, can you guess which two countries have the largest populations?

If you have extra time, customize your plot however you want. If there’s something you want to do but don’t know how, try searching on the internet for it.

Solution

ggplot(data = smoking_1990) +
  aes(x = pop, y = smoke_pct) +
  geom_point()

(China and India are the two countries with large populations.)

Plotting for data exploration

Back to top

Now that we’ve made our first plot, we’re going to dig into other ways to visualize data using ggplot. The main goal here is to find meaningful patterns in complex data and create visualizations to convey those patterns.

Discrete Plots

Back to top

The plot type we used to make our first plot, geom_point, works when both the x and y values are continuous. But sometimes one of your values may be discrete (i.e. categorical).

We’ve previously used the discrete values of the continent column to color in our points and lines. But now let’s try moving that variable to the x axis. Let’s say we are curious about comparing the distribution of the lung cancer rates for each of the different continents for the smoking_1990 data. We can do so using a box plot. Try this out yourself in the exercise below!

Plotting and interpreting box plots

Using the smoking_1990 data, use ggplot to create a box plot with continent on the x axis and lung cancer rates on the y axis. The geom you will want to use is geom_boxplot(). You can use the examples from earlier in the lesson as a template to remember how to pass ggplot data and map aesthetics and geometries onto the plot. If you’re really stuck, feel free to use the internet as well!

Which continent tends to have countries with the highest lung cancer rates? The lowest?

Solution

ggplot(data = smoking_1990) +
 aes(x = continent, y = lung_cancer_pct) +
 geom_boxplot()

This type of visualization makes it easy to compare the range and spread of values across groups. The “middle” 50% of the data is located inside the box and outliers that are far away from the central mass of the data are drawn as points. The bar in the middle of the box is the median. Here, we can see that the median bar for Europe is highest, indicating that countries in Europe tend to have higher rates of lung cancer than countries on other continents. Countries in Africa tend to have lower lung cancer rates than countries on other continents.

Bonus Exercise: Other discrete geoms

Take a look a the ggplot cheat sheet. Find all the geoms listed under “one discrete, one continuous.” Try replacing geom_boxplot with one of these other functions.

Example solution

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_violin() 

Color vs. Fill

Back to top

Let’s take the boxplot that we made previously and add code to make the color corresponds to continent. Remember how to do that?

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct, color = continent) +
  geom_boxplot()

Well, that didn’t get all that colorful. That’s because objects like these boxplots have two different parts that have a color: the shape outline, and the inner part of the shape. For geoms that have an inner part, you change the fill color with fill= rather than color=, so let’s try that instead:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct, fill = continent) +
  geom_boxplot()

That got more colorful. Neither one of these (color vs. fill) is better than the other here, it’s more up to your personal preference.

Let’s say we want to change the fill of our plots, but to all the same color. Maybe we want our boxplots to be “lightblue”.

Quotes or no quotes?

To change the color of our boxplots to lightblue, do you think we need to put lightblue in quotes or not? Why?

Solution

We want to put it in quotes because it isn’t a column name in our dataset or a variable in our environment.

Let’s try it out without quotes first:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct, fill = lightblue) +
  geom_boxplot()
Error in `geom_boxplot()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `FUN()`:
! object 'lightblue' not found

Like we just discussed, we get an error because when we don’t include quotes, R looks for the lightblue object in our dataframe and our environment, but it doesn’t find it there. Instead, we have to put it in quotes so that R knows not to search for that variable, but instead to actually use the word itself:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct, fill = "lightblue") +
  geom_boxplot()

Hmm that’s still not quite what we want. In this example, we placed the fill inside the aes() function, which maps aesthetics to data. In this case, we only have one value: the word “lightblue”. Instead, let’s do this by explicitly setting the color aesthetic inside the geom_boxplot() function. Because we are assigning a color directly and not using any values from our data to do so, we do not need to use the aes() mapping function. Let’s try it out:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_boxplot(fill = "lightblue")

That’s better! R knows many color names. You can see the full list if you run colors() in the console. Since there are so many, you can randomly choose 10 if you run sample(colors(), size = 10).

Choosing a color

Use sample(colors(), size = 10) a few times until you get an interesting sounding color name and swap that out for “lightblue” in the box plot example.

Layers

Back to top

So far we’ve only been adding one geom to each plot, but each plot object can actually contain multiple layers and each layer has it’s own geom. Now let’s add a layer of points on top of our boxplot that will show us the “raw” data:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_boxplot(fill = "lightblue") +
  geom_point()

We’ve drawn the points but most of them stack up on top of each other. One way to make it easier to see all the data is to change the transparency of the points. We can do this using the alpha argument, which decides how transparent to make the points. It takes a value between 0 and 1 where 0 is entirely transparent and 1 is entirely opaque (the default).

Inside aes() or geom?

Let’s say we want to change the transparency of our points to an alpha of 0.3. Do we want alpha to go inside our aes() function or our geom_boxplot()? Why? Test out both and see if you’re right!

Solution

We want alpha to go inside geom_boxplot() since we’re telling ggplot the number we want it to use; it’s not coming from our data.

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_boxplot(fill = "lightblue") +
  geom_point(alpha = 0.3)

Bonus: Too many overlapping points for alpha to work?

We have many observations/data points, so even making the points transparent doesn’t really help us see them! Another option is to “jitter” the points. This adds some random variation to the position of the points so you can see them better. We can do this using geom_jitter().

WARNING!!! geom_jitter() changes the position of points and should therefore only be used for discrete variables that don’t have numerical values!!!

Since we are plotting a discrete value on the x axis, and a continuous value on the y axis, we will need to tell geom_jitter() not to change the y value positions. We can do this by setting height = 0 inside the geom. We will also modify the degree to which points are jittered on the x axis by setting the width argument. Feel free to play around with width to get a plot that you like. Remember, we can only do this because the x axis is discrete!

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_boxplot(fill = "lightblue") +
  geom_jitter(alpha = 0.3, height = 0, width = 0.05)

That looks better!

Predicting output

What do you think will happen if you switch the order of geom_boxplot() and geom_point()? Why? Test it out to see if you were right.

Solution

Since we plot the geom_point() layer first, the boxplot layer is placed on top of the geom_point() layer, so we cannot see a lot of the points.

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_point(, alpha = 0.3) +
  geom_boxplot(fill = "lightblue")

Going back to having the points on top, let’s color the points by continent. If we add a color aesthetic to the plot, then both the boxplot and the points are colored by continent:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct, color = continent) +
  geom_boxplot(fill = "lightblue") +
  geom_point(alpha = 0.3)

So how do we make it so that just the points are colored but not the boxplots? Each layer can have it’s own set of aesthetic mappings. So far we’ve been using aes() outside of the other functions. When we do this, we are setting the “default” aesthetic mappings for the plot. But we can also set the asethetics inside the specific geom that we want to change. To do that, you can place an additional aes() inside of that layer:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_boxplot(fill = "lightblue") +
  geom_point(aes(color = continent), alpha = 0.3)

Nice! Both geom_boxplot() and geom_point() will inherit the default values of aes(continent, lung_cancer_pct) in the base plot, but only geom_jitter will also use aes(color = continent).

Bonus: Aesthetics inside the ggplot() function

Instead of mapping our aesthetics to each geom, we can provide default aesthetics by passing the values to the ggplot() function call. Any aesthetics we want to be specific to a layer, we would keep in the geom function for that layer:

ggplot(data = smoking_1990, mapping = aes(x = continent, y = lung_cancer_pct)) +
  geom_boxplot(fill = "lightblue") +
  geom_point(aes(color = continent), alpha = 0.3)

Here, both geom_boxplot() and geom_point() will inherit the default values of aes(continent, lung_cancer_pct) in the base plot, but only geom_point() will also use aes(color = continent).

Bonus Exercise: Make your own violin plot

Now create a violin plot comparing percent of people in a country who smoke by continent.

If you have extra time, customize your plot however you want. If there’s something you want to do but don’t know how, try searching on the internet for it.

Solution

ggplot(data = smoking_1990) +
  aes(x = continent, y = smoke_pct) +
  geom_violin()

Univariate Plots

Back to top

We jumped right into make plots with multiple columns. But what if we wanted to take a look at just one column? This can be really useful if we want to understand how certain continuous exposures or outcomes are distributed in our dataset. In that case, we only need to specify a mapping for x and choose an appropriate geom.

Univariate continuous

Let’s start with a histogram to see the range and spread of the lung cancer rates:

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot shows us that many of the lung cancer rates in our dataset are really low (less than 0.025%), but there are some outliers with higher rates. Another word for data with this shape is right-skewed, because it has a long tail on the right side of the histogram.

When you ran the code to make the histogram, you should not only see the plot in the plot window, but also a message telling you to choose a better bin value. Histograms can look very different depending on the number of bars you decide to draw. The default is 30. Let’s try setting a different value by explicitly passing a bin= argument to the geom_histogram later.

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins=20)

Try different values like 5 or 50 to see how the plot changes.

Bonus Exercise: One variable plots

Rather than a histogram, choose one of the other geometries listed under “One Variable” continuous plots on the ggplot cheat sheet.

Example solution

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_density()

Univariate discrete

What if we want to plot a univariate discrete variable, like continent? For this, we can use a bar chart.

Exercise: Discrete univariate plots

Create a bar plot of continent that shows the number of data points we have for each continent. You can try guessing the geom or look it up on the cheat sheet or Internet. Which continents have the most and fewest countries? How can you tell?

Example solution

ggplot(smoking_1990) +
  aes(x = continent) +
  geom_bar()

Africa has the most countries and Oceania has the fewest. We can tell this because Africa has the highest bar and Oceania has the lowest.

Facets

Back to top

If you have a lot of different columns to try to plot or have distinguishable subgroups in your data, a powerful plotting technique called faceting might come in handy. When you facet your plot, you basically make a bunch of smaller plots and combine them together into a single image. Luckily, ggplot makes this very easy. Let’s start with the histogram that we were just working with:

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins=20)

Now, let’s draw a separate box for each continent. We can do this with facet_wrap()

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins=20) +
  facet_wrap(vars(continent))

Now, it’s easier to see the patterns within and between continents.

Note that facet_wrap requires an extra helper function called vars() in order to pass in the column names. It’s a lot like the aes() function, but it doesn’t require an aesthetic name. We can see in this output that we get a separate box with a label for each continent so that only the values for that continent are in that box.

The other faceting function ggplot provides is facet_grid(). The main difference is that facet_grid() will make sure all of your smaller boxes share a common axis. In this example, we will put the boxes into columns side-by-side so that their y axes all line up. We can do this using the cols argument inside facet_grid.

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins=20) +
  facet_grid(cols = vars(continent))

Unlike the facet_wrap output where each box got its own x and y axis, with facet_grid(), there is only one y axis along the left.

Exercise: Faceting

Facet the scatter plot we made as our first plot by continent. Are there differences in correlation between continents?

Solution

You can copy all the code from the first plot, or you can use the saved variable that we made above and add to that:

cancer_v_smoke +
  facet_wrap(vars(continent))

There don’t seem to be many differences between continents.

Bonus Exercise: Practice saving

Store the plot you made above in an object named my_plot, and save the plot using ggsave().

Example solution

my_plot <- cancer_v_smoke +
  facet_wrap(vars(continent))

ggsave("cancer_v_smoke_faceted.jpg", plot = my_plot, width=6, height=4)

Plot Themes

Back to top

Our plots are looking pretty nice, but what’s with that grey background? While you can change various elements of a ggplot object manually (background color, grid lines, etc.) the ggplot package also has a bunch of nice built-in themes to change the look of your graph. For example, let’s try adding theme_bw() to our histogram:

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins=20) +
  facet_grid(cols = vars(continent)) +
  theme_bw()

Try out a few other themes, to see which you like: theme_classic(), theme_linedraw(), theme_minimal().

Rotating x axis labels

Often, you’ll want to change something about the theme that you don’t know how to do off the top of your head. When this happens, you can do an Internet search to help find what you’re looking for. To practice this, search the Internet to figure out how to rotate the x axis labels 90 degrees. Then try it out using the histogram plot we made above.

Solution

ggplot(smoking_1990) +
 aes(x = lung_cancer_pct) +
 geom_histogram(bins=20) +
 facet_grid(cols = vars(continent)) +
 theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Plotting for data exploration recap

Back to top

We learned a lot in this lesson! Let’s go over the key points:

  1. ggplot is a powerful way to make plots.
  2. ggplot is all about layering - you can layer different geometries, aesthetics, labels, and other information onto your plots.
  3. You can customize the color, size, shape, and theme of your plots.
  4. ggplot allows you to easily save publication-quality plots.
  5. There are lots of different plot types. Some of the most useful ones are:
    1. Scatter plots (for two continuous variables).
    2. Boxplots (for one discrete and one continuous variable).
    3. Histograms (for one continuous variable).
    4. Bar plots (for one discrete variable).
  6. Faceting allows you to easily make the same plot separated by a discrete variable of interest.

With the skills you’ve learned here, you’re now ready to start doing your own data exploration!

Applying it to your own data

Back to top

Now that we’ve learned how impactful effective data visualization can be, and how to create informative visuals in R, it’s time for you to start thinking about what you want to do with your own data!

Discuss with your group your data and what type of exploratory data analysis you would like to perform.

Questions to answer:

  1. In 1-2 sentences, describe the information covered in your dataset.
  2. How large is your dataset? How many rows? How many columns?
  3. Write down 3 specific questions you could answer with your data set.
  4. Think through 3 data visualizations that can answer the questions you have. Specifically, for each one: a) Write down your question or goal for the plot. b) Write down the variables needed to answer your question. c) Choose a geometry. d) Choose an aesthetic for each variable. e) Draw a draft plot with pen and paper to determine whether you think these choices will work.

Glossary of terms

Back to top

Key Points

  • Use read_csv() to read tabular data in R.

  • Geometries are the visual elements drawn on data visualizations (lines, points, etc.), and aesthetics are the visual properties of those geometries that are assigned to variables in the data (color, position, etc.).

  • Use ggplot() and geoms to create data visualizations, and save them using ggsave().