Introduction to The DiscovR Workshop

Overview

Teaching: 15 min
Exercises: 0 min

Questions

Who is the workshop for?

What will the workshop cover?

What else do I need to know about the workshop?

Objectives

Set expectations.

Meet each other.

Introduce the workshop goals.

Go over logistics.

DiscovR stands for “Data integration: strategies, concepts, and visuals in R”

Who is this workshop for?

This workshop is for data managers and others working with data who are interested in learning the foundations of data science and coding in R so that they can use them in their own work. We believe everyone can learn to code, and that many of you will find it very useful for tasks such as data analysis and plotting.

This workshop is targeted to absolute beginners, and we expect that you have zero data science or coding experience coming in. That being said, you’re welcome to attend the workshop if you already have a coding background but want to learn more!

To provide an inclusive learning environment, we follow The Carpentries Code of Conduct. We expect that instructors, facilitators, and learners abide by this code of conduct, including practicing the following behaviors:

Use welcoming and inclusive language.
Be respectful of different viewpoints and experiences.
Gracefully accept constructive criticism.
Focus on what is best for the community.
Show courtesy and respect towards other community members.

Introducing the instructors and facilitators

Now that you know a little about The Carpentries as an organization, the instructors and facilitators will introduce themselves and what they’ll be teaching/helping with.

Introducing participants

Introduce yourself with your preferred name, role, affiliation, work/research area, and Kenyan name and meaning.

What will the workshop cover?

This workshop will introduce you to exploratory data analysis and effective data visualization, and how to implement these concepts using the R programming language.

While we will focus primarily on public health applications, the programs you learn here are used every day in computational workflows in diverse fields: microbiology, statistics, neuroscience, genetics, social and behavioral sciences such as psychology and economics, and many others.

A workflow is a set of steps to read data, analyze it, and produce numerical and graphical results to support an assertion or hypothesis. These steps are encapsulated in a set of computer files that can be run from scratch on the same data to obtain the same results. This is highly desirable in situations where the same work is done repeatedly (think of processing data from an annual survey). It is also desirable for reproducibility, which enables you and other people to look at what you did and produce the same results later on. It is increasingly common for people to publish scientific articles along with the data and computer code that generated the results discussed within them.

The programs we will use are:

R: a statistical analysis and data management program,
RStudio: a graphical interface to use R, and
R Markdown: a method for writing reproducible reports.

We’ll use these tools to manage data, perform basic statistical analyses, and make pretty plots!

While the workshop won’t make you an expert, we hope to provide you with a foundational understanding in coding for data analysis and visualization, automating your work, and creating reproducible programs. We also hope to provide you with some fundamentals that you can incorporate in your own work.

At the end, we provide links to resources you can use to learn about these topics in more depth than this workshop can provide.

Asking questions and getting help

One last note before we get into the workshop.

If you have general questions about a topic, please raise your hand to ask them. The instructor will be happy to answer!

For more specific nitty-gritty questions about issues you’re having individually, we use sticky notes to indicate whether you are on track or need help. We’ll use these throughout the workshop to help us determine when you need help with a specific issue (a facilitator will come help), whether our pace is too fast, and whether you are finished with exercises. If you indicate that you need help because, for instance, you get an error in your code (e.g. red sticky), a facilitator will come help you figure things out. Feel free to also call facilitators over through a hand wave if we don’t see your sticky!

Other miscellaneous things

If you’re in person, we’ll tell you where the bathrooms are! Also let us know if there are any accommodations we can provide to help make your learning experience better.

Key Points

We follow The Carpentries Code of Conduct.

Our fundamental goal is to become more comfortable exploring and working with data.

Our workshop goal is to write a sharable and reproducible report.

This lesson content is targeted to absolute beginners with no data science or coding experience.

Getting Started with R

Overview

Teaching: 75 min
Exercises: 30 min

Questions

What are R and RStudio?

How do I perform tasks and store information?

Objectives

To become oriented with R and RStudio.

To learn about functions and objects.

Introduction to R and RStudio
Foundational topics
1. Functions
2. Objects
Glossary of terms

Why learn to program?

Share why you’re interested in learning how to code.

Solution:

There are lots of different reasons, including to perform data analysis and generate figures. I’m sure you have more specific reasons for why you’d like to learn!

Introduction to R and RStudio

To perform exploratory analyses, we need the data we want to explore and a platform to analyze the data.

You already have the data. But what platform will we use to analyze the data? We have many options!

We could try to use a spreadsheet program like Microsoft Excel or Google Sheets, but these have limited access and less flexibility, and they don’t easily allow for things that are critical to “reproducible” research, like easily sharing the steps used to explore and make changes to the original data.

We could also use a program like SAS or Stata, which are used by many epidemiologists. However, these programs are not freely available, their graphics are not as customizable, and they have fewer specialized packages for niche analyses.

Instead, we’ll use a more general programming language to test our hypothesis. Today we will use R, but we could have also used Python for the same reasons we chose R. Both R and Python are freely available, the instructions you use to do the analysis are easily shared, and by using reproducible practices, it’s straightforward to add more data or to change settings like colors or the size of a plotting symbol.

Bonus: But why R and not Python?

There’s no great reason. Although there are subtle differences between the languages, it’s ultimately a matter of personal preference. Both are powerful and popular languages that have very well developed and welcoming communities of scientists that use them. As you learn more about R, you may find things that are annoying in R that aren’t so annoying in Python; the same could be said of learning Python. If the community you work in uses R, then you’re in the right place.

To run R, all you really need is the R program, which is available for computers running the Windows, Mac OS X, or Linux operating systems. You installed R while getting set up for this workshop.

To make your life in R easier, there is a great (and free!) program called RStudio that you also installed and used during set up. As we work today, we’ll use features that are available in RStudio for writing and running code, managing projects, installing packages, getting help, and much more. It is important to remember that R and RStudio are different, but complementary programs. You need R to use RStudio.

To get started, we’ll spend a little time getting familiar with the RStudio environment and setting it up to suit your tastes. When you start RStudio, you’ll have three panels.

On the left you’ll have a panel with three tabs - Console, Terminal, and Jobs. The Console tab is what running R from the command line looks like. This is where you can enter R code. Try typing in 2+2 at the prompt (>).

In the upper right panel are tabs indicating the Environment, History, and a few other things. In the lower right panel are tabs for Files, Plots, Packages, Help, and Viewer. We’ll spend more time in each of these tabs as we go through the workshop, so we won’t spend a lot of time discussing them now.

Let’s get going on our analysis!

One of the helpful features in RStudio is the ability to create a project. A project is a special directory that contains all of the code and data that you will need to run an analysis.

At the top of your screen you’ll see the “File” menu. Select that menu and then the menu for “New Project…”.

When the smaller window opens, select “Existing Directory” and then the “Browse” button in the next window.

Navigate to the directory that contains your code and data from the setup instructions and click the “Open” button.

Then click the “Create Project” button.

Did you notice anything change?

In the lower right corner of your RStudio session, you should notice that your Files tab is now your project directory. You’ll also see a file called un-report.Rproj in that directory.

From now on, you should start RStudio by double clicking on that file. This will make sure you are in the correct directory when you run your analysis.

We’d like to create a file where we can keep track of our R code.

Back in the “File” menu, you’ll see the first option is “New File”. Selecting “New File” opens another menu to the right and the first option is “R Script”. Select “R Script”.

Now we have a fourth panel in the upper left corner of RStudio that includes an Editor tab with an untitled R Script. Let’s save this file as intro_to_r.R in our project directory.

We will be entering R code into the Editor tab to run in our Console panel.

On line 1 of intro_to_r.R, type 2+2.

With your cursor on the line with the 2+2, press Ctrl+Enter on your keyboard (Cmd+Enter on a Mac). You should be able to see that 2+2 was run in the Console. (You can also click Run in the top right side of the Editor, but this isn’t quite as easy.)

As you write more code, you can highlight multiple lines and then press Ctrl+Enter, or click Run, to run all of the lines you have selected.

Comments

Sometimes you may want to write notes in your code to help you remember what your code is doing, but you don’t want R to think these notes are a part of the code you want to evaluate. That’s where comments come in! Anything after a # symbol in your code will be ignored by R:
# this is a comment
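
A comment can take up a whole line or follow code on the same line. The short sketch below is only an illustration of both placements; the wording of the comments is made up for this example.

```r
# Add two numbers (R ignores this whole line)
2 + 2  # R runs 2 + 2 and ignores everything after the # symbol
```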

Foundational topics

Functions

Functions are built-in procedures that automate a task for you. You input arguments into a function and the function returns a value. We’ll go over a few math functions to get our feet wet.

You call a function in R by typing its name followed by opening and closing parentheses. Each function has a purpose, which is often hinted at by the name of the function.

Let’s start with the sqrt() function.

Let’s try to run the function without anything inside the parentheses.

sqrt()

Error in sqrt(): 0 arguments passed to 'sqrt' which requires 1

We get an error message. Don’t panic! Error messages pop up all the time, and can be super helpful in debugging code.

In this case, the message tells us zero arguments were passed to the function, but it requires one. Many functions, including sqrt(), require additional pieces of information to do their job. We call these additional values “arguments” or “parameters.” You pass arguments to a function by placing values in between the parentheses. A function takes in these arguments and works behind the scenes to output something we’re interested in.

For example, we want to provide a number to sqrt(), namely the number we want the square root of:

sqrt(4)

[1] 2

Here, the input argument is 4, and the output is 2, just like we’d expect.

Now let’s do an example where we might not know the expected output:

sqrt(2)

[1] 1.414214

Great, now let’s move onto a slightly more complicated function. If we want to round a number, we can use the round() function:

round(3.14159)

[1] 3

Why did this round to three? What if we want it to round to a different number of digits?

Pro-tip

Each function has a help page that documents what arguments the function expects and what value it will return. You can bring up the help page a few different ways. If you have typed the function name in the Editor window, you can put your cursor on the function name and press F1 to open the help page in the Help viewer in the lower right corner of RStudio. You can also type ? followed by the function name in the console.

For example, try running ?round in the console. A help page should pop up with information about what the function is used for and how to use it, as well as useful examples of the function in action. As you can see, round() has two arguments: the numeric input and the number of digits to round to.
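
If you only need a quick reminder of a function’s arguments, one option is the base R function args(), which prints the argument names and any default values without opening the full help page. This is a small sketch rather than a replacement for the help page, and the exact printout can differ slightly between R versions.

```r
# Show the arguments of round() and their default values
args(round)

# The same idea works for other functions, such as log()
args(log)
```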

We can use the digits argument in round() to change how many decimal places are kept:

round(3.14159, digits = 2)

[1] 3.14

Sometimes it is helpful - or even necessary - to include the argument name, but often we can skip the argument name, if the argument values are passed in the order they are defined:

round(3.14159, 2)

[1] 3.14

Position of the arguments in functions

Which of the following lines of code will give you an output of 3.14? For the one(s) that don’t give you 3.14, what do they give you?

round(x = 3.1415)

round(x = 3.1415, digits = 2)

round(digits = 2, x = 3.1415)

round(2, 3.1415)

round(3.14159265, 2)

Solution

The 1st line will give you 3 because the default number of digits is 0.

The 2nd and 3rd lines will give you the right answer because the arguments are named, and when you use names the order doesn’t matter.

The 4th line will give you 2 because, since you didn’t name the arguments, x=2 and digits=3.1415.

The 5th line will also give you the right answer because the arguments are in the correct order.

Bonus Exercise: taking logarithms

Calculate the following:

Natural log (ln) of 10

Log base 10 of 10 (challenge: try to do this 2 different ways), and

Log base 3 of 10
Solution
# natural log (ln) of 10
log(10)
# log base 10 of 10
log10(10)
log(10, base = 10)
# log base 3 of 10
log(10, base = 3)
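
As a quick check on the last answer, a logarithm in any base can also be computed by dividing the natural log of the number by the natural log of the base. The sketch below compares the two approaches using all.equal(), a base R function that tests whether two numbers agree up to tiny rounding differences.

```r
# Log base 3 of 10, computed two ways
log(10, base = 3)
log(10) / log(3)   # natural log of 10 divided by natural log of 3

# Check that the two results agree
all.equal(log(10, base = 3), log(10) / log(3))
# [1] TRUE
```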

If all this function stuff sounds confusing, don’t worry! We’ll see a bunch of examples as we go that will make things clearer.

Objects

Sometimes we want to store information for later use or transformation. To do this in R, we store the information, or object, in a variable name that you can think of like a storage box.

Let’s say we want to round the square root of a number. One way we can do this is to put a function inside a function:

round(sqrt(2), 2)

[1] 1.41

Another way is to store the square root output first, and then round that.

To store an object for later, we first have to decide on a name for the box we want to store it in. Let’s say we want to call it square_root. Then we have to tell R what we want to put in the box. We use the <- symbol, the assignment operator, to assign values generated or typed on the right to object names on the left. An alternative symbol that you might see used as an assignment operator is =, but it is clearer to only use <- for assignment. We use this symbol so often that RStudio has a keyboard shortcut for it: Alt+- on Windows, and Option+- on Mac.

Let’s assign sqrt(2) to the object square_root. We can see that square_root contains the square root of 2:

square_root <- sqrt(2)
square_root

[1] 1.414214

In R terms, square_root is a named object that references or stores something. In this case, square_root stores the square root of 2.

Notice that we also have a new value in our environment in the upper right hand corner of RStudio. This panel lists all of the objects that we have stored in our environment; it’s kind of like a view into our storage room (environment) of all the boxes (objects) of things we have access to.

Now let’s round the square root of 2 to 2 decimal places:

sqrt_rounded <- round(square_root, 2)
sqrt_rounded

[1] 1.41

This is a fairly straightforward example, but you’ll see the usefulness of storing things in variables as the workshop progresses.

Now, what happens to sqrt_rounded if we update square_root?

square_root <- sqrt(4)
square_root

[1] 2

sqrt_rounded

[1] 1.41

It doesn’t update! That’s because we haven’t re-run the code that rounded square_root. The values don’t update automatically like in a spreadsheet.
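
To bring sqrt_rounded up to date, the rounding line has to be run again after square_root changes. A minimal sketch using the same object names as above:

```r
square_root <- sqrt(4)

# Re-run the rounding step so it uses the new value of square_root
sqrt_rounded <- round(square_root, 2)
sqrt_rounded
# [1] 2
```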

Predicting object contents

What is my_number after these three lines are run?
my_number <- 10
my_number + 5
my_number <- my_number + 7

10

15

17

22

Solution

The answer is 17 because 10 is stored in my_number in the first line, 15 is printed after the second line but is not stored so my_number remains 10, and then 7 is added to my_number in the third line, making 17. If we ran the third line again, my_number would be 24. Because the object value changes depending on the number of times we run the final line, in most cases it’s best practice to not overwrite objects like this.
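
One way to avoid this problem is to store the result under a new name instead of overwriting the original object. In the sketch below, my_number_plus_7 is a made-up name chosen only for illustration.

```r
my_number <- 10

# Store the result in a new object instead of overwriting my_number
my_number_plus_7 <- my_number + 7

my_number          # still 10, however many times the line above is run
my_number_plus_7   # always 17
```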

Guidelines on naming objects

They cannot start with a number (2x is not valid, but x2 is) or contain special characters other than underscores (_) and periods (.).

R is case sensitive, so for example, weight is different from Weight.

You cannot use spaces in the name.

There are some names that cannot be used because they are the names of fundamental functions in R (e.g., if, else, for; see here for a complete list). If in doubt, check the help to see if the name is already in use (?function_name).
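
If you are unsure whether a name follows these rules, one possible check is the base R function make.names(), which converts any text into a valid object name. If the result is identical to what you typed, the name was already valid. The example names below are made up for illustration.

```r
# make.names() returns a valid version of each name
make.names("x2")        # unchanged, so "x2" is a valid name
make.names("2x")        # changed: names cannot start with a number
make.names("my data")   # changed: names cannot contain spaces
make.names("if")        # changed: "if" is a reserved word in R
```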

Bonus Exercise: Bad names for objects

Try to assign values to some new variable names. What do you notice? After running all four lines of code below, what value do you think the variable Flower holds?
1number <- 3
Flower <- "marigold"
flower <- "rose"
favorite number <- 12
Solution

Notice that we get an error when we try to assign values to 1number and favorite number. This is because we cannot start an object name with a numeral and we cannot have spaces in object names. The object Flower still holds “marigold.” This is because R is case-sensitive, so running flower <- "rose" does NOT change the Flower object. This can get confusing, and is why we generally avoid having objects with the same name and different capitalization.

Getting unstuck

Sometimes you may accidentally run a line of code that isn’t quite complete yet. For instance:
my_number <- 
What happens when you run this? In your console at the bottom of your screen, you may see a + instead of a > at the beginning of the line. This means that R is waiting for more information. In this case, it’s because it doesn’t know what you want to store in my_number. You can do one of two things if this happens - finish the command you want to type (e.g. by entering a number), or hit the escape key to get unstuck.

Quotes vs. No Quotes

Let’s say we wanted to print out a word:
tree
Error in eval(expr, envir, enclos): object 'tree' not found
You’ll notice that we get an error, that the object ‘tree’ is not found. This is because R is looking for an object called tree. But what we really want is to just print out the word “tree”. To do this, we put the word in quotes (single or double) so R knows that it’s not an object it needs to look for:
"tree"
[1] "tree"
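
The difference is easiest to see when an object and a quoted word share the same spelling. In this sketch, my_tree and its contents are made up for illustration.

```r
# Store the word "oak" in an object called my_tree
my_tree <- "oak"

my_tree     # no quotes: R looks up the object and prints its contents, "oak"
"my_tree"   # quotes: R prints the text itself, "my_tree"
```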

Glossary of terms

Comments: lines or parts of lines that are not run. In R, comments start with a #.
Function: takes input and generates output.
Object: way to store information for later use and manipulation.

Key Points

R is a free programming language used by many for reproducible data analysis.

Functions allow you to perform complex tasks.

Objects allow you to store information.

R for Plotting

Overview

Teaching: 90 min
Exercises: 30 min

Questions

How do I read data into R?

What are geometries and aesthetics?

How can I use R to create and save professional data visualizations?

Objectives

To be able to read in data from csv files.

To create plots with both discrete and continuous variables.

To understand mapping and layering using ggplot2.

To be able to modify a plot’s color, theme, and axis labels.

To be able to save plots to a local directory.

The “goal” of the workshop
Overview of the lesson
Directory structure
Loading and reviewing data
Our first plot
Plotting for data exploration
Applying it to your own data
Glossary of terms

The “goal” of the workshop

Our goal is to write a report to the United Nations on the relationship between lung cancer, smoking, and air pollution. In other words, we are going to analyze how countries’ smoking rates and air pollution may be related to the percent of people with lung cancer.

To get to that point, we’ll need to learn how to manage data, make plots, and generate reports. The next section discusses in more detail exactly what we will cover.

Overview of the lesson

In this lesson, we will go over how to read tabular data into R (e.g. from a csv file) and plot it for exploratory data analysis.

Exercise: Create a new R Script file

We would like to create a file where we can keep track of our R code. On your own, create a file called plotting.R in the project directory.

Solution

Navigate to the “File” menu in RStudio. You’ll see the first option is “New File.” Selecting “New File” opens another menu to the right, and the first option is “R Script.” Select “R Script.” Alternatively, you can click on the white square button with a green plus sign in the upper left corner and select “R Script.”

Now you have an untitled R Script in your Editor tab. Save this file as plotting.R in our project directory.

Directory structure

Exercise: File organization

When you’re working on a project, how do you organize your files?

Take a look at your un-report directory. You should be able to see it in the bottom right side of your screen under the “Files” tab. What folders are there, and why do you think they’re there?

Solution

There are lots of different ways to organize files, but you should have some consistent method of organizing them so that it’s easy to find what you want. If you have all of your files in one folder, it can get kind of confusing to find what you need. In the un-report directory there are three “sub-directories”: data, figures, and reports.

data contains all of the data that we will need for the workshop.

figures is where we will save the figures we generate during the workshop.

reports is where we will save our final report.
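
You can also look at the project folders from the R console instead of the Files tab. The sketch below uses the base R functions list.files() and dir.exists(), and assumes the un-report project is open so that R starts looking from the project directory.

```r
# List the files and folders in the current (project) directory
list.files()

# List what is inside the data folder
list.files("data")

# Check whether each expected sub-directory exists (TRUE or FALSE for each)
dir.exists(c("data", "figures", "reports"))
```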

Loading and reviewing data

The tidyverse vs Base R

If you’ve used R before, you may have learned commands that are different than the ones we will be using during this workshop. We will be focusing on functions from the tidyverse. The “tidyverse” is a collection of R packages that have been designed to work well together and offer many convenient features that do not come with a fresh install of R (aka “base R”). These packages are very popular and have a lot of developer support including many staff members from RStudio. These functions generally help you to write code that is easier to read and maintain. We believe learning these tools will help you become more productive more quickly.

First, we’re going to load the tidyverse package. Packages are useful because they contain pre-made functions to do specific tasks. Tidyverse contains a set of functions that makes it easier for us to do complex analyses and create professional visualizations in R. The way we access all of these useful functions is by running the following command:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

When you loaded the tidyverse package, you probably got a message like the one we got above. This isn’t an error! These messages are just giving you more information about what happened when you loaded tidyverse. For now, we don’t have to worry about the messages. You can read the bonus section below for more details.

Bonus: What’s with all those messages???

The tidyverse messages give you more information about what happened when you loaded tidyverse. The tidyverse is actually a collection of several different packages, so the first section of the message tells us what packages were loaded (attached) when we loaded tidyverse (these include ggplot2, which we’ll be using a lot in this lesson, and dplyr, which you’ll be introduced to tomorrow in the R for Data Analysis lesson).

The second section of messages gives a list of “conflicts.” Sometimes, the same function name will be used in two different packages, and R has to decide which function to use. For example, our message says that:
dplyr::filter() masks stats::filter()
This means that two different packages (dplyr from tidyverse and stats from base R) have a function named filter(). By default, R uses the function from the package that was most recently loaded, so if we try using the filter() function after loading tidyverse, we will be using the filter() function from dplyr.
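
If you ever need a specific version of a function, you can put the package name and two colons (::) in front of the function name. The sketch below calls the stats version of filter() directly, which here computes a moving average over each run of three numbers. The dplyr line is left as a comment with placeholder names, since dplyr is introduced in a later lesson.

```r
# Explicitly use filter() from the stats package:
# a moving average over each run of 3 values in 1, 2, 3, 4, 5
stats::filter(1:5, rep(1/3, 3))

# The dplyr version can be requested the same way (placeholder names, not run):
# dplyr::filter(some_data, some_condition)
```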

Okay, now let’s read in our data, smoking_cancer_1990.csv. To do this, we need to know the file path, which tells R where to find the file on your computer. When you have a project open in R, it starts looking from your main project folder, in our case un-report. Inside un-report, we have a folder called data, and in that folder is the smoking_cancer_1990.csv file. This is the file that contains the data that we want to plot. So the file path from our main project directory is: data/smoking_cancer_1990.csv. The / tells R that the file is in the data directory.
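
Before reading the file, it can help to confirm that R can find it at that path. A minimal sketch using the base R functions file.exists() and getwd(), assuming the un-report project is open:

```r
# TRUE if R can find the file starting from the current working directory
file.exists("data/smoking_cancer_1990.csv")

# If the result is FALSE, check which directory R is currently working in
getwd()
```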

We’re going to use the read_csv() function that we loaded in with the tidyverse, and save it to smoking_1990, which will act as a placeholder for our data. This function takes a file path and returns a tibble, which is basically a table (that we sometimes call a data frame…).

smoking_1990 <- read_csv("data/smoking_cancer_1990.csv")

Rows: 191 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, pop, smoke_pct, lung_cancer_pct

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A few things printed out to the screen: it tells us how many rows and columns are in our data, and information about each of the columns. Each row contains the year (“year”), the continent (“continent”), the total population (“pop”), the age-standardized percent of people who smoke (“smoke_pct”), and the age-standardized percent of people who have lung cancer (“lung_cancer_pct”) for a given country (“country”). We can see that two of the columns are characters (categorical variables), and four are doubles (numbers).

Bonus: Characters vs. factors

Note: In versions of R before 4.0, base R functions such as read.csv() read categorical variables in as factors by default. Factors are special data objects that are used to store categorical data and have limited numbers of unique values. The unique values of a factor are tracked via the “levels” of a factor. A factor will always remember all of its levels even if the values don’t actually appear in your data. The factor will also remember the order of the levels and will always print values out in the same order (by default this order is alphabetical).

If your columns are stored as character values but you need factors for plotting, ggplot will convert them to factors for you as needed.

Now let’s look at the data a bit more. In the Environment tab in the upper right corner of RStudio, you will now see smoking_1990 listed. If you click on it, it will pop up in a tab next to your script.

After we’ve reviewed the data, you’ll want to make sure to click the tab in the upper left to return to your plotting.R file so we can start writing some code.

Another way to look at the data is to print it out to the console:

smoking_1990

# A tibble: 191 × 6
    year country             continent          pop smoke_pct lung_cancer_pct
   <dbl> <chr>               <chr>            <dbl>     <dbl>           <dbl>
 1  1990 Afghanistan         Asia          12412311      3.12          0.0127
 2  1990 Albania             Europe         3286542     24.2           0.0327
 3  1990 Algeria             Africa        25758872     18.9           0.0118
 4  1990 Andorra             Europe           54508     36.6           0.0609
 5  1990 Angola              Africa        11848385     12.5           0.0139
 6  1990 Antigua and Barbuda North America    62533      6.80          0.0105
 7  1990 Argentina           South America 32618648     30.4           0.0344
 8  1990 Armenia             Europe         3538164     30.5           0.0441
 9  1990 Australia           Oceania       17065100     29.3           0.0599
10  1990 Austria             Europe         7677850     35.4           0.0439
# ℹ 181 more rows

The read_csv() function took the file path we provided, did who-knows-what behind the scenes, and then outputted an R object with the data stored in that csv file. All that, with one short line of code!
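
A few base R functions give a quick summary of a data object without printing every row. So that the sketch below runs on its own, it builds a toy data frame called smoking_small (a made-up name) from the first three rows printed above, using the rounded values as displayed. In the workshop you would run the same functions on smoking_1990 instead.

```r
# A toy data frame built from the first three rows of the printed output
smoking_small <- data.frame(
  year = c(1990, 1990, 1990),
  country = c("Afghanistan", "Albania", "Algeria"),
  continent = c("Asia", "Europe", "Africa"),
  pop = c(12412311, 3286542, 25758872),
  smoke_pct = c(3.12, 24.2, 18.9),
  lung_cancer_pct = c(0.0127, 0.0327, 0.0118)
)

dim(smoking_small)     # number of rows, then number of columns
names(smoking_small)   # the column names
str(smoking_small)     # the type and first few values of each column
```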

Data objects

There are many different ways to store data in R. Most objects have a table-like structure with rows and columns. We will refer to these objects generally as “data objects”. If you’ve used R before, you may be used to calling them “data frames”. Functions from the tidyverse such as read_csv() work with objects called “tibbles”, which are a specialized kind of “data frame.” Another common way to store data is a “data table”. All of these types of data objects (tibbles, data frames, and data tables) can be used with the commands we will learn in this lesson to make plots. We may sometimes use these terms interchangeably.

Bonus Exercise: Reading in an Excel file

Say you have an Excel file and not a csv - how would you read that in? Hint: Use the Internet to help you figure it out!

Solution

One way is using the read_excel() function in the readxl package. Hint: you may need to use install.packages() to install the readxl package. There are other ways to read in Excel files, but this is our preferred method because the output will be in the same format as the output of read_csv().
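
A minimal sketch of this approach is shown below. The file path data/my_data.xlsx and the object name my_data are hypothetical, so replace them with your own. The if statement is an extra safety check added for this example: it only tries to read the file when readxl is installed and the file actually exists.

```r
# Hypothetical path: replace with the location of your own Excel file
excel_path <- "data/my_data.xlsx"

# Only read the file if the readxl package is installed and the file exists
if (requireNamespace("readxl", quietly = TRUE) && file.exists(excel_path)) {
  my_data <- readxl::read_excel(excel_path)
  print(my_data)
}
```

If nothing is printed, either readxl is not installed yet (try install.packages("readxl")) or R could not find a file at that path.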

Our first plot

Creating our first plot

We will be using the ggplot2 package, which is part of the tidyverse, to make our plots. This is a very powerful package that creates professional looking plots and is one of the reasons people like using R so much.

When making a plot, you first have to come up with a question you wish to answer related to your data. Here, we are interested in whether there is a relationship between the percent of people who smoke and the percent of people with lung cancer.

What do we want to plot?

Given that we are interested in whether there is a relationship between the percent of people who smoke and the percent of people with lung cancer:

What variables would you want to put on the x and y axes?

What columns do those variables correspond to in our smoking_1990 dataset?

What type of plot would you want to make?

Hint: take a look at this blog post if you need ideas about which plot type might be good.

Solution

A scatter plot with percent of people who smoke (smoke_pct column in our data) on the x axis and percent of people with lung cancer (lung_cancer_pct column in our data) on the y axis will allow you to visualize the correlation between these two variables.

Now that we’ve figured out what we want to plot and what columns in our dataset we need to use, let’s get started! All plots made using the ggplot2 package start by calling the ggplot() function. In the tab you created for the plotting.R file, type the following:

ggplot(data=smoking_1990)

To run code that you’ve typed in the editor, you have a few options. Remember that the quickest way to run the code is by pressing Ctrl+Enter on your keyboard. This will run the line of code that currently contains your cursor or any highlighted code.

When we run this code, the Plots tab will pop to the front in the lower right corner of the RStudio screen. Right now, we just see a big grey rectangle.

What we’ve done is created a ggplot object and told it we will be using the data from the smoking_1990 object that we’ve loaded into R. We’ve done this by calling the ggplot() function with smoking_1990 as the data argument.

So we’ve made a plot object, now we need to start telling it what we actually want to draw on this plot.

The elements of a plot have a bunch of properties like an x and y position, a size, a color, etc. When creating a data visualization, we can map variables in our dataset to these properties, called aesthetics, in our plot. In ggplot, we can do this by creating an “aesthetic mapping”, which we do with the aes() function.

To create our plot, we need to map variables from our smoking_1990 object to ggplot aesthetics using the aes() function. Since we have already told ggplot that we are using the data in the smoking_1990 object, we can access the columns of smoking_1990 using the object’s column names. (Remember, R is case-sensitive, so we have to be careful to match the column names exactly!)

Let’s start by telling our plot object that we want to map our smoking values to the x axis of our plot.
We do this by adding (+) information to our plot object. Add this new line to your code and run both lines by highlighting them and pressing Ctrl+Enter on your keyboard:

ggplot(data = smoking_1990) +
  aes(x = smoke_pct)

Note that we’ve added this new function call on a second line just to make the code easier to read. For this to work, the + must be at the end of the first line; otherwise R assumes the command is complete and runs it before reading the next line. The + sign therefore does two jobs: it adds information to the plot, and it tells R that the command continues on the next line.
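The placement of the + can be sketched with a tiny example. The toy_smoking object below uses made-up values so the code runs on its own; with the workshop data you would use smoking_1990 instead.

```r
library(ggplot2)

# Made-up values for illustration only (not the workshop data)
toy_smoking <- data.frame(
  smoke_pct = c(10, 20, 30),
  lung_cancer_pct = c(0.01, 0.02, 0.04)
)

# Works: the + ends the first line, so R keeps reading the next line
toy_plot <- ggplot(data = toy_smoking) +
  aes(x = smoke_pct)

# Does not work as intended: the first line is already a complete command,
# so R runs it by itself and then typically reports an error for the
# line that starts with +
# ggplot(data = toy_smoking)
#   + aes(x = smoke_pct)
```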

Observe that our Plot window is no longer a grey square. We now see that we’ve mapped the smoke_pct column to the x axis of our plot. Note that that column name isn’t very pretty as an x-axis label, so let’s add the labs() function to make a nicer label for the x axis:

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke")

Quotes vs. No Quotes (refresher)

Notice that when we added the label value we did so by placing the values inside quotes. This is because we are not using a value from inside our data object - we are providing the name directly. When you need to include actual text values in R, they will be placed inside quotes to tell them apart from other object or variable names.

The general rule is that if you want to use values from the columns of your data object, then you supply the name of the column without quotes, but if you want to specify a value that does not come from your data, then use quotes.

Mapping lung cancer rates to the y axis

Map our lung_cancer_pct values to the y axis and give them a nice label.

Solution

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer")

Excellent. We’ve now told our plot object where the x and y values are coming from and what they stand for. But we haven’t told our object how we want it to draw the data. There are many different plot types (bar charts, scatter plots, histograms, etc). We tell our plot object what to draw by adding a geometry (“geom” for short) to our object. We will talk about many different geometries today, but for our first plot, let’s draw our data using the “points” geometry for each value in the data set. To do this, we add geom_point() to our plot object:

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point()

Now we’re really getting somewhere. It finally looks like a proper plot! We can now see a trend in the data. It looks like countries with greater smoking rates tend to have higher lung cancer rates, though it’s important to remember that we can’t infer causality from this plot alone.

Let’s add a title to our plot to make that clearer. Again, we will use the labs() function, but this time we will use the title = argument.

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?")

No one can deny we’ve made a very handsome plot! We can immediately see that there is a positive association between lung cancer rates and smoking rates. But now looking at the data, we might be curious about learning more about the points that are the extremes of the data. We know that we have two more pieces of data in the smoking_1990 object that we haven’t used yet. Maybe we are curious if the different continents show different patterns in smoking rates and lung cancer rates. One thing we could do is use a different color for each of the continents. To map the continent of each point to a color, we will again use the aes() function.

Color the points by continent

To color the points by continent, you will need to add that to your aesthetic function. Fill in the blank below with the correct column from your data to make this happen.
ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(_____)
What information can you learn from the plot?
Solution
ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(color = continent)
Here we can see that in 1990 the African countries tended to have much lower smoking and lung cancer rates than many other continents.

Notice that when we add a mapping for color, ggplot automatically provided a legend for us. It took care of assigning different colors to each of our unique values of the continent variable. (Note that when we mapped the x and y values, those drew the actual axis labels, so in a way the axes are like the legends for the x and y values).

The colors that ggplot uses are determined by the color “scale”. Each aesthetic value we can supply (x, y, color, etc) has a corresponding scale. Let’s change the colors to make them a bit prettier. While we’re at it, let’s capitalize the legend title too.

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(color = continent) +
  scale_color_brewer(palette = "Set2") +
  labs(color = "Continent")

The scale_color_brewer() function is just one of many you can use to change colors. There are a bunch of “palettes” that are built in. You can view them all by running RColorBrewer::display.brewer.all() or check out the Color Brewer website for more info about choosing plot colors. Check out the bonus exercise below for even more options.

Bonus Exercise: Changing colors

There are lots of ways to change colors when using ggplot. The scale_color_brewer() function is one of many you can use to change colors. There are a bunch of “palettes” that are built in. You can view them all by running RColorBrewer::display.brewer.all() or check out the Color Brewer website for more info about choosing plot colors. There are also lots of other fun options:

Viridis

National parks

LaCroix

Wes Anderson

ggsci

Play around with different color palettes. Feel free to install another package and choose one of those if you want. Pick your favorite!
Solution

You can use RColorBrewer::display.brewer.all() to pick a color palette. As a bonus, you can also use one of the packages listed above. Here’s an example:
ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(color = continent) +
  labs(color = "Continent") +
  scale_color_viridis_d(option = "turbo")

Since we have the data for the population of each country, we might be curious about the relationship between population, smoking rates, and lung cancer rates. Do you think larger countries will have a greater or lower lung cancer rate? Let’s find out by mapping the population of each country to the size of our points.

Changing point sizes

Map the population of each country to the size of our points. HINT: Is size an aesthetic or a geometry? If you’re stuck, feel free to Google it, or look at the help menu.

Solution

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(color = continent) +
  scale_color_brewer(palette = "Set2") +
  labs(color = "Continent") +
  aes(size = pop)

There doesn’t seem to be a very strong association with population size. We also got another legend here for size, which is nice, but the values look a bit ugly in scientific notation. Let’s divide all the values by 1,000,000 and label our legend “Population (in millions)”

ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(color = continent) +
  scale_color_brewer(palette = "Set2") +
  labs(color = "Continent") +
  aes(size = pop/1000000) +
  labs(size = "Population (in millions)")

This works because you can treat the columns in the aesthetic mappings just like any other variables and can use functions to transform or change them at plot time rather than having to transform your data first.
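Division is only one of the transformations you can apply inside aes(). As a hedged sketch, the example below maps log10(pop) to size instead. The toy_smoking object and its values are made up for illustration, and the log transformation is our own choice here, not something the workshop plot uses.

```r
library(ggplot2)

# Made-up values for illustration only (not the workshop data)
toy_smoking <- data.frame(
  smoke_pct = c(10, 20, 30),
  lung_cancer_pct = c(0.01, 0.02, 0.04),
  pop = c(2e6, 5e7, 1.2e9)
)

# Any expression of the columns can go inside aes(); here log10() compresses
# the very wide range of population values before they are mapped to size
pop_plot <- ggplot(data = toy_smoking) +
  aes(x = smoke_pct, y = lung_cancer_pct, size = log10(pop)) +
  geom_point() +
  labs(size = "Population (log10)")

pop_plot

# The data object itself is unchanged: pop still holds the raw counts
toy_smoking$pop
```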

Good work! Take a moment to appreciate what a cool plot you made with a few lines of code. To fully view its beauty you can click the “Zoom” button in the Plots tab - it will break free from the lower right corner and open the plot in its own window.

Bonus Exercise: Changing shapes

Instead of (or in addition to) color, change the shape of the points so each continent has a different shape. (I’m not saying this is a great thing to do - it’s just for practice!) HINT: Is shape an aesthetic or a geometry? If you’re stuck, feel free to Google it, or look at the help menu.
Solution

You’ll want to use the aes aesthetic function to change the shape:
ggplot(data = smoking_1990) +
  aes(x = smoke_pct) +
  labs(x = "Percent of people who smoke") +
  aes(y = lung_cancer_pct) +
  labs(y = "Percent of people with lung cancer") +
  geom_point() +
  labs(title = "Are lung cancer rates associated with smoking rates?") +
  aes(color = continent) +
  scale_color_brewer(palette = "Set2") +
  labs(color = "Continent") +
  aes(size = pop/1000000) +
  labs(size = "Population (in millions)") +
  aes(shape = continent)

For our first plot we added each line of code one at a time so you could see the exact effect it had on the output. But when you start to make a bunch of plots, you can combine many of these steps so you don’t have to type as much. For example, you can collect all the aes() statements into one call and all the labs() statements into another. A more condensed version of the exact same plot would look like this:

ggplot(data = smoking_1990) +
  aes(x = smoke_pct, y = lung_cancer_pct, color = continent, size = pop/1000000) +
  geom_point() +
  scale_color_brewer(palette = "Set2") +
  labs(x = "Percent of people who smoke", y = "Percent of people with lung cancer",
    title = "Are lung cancer rates associated with smoking rates?", color = "Continent", size = "Population (in millions)")

Storing our plot

We learned about how to save things to object names in the previous lesson. We can do the same thing with plots! Store our final plot in an object called cancer_v_smoke.

Solution

 cancer_v_smoke <- ggplot(data = smoking_1990) +
   aes(x = smoke_pct, y = lung_cancer_pct, color = continent, size = pop/1000000) +
   geom_point() +
   scale_color_brewer(palette = "Set2") +
   labs(x = "Percent of people who smoke", y = "Percent of people with lung cancer",
     title = "Are lung cancer rates associated with smoking rates?", color = "Continent", size = "Population (in millions)")

Saving our first plot

Let’s say you want to share your plot with friends or co-workers who aren’t running R. It’s wise to keep all the code you used to draw the plot, but sometimes you need to make a PNG or PDF version of the plot so you can share it with your PI or post it to your Instagram story.

To save your plot, you can use the ggsave() function. A few things about ggsave() (the good and the bad):

[The bad] By default, ggsave() will save the last plot you made, but this can get confusing so it’s best to create a plot object and then save the specific plot you’re interested in instead.
[The neutral] The default width and height are sometimes not great options, but you can supply width= and height= arguments to change them (the default values are in inches).
[The good] It will determine the file type based on the name you provide.

Let’s save our plot (with an informative name) as a PNG that is 6 inches wide and 4 inches tall:

ggsave(filename = "figures/cancer_v_smoke.png", plot = cancer_v_smoke, width = 6, height = 4)
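The call above assumes that a figures folder already exists in your working directory. Depending on your ggplot2 version, ggsave() may stop or ask what to do if the folder is missing. The sketch below creates the folder first and also shows the file type being chosen from the file name. The toy data, the object names, and the temporary folder location are made up so the example runs on its own.

```r
library(ggplot2)

# Made-up values for illustration only (not the workshop data)
toy_smoking <- data.frame(
  smoke_pct = c(10, 20, 30),
  lung_cancer_pct = c(0.01, 0.02, 0.04)
)

toy_plot <- ggplot(data = toy_smoking) +
  aes(x = smoke_pct, y = lung_cancer_pct) +
  geom_point()

# Hypothetical output folder inside R's temporary directory;
# in your own project you would likely use "figures" instead
out_dir <- file.path(tempdir(), "figures")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# The file extension decides the format: .png for the first file, .pdf for the second
ggsave(filename = file.path(out_dir, "toy_plot.png"), plot = toy_plot, width = 6, height = 4)
ggsave(filename = file.path(out_dir, "toy_plot.pdf"), plot = toy_plot, width = 6, height = 4)

# Confirm that both files were written
list.files(out_dir)
```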

Debugging code

Debugging is the process of finding and fixing errors or unexpected outputs in your code. Even well-seasoned coders run into bugs all the time.

Here are some strategies programmers use to deal with coding errors:

Don’t panic. Bugs are a normal part of the coding process.

If you are getting an error message, read the error message carefully. Unfortunately, not all error messages are well written and it may not be obvious at first what is wrong.

Check for typos.

Check that your parentheses and quotes are balanced and check that you haven’t misspelled a variable or function name, or used the wrong one.

It’s difficult to identify the exact location where an error starts so you may have to look at lines before the line where the error was reported.

In RStudio, look at the code coloring to find anything that looks off. RStudio will also put a red x or a yellow exclamation point to the left of lines where there is a syntax error.

Try running each command on its own.

Before each command, check that you are passing the values you expect.

After each command, verify that the results seem sensible.

If you’re getting an error, search online for the error message along with the function that is not working.

Consider checking out the following resources to learn more about debugging.

“5 Essential Tips to Debug Any Piece of Code” by mayuko [video, 8min] - Good general advice for debugging.

“Object of type ‘closure’ is not subsettable” by Jenny Bryan [video, 50min] - A great talk with R specific advice about dealing with errors as a data scientist.
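The advice to check what you are passing before each command, and to verify the result afterwards, can be sketched with a few base R functions. The toy_smoking object and its values are made up for illustration.

```r
# Made-up values for illustration only (not the workshop data)
toy_smoking <- data.frame(
  smoke_pct = c(10, 20, 30),
  lung_cancer_pct = c(0.01, 0.02, 0.04),
  continent = c("Africa", "Europe", "Asia")
)

# Before plotting: confirm the column names are spelled the way you expect
names(toy_smoking)
has_lowercase <- "smoke_pct" %in% names(toy_smoking)
has_capitalized <- "Smoke_pct" %in% names(toy_smoking)  # R is case-sensitive
has_lowercase
has_capitalized

# Check the type and the first few values of every column
str(toy_smoking)

# After a command: check that the numbers fall in a sensible range
summary(toy_smoking$smoke_pct)
```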

Understanding common bugs

Sometimes you accidentally type things wrong and get unexpected results or errors. We call these mis-types “bugs”. Let’s go through some common ones. The most important things to remember are:

The order of parentheses, quotes, commas, and plusses matters.

Sometimes you accidentally forget a plus where you need one or include one where you don’t.

For each of the examples below, figure out what the bug is and how to fix it. Feel free to copy/paste into RStudio to help you figure it out.
# Bug 1

 ggplot(data = smoking_1990) +
 aes(x = "smoke_pct", y = lung_cancer_pct, color = continent, size = pop/1000000) +
   geom_point() +
   labs(x = "Percent of people who smoke", y = "Percent of people with lung cancer",
        title = "Are lung cancer rates associated with smoking rates?",
       color = "Continent", size = "Population (in millions)") 

# Bug 2

ggplot(data = smoking_1990) +
   aes(x = smoke_pct, y = lung_cancer_pct, color = continent, size = pop/1000000) +
   geom_point() +
   labs(x = "Percent of people who smoke", y = "Percent of people with lung cancer",
        title = "Are lung cancer rates associated with smoking rates?",
       color = "Continent", size = "Population (in millions)"))

# Bug 3
ggplot(data = smoking_1990) +
   aes(x = smoke_pct, y = lung_cancer_pct, color = continent, size = pop/1000000) +
   geom_point() +
   labs(x = "Percent of people who smoke", y = "Percent of people with lung cancer",
        title = "Are lung cancer rates associated with smoking rates?",
       color = "Continent" size = "Population (in millions)")
Solution

Bug 1: We generated a plot, but it doesn’t look like what we expect. The bug is in our mapping of aesthetics: aes(x = "smoke_pct", y = lung_cancer_pct, color = continent, size = pop/1000000). Because "smoke_pct" is in quotation marks, ggplot understands that as a single value, rather than an aesthetic mapped to the smoke_pct variable in the smoking_1990 dataset. To correct this bug, remove the quotes from "smoke_pct" so that ggplot looks for the smoke_pct column in our dataset.

Bug 2: This code generates the following error:
Error: unexpected ')' in:
"      scale_color_brewer(palette = "Set2") +
     labs(x = "Percent of people who smoke", y = "Percent of people  with lung cancer", title = "Are lung cancer rates associated with smoking rates?", size = "Population (in millions)"))"
Although it might be alarming to get this error, it’s actually quite helpful! You can see that the error points out that we have an unexpected closing parenthesis “)” in the last two lines of our code. Look closely, and you’ll see that we accidentally put an extra “)” on the labs() layer in the last line of code.

Bug 3: This code generates the following error:
Error: unexpected symbol in:
"                       title = "Are lung cancer rates associated with smoking rates?",
                      color = "Continent" size"
This error message tells us that there was something unexpected in the labs() function on either the line where we specified the title or the line where we specified the color. These errors can be some of the hardest to figure out, because the message is not very specific. However, if you look closely, you will see that we are missing a comma between color = "Continent" and size = "Population (in millions)" inside the labs() function.

Recap of what we’ve learned so far

Now that we’ve made our first plot, let’s review the most important things to remember when plotting with ggplot. Making plots using ggplot is all about layering on information.

First, you have to give the ggplot() function your data.
   1. It looks in this data for the information in the columns you tell it to use for your plot.
Then, you have to tell ggplot what specific information from your data you want to plot and how you want that data to show up.
   1. You use the aes() function for this.
   2. Inside this function you tell it how you want the data to show up (on the x axis, on the y axis, as a color, etc.) and where that data is coming from (the column name in your dataset).
Finally, you have to tell ggplot what type of plot you want to make.
   1. All ggplot plot types start with the word geom (e.g. geom_point()).
You can customize the labels and colors on your plot to make them nicer and more informative.
   1. There’s a lot more you can customize as well. We will go into some of this later on in the lesson.
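The steps above can be sketched as a small annotated template. The toy_smoking object and its values are made up so the code runs on its own; with the workshop data you would use smoking_1990.

```r
library(ggplot2)

# Made-up values for illustration only (not the workshop data)
toy_smoking <- data.frame(
  smoke_pct = c(10, 20, 30),
  lung_cancer_pct = c(0.01, 0.02, 0.04)
)

recap_plot <- ggplot(data = toy_smoking) +         # 1. give ggplot the data
  aes(x = smoke_pct, y = lung_cancer_pct) +        # 2. map columns to aesthetics
  geom_point() +                                   # 3. choose the type of plot
  labs(x = "Percent of people who smoke",          # optional: nicer labels
       y = "Percent of people with lung cancer")

recap_plot
```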

This is a lot to remember!

Pro-tip

Those of us that use R on a daily basis use cheat sheets to help us remember how to use various R functions. You can find the cheat sheets in RStudio by going to the “Help” menu and selecting “Cheat Sheets”. The ones that will be most helpful in this workshop are “Data visualization with ggplot2”, “Data Transformation with dplyr”, “R Markdown Cheat Sheet”, and “R Markdown Reference Guide”.

For things that aren’t on the cheat sheets, Google is your best friend. Even expert coders use Google when they’re stuck or trying something new!

Let’s take a moment to orient ourselves to our “Data visualization with ggplot2” cheat sheet. What we just went over is summarized in the “Basics” section in the upper left hand side of the front page of your cheat sheet. The other sections contain more information about different geometries and aesthetics you can use in your plots. We will go over some of these in the next section.

Bonus Exercise: Make your own scatter plot

Now create your own scatter plot comparing population and percent of people who smoke. Looking at your plot, can you guess which two countries have the largest populations?

If you have extra time, customize your plot however you want. If there’s something you want to do but don’t know how, try searching on the internet for it.
Solution
ggplot(data = smoking_1990) +
  aes(x = pop, y = smoke_pct) +
  geom_point()
(China and India are the two countries with the largest populations.)

Plotting for data exploration

Now that we’ve made our first plot, we’re going to dig into other ways to visualize data using ggplot. The main goal here is to find meaningful patterns in complex data and create visualizations to convey those patterns.

Discrete Plots

The plot type we used to make our first plot, geom_point, works when both the x and y values are continuous. But sometimes one of your values may be discrete (i.e. categorical).

We’ve previously used the discrete values of the continent column to color in our points. But now let’s try moving that variable to the x axis. Let’s say we are curious about comparing the distribution of the lung cancer rates for each of the different continents for the smoking_1990 data. We can do so using a box plot. Try this out yourself in the exercise below!

Plotting and interpreting box plots

Using the smoking_1990 data, use ggplot to create a box plot with continent on the x axis and lung cancer rates on the y axis. The geom you will want to use is geom_boxplot(). You can use the examples from earlier in the lesson as a template to remember how to pass ggplot data and map aesthetics and geometries onto the plot. If you’re really stuck, feel free to use the internet as well!

Which continent tends to have countries with the highest lung cancer rates? The lowest?
Solution
ggplot(data = smoking_1990) +
 aes(x = continent, y = lung_cancer_pct) +
 geom_boxplot()
This type of visualization makes it easy to compare the range and spread of values across groups. The “middle” 50% of the data is located inside the box and outliers that are far away from the central mass of the data are drawn as points. The bar in the middle of the box is the median. Here, we can see that the median bar for Europe is highest, indicating that countries in Europe tend to have higher rates of lung cancer than countries on other continents. Countries in Africa tend to have lower lung cancer rates than countries on other continents.
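If you want to check what a box plot is summarizing, you can compute the same kind of numbers directly. This is a minimal sketch with made-up values; toy_rates is a hypothetical stand-in for the workshop data.

```r
# Made-up values for illustration only (not the workshop data)
toy_rates <- data.frame(
  continent = c("Africa", "Africa", "Africa", "Europe", "Europe", "Europe"),
  lung_cancer_pct = c(0.01, 0.02, 0.03, 0.04, 0.06, 0.05)
)

# Median per continent: the value drawn as the bar inside each box
medians <- tapply(toy_rates$lung_cancer_pct, toy_rates$continent, median)
medians

# 25th and 75th percentiles per continent: the bottom and top edges of each box
tapply(toy_rates$lung_cancer_pct, toy_rates$continent,
       quantile, probs = c(0.25, 0.75))
```

Comparing these numbers with the plot is a quick way to confirm you are reading the boxes correctly.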

Bonus Exercise: Other discrete geoms

Take a look at the ggplot cheat sheet. Find all the geoms listed under “one discrete, one continuous.” Try replacing geom_boxplot with one of these other functions.
Example solution
ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_violin() 

Color vs. Fill

Let’s take the boxplot that we made previously and add code to make the color correspond to continent. Remember how to do that?

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct, color = continent) +
  geom_boxplot()

Well, that didn’t get all that colorful. That’s because objects like these boxplots have two different parts that have a color: the shape outline, and the inner part of the shape. For geoms that have an inner part, you change the fill color with fill= rather than color=, so let’s try that instead:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct, fill = continent) +
  geom_boxplot()

That got more colorful. Neither one of these (color vs. fill) is better than the other here, it’s more up to your personal preference.

Let’s say we want to change the fill of our plots, but to all the same color. Maybe we want our boxplots to be “lightblue”.

Quotes or no quotes?

To change the color of our boxplots to lightblue, do you think we need to put lightblue in quotes or not? Why?

Solution

We want to put it in quotes because it isn’t a column name in our dataset or a variable in our environment.

Let’s try it out without quotes first:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct, fill = lightblue) +
  geom_boxplot()

Error in `geom_boxplot()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `FUN()`:
! object 'lightblue' not found

Like we just discussed, we get an error because when we don’t include quotes, R looks for the lightblue object in our dataframe and our environment, but it doesn’t find it there. Instead, we have to put it in quotes so that R knows not to search for that variable, but instead to actually use the word itself:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct, fill = "lightblue") +
  geom_boxplot()

Hmm that’s still not quite what we want. In this example, we placed the fill inside the aes() function, which maps aesthetics to data. ggplot therefore treated the word “lightblue” as a data value with a single category and picked a fill color for that category from its default color scale, rather than using light blue itself. Instead, let’s explicitly set the fill inside the geom_boxplot() function. Because we are assigning a color directly and not using any values from our data to do so, we do not need to use the aes() mapping function. Let’s try it out:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_boxplot(fill = "lightblue")

That’s better! R knows many color names. You can see the full list if you run colors() in the console. Since there are so many, you can randomly choose 10 if you run sample(colors(), size = 10).
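Because colors() returns an ordinary character vector, you can also search it or check a spelling before using a name in a plot. A minimal sketch follows; the misspelled name is deliberate.

```r
# colors() returns a character vector of the color names R knows
all_colors <- colors()
length(all_colors)

# Check a name before you use it
"lightblue" %in% all_colors   # TRUE
"lightblu" %in% all_colors    # FALSE: a deliberate misspelling

# Find the first few names that contain "blue"
head(grep("blue", all_colors, value = TRUE))
```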

Choosing a color

Use sample(colors(), size = 10) a few times until you get an interesting sounding color name and swap that out for “lightblue” in the box plot example.

Layers

So far we’ve only been adding one geom to each plot, but each plot object can actually contain multiple layers and each layer has its own geom. Now let’s add a layer of points on top of our boxplot that will show us the “raw” data:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_boxplot(fill = "lightblue") +
  geom_point()

We’ve drawn the points but most of them stack up on top of each other. One way to make it easier to see all the data is to change the transparency of the points. We can do this using the alpha argument, which decides how transparent to make the points. It takes a value between 0 and 1 where 0 is entirely transparent and 1 is entirely opaque (the default).

Inside aes() or geom?

Let’s say we want to change the transparency of our points to an alpha of 0.3. Do we want alpha to go inside our aes() function or our geom_point()? Why? Test out both and see if you’re right!
Solution

We want alpha to go inside geom_point() since we’re telling ggplot the number we want it to use; it’s not coming from our data.
ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_boxplot(fill = "lightblue") +
  geom_point(alpha = 0.3)

Bonus: Too many overlapping points for alpha to work?

We have many observations/data points, so even making the points transparent doesn’t really help us see them! Another option is to “jitter” the points. This adds some random variation to the position of the points so you can see them better. We can do this using geom_jitter().

WARNING!!! geom_jitter() changes the position of points and should therefore only be used for discrete variables that don’t have numerical values!!!

Since we are plotting a discrete value on the x axis, and a continuous value on the y axis, we will need to tell geom_jitter() not to change the y value positions. We can do this by setting height = 0 inside the geom. We will also modify the degree to which points are jittered on the x axis by setting the width argument. Feel free to play around with width to get a plot that you like. Remember, we can only do this because the x axis is discrete!
ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_boxplot(fill = "lightblue") +
  geom_jitter(alpha = 0.3, height = 0, width = 0.05)
That looks better!
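Because the jitter is random, the points land in slightly different places every time you redraw the plot. If you need the same picture each time, one option is to pass a seed through position_jitter(), as in this sketch. The toy_rates object, its values, and the seed are made up for illustration.

```r
library(ggplot2)

# Made-up values for illustration only (not the workshop data)
toy_rates <- data.frame(
  continent = c("Africa", "Africa", "Africa", "Europe", "Europe", "Europe"),
  lung_cancer_pct = c(0.01, 0.02, 0.03, 0.04, 0.06, 0.05)
)

# geom_point() with position_jitter() jitters the points like geom_jitter(),
# and the seed argument makes the random shifts repeatable
jitter_plot <- ggplot(data = toy_rates) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_boxplot(fill = "lightblue") +
  geom_point(alpha = 0.3,
             position = position_jitter(width = 0.05, height = 0, seed = 42))

jitter_plot
```

As in the example above, height = 0 keeps the y values exactly where the data put them.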

Predicting output

What do you think will happen if you switch the order of geom_boxplot() and geom_point()? Why? Test it out to see if you were right.
Solution

Since we plot the geom_point() layer first, the boxplot layer is placed on top of the geom_point() layer, so we cannot see a lot of the points.
ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_point(alpha = 0.3) +
  geom_boxplot(fill = "lightblue")

Going back to having the points on top, let’s color the points by continent. If we add a color aesthetic to the plot, then both the boxplot and the points are colored by continent:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct, color = continent) +
  geom_boxplot(fill = "lightblue") +
  geom_point(alpha = 0.3)

So how do we make it so that just the points are colored but not the boxplots? Each layer can have its own set of aesthetic mappings. So far we’ve been using aes() outside of the other functions. When we do this, we are setting the “default” aesthetic mappings for the plot. But we can also set the aesthetics inside the specific geom that we want to change. To do that, you can place an additional aes() inside of that layer:

ggplot(data = smoking_1990) +
  aes(x = continent, y = lung_cancer_pct) +
  geom_boxplot(fill = "lightblue") +
  geom_point(aes(color = continent), alpha = 0.3)

Nice! Both geom_boxplot() and geom_point() will inherit the default values of aes(continent, lung_cancer_pct) in the base plot, but only geom_point() will also use aes(color = continent).

Bonus: Aesthetics inside the ggplot() function

Instead of mapping our aesthetics to each geom, we can provide default aesthetics by passing the values to the ggplot() function call. Any aesthetics we want to be specific to a layer, we would keep in the geom function for that layer:
ggplot(data = smoking_1990, mapping = aes(x = continent, y = lung_cancer_pct)) +
  geom_boxplot(fill = "lightblue") +
  geom_point(aes(color = continent), alpha = 0.3)
Here, both geom_boxplot() and geom_point() will inherit the default values of aes(continent, lung_cancer_pct) in the base plot, but only geom_point() will also use aes(color = continent).

Bonus Exercise: Make your own violin plot

Now create a violin plot comparing percent of people in a country who smoke by continent.

If you have extra time, customize your plot however you want. If there’s something you want to do but don’t know how, try searching on the internet for it.
Solution
ggplot(data = smoking_1990) +
  aes(x = continent, y = smoke_pct) +
  geom_violin()

Univariate Plots

We jumped right into making plots with multiple columns. But what if we wanted to take a look at just one column? This can be really useful if we want to understand how certain continuous exposures or outcomes are distributed in our dataset. In that case, we only need to specify a mapping for x and choose an appropriate geom.

Univariate continuous

Let’s start with a histogram to see the range and spread of the lung cancer rates:

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot shows us that many of the lung cancer rates in our dataset are really low (less than 0.025%), but there are some outliers with higher rates. Another word for data with this shape is right-skewed, because it has a long tail on the right side of the histogram.

When you ran the code to make the histogram, you should have seen not only the plot in the plot window, but also a message telling you to choose a better bin value. Histograms can look very different depending on the number of bars you decide to draw. The default is 30. Let’s try setting a different value by explicitly passing a bins = argument to the geom_histogram() layer.

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins=20)

Try different values like 5 or 50 to see how the plot changes.
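
The message also mentions binwidth, which sets the width of each bar instead of the number of bars. Below is a minimal sketch, assuming a made-up set of values (toy_rates, hypothetical) rather than the workshop data. layer_data() is a ggplot2 helper that returns the values computed for a layer; it is used here only to count the bins.

```r
library(ggplot2)

# Hypothetical right-skewed values standing in for lung_cancer_pct
toy_rates <- data.frame(
  lung_cancer_pct = c(seq(0.001, 0.02, length.out = 40), 0.05, 0.08)
)

# Control the number of bins directly
by_bins <- ggplot(toy_rates) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins = 20)

# Or control the width of each bin, in the units of the x axis
by_width <- ggplot(toy_rates) +
  aes(x = lung_cancer_pct) +
  geom_histogram(binwidth = 0.005)

# layer_data() returns one row per bin in the first layer
bars_by_bins <- layer_data(by_bins)
bars_by_width <- layer_data(by_width)
nrow(bars_by_bins)
nrow(bars_by_width)
```

Which argument to use is a matter of taste: bins is convenient when you just want a coarser or finer picture, while binwidth is easier to explain when the units of the x axis matter.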

Bonus Exercise: One variable plots

Rather than a histogram, choose one of the other geometries listed under “One Variable” continuous plots on the ggplot cheat sheet.
Example solution
ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_density()

Univariate discrete

What if we want to plot a univariate discrete variable, like continent? For this, we can use a bar chart.

Exercise: Discrete univariate plots

Create a bar plot of continent that shows the number of data points we have for each continent. You can try guessing the geom or look it up on the cheat sheet or Internet. Which continents have the most and fewest countries? How can you tell?
Example solution
ggplot(smoking_1990) +
  aes(x = continent) +
  geom_bar()
Africa has the most countries and Oceania has the fewest. We can tell this because Africa has the highest bar and Oceania has the lowest.

If you have a lot of different columns to try to plot or have distinguishable subgroups in your data, a powerful plotting technique called faceting might come in handy. When you facet your plot, you basically make a bunch of smaller plots and combine them together into a single image. Luckily, ggplot makes this very easy. Let’s start with the histogram that we were just working with:

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins=20)

Now, let’s draw a separate box for each continent. We can do this with facet_wrap()

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins=20) +
  facet_wrap(vars(continent))

Now, it’s easier to see the patterns within and between continents.

Note that facet_wrap requires an extra helper function called vars() in order to pass in the column names. It’s a lot like the aes() function, but it doesn’t require an aesthetic name. We can see in this output that we get a separate box with a label for each continent so that only the values for that continent are in that box.

The other faceting function ggplot provides is facet_grid(). The main difference is that facet_grid() will make sure all of your smaller boxes share a common axis. In this example, we will put the boxes into columns side-by-side so that their y axes all line up. We can do this using the cols argument inside facet_grid.

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins=20) +
  facet_grid(cols = vars(continent))

Unlike the facet_wrap output where each box got its own x and y axis, with facet_grid(), there is only one y axis along the left.
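
facet_grid() can also stack the boxes vertically through its rows argument, so that the x axes line up instead. A minimal sketch, assuming a small made-up data frame (toy_smoking, hypothetical) in place of smoking_1990:

```r
library(ggplot2)

# Hypothetical toy data standing in for smoking_1990
toy_smoking <- data.frame(
  continent = rep(c("Africa", "Asia", "Europe"), each = 4),
  lung_cancer_pct = c(0.01, 0.01, 0.02, 0.03,
                      0.02, 0.03, 0.03, 0.05,
                      0.03, 0.04, 0.06, 0.07)
)

base_hist <- ggplot(toy_smoking) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins = 5)

# One box per continent, side by side, sharing the y axis
in_columns <- base_hist + facet_grid(cols = vars(continent))

# One box per continent, stacked on top of each other, sharing the x axis
in_rows <- base_hist + facet_grid(rows = vars(continent))

in_rows
```

Stacking in rows tends to make it easier to compare where each distribution sits along the x axis, while columns make it easier to compare bar heights.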

Exercise: Faceting

Facet the scatter plot we made as our first plot by continent. Are there differences in correlation between continents?
Solution

You can copy all the code from the first plot, or you can use the saved variable that we made above and add to that:
cancer_v_smoke +
  facet_wrap(vars(continent))
There don’t seem to be many differences between continents.

Bonus Exercise: Practice saving

Store the plot you made above in an object named my_plot, and save the plot using ggsave().
Example solution
my_plot <- cancer_v_smoke +
  facet_wrap(vars(continent))

ggsave("cancer_v_smoke_faceted.jpg", plot = my_plot, width=6, height=4)

Plot Themes

Our plots are looking pretty nice, but what’s with that grey background? While you can change various elements of a ggplot object manually (background color, grid lines, etc.) the ggplot package also has a bunch of nice built-in themes to change the look of your graph. For example, let’s try adding theme_bw() to our histogram:

ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins=20) +
  facet_grid(cols = vars(continent)) +
  theme_bw()

Try out a few other themes, to see which you like: theme_classic(), theme_linedraw(), theme_minimal().

Rotating x axis labels

Often, you’ll want to change something about the theme that you don’t know how to do off the top of your head. When this happens, you can do an Internet search to help find what you’re looking for. To practice this, search the Internet to figure out how to rotate the x axis labels 90 degrees. Then try it out using the histogram plot we made above.
Solution
ggplot(smoking_1990) +
  aes(x = lung_cancer_pct) +
  geom_histogram(bins=20) +
  facet_grid(cols = vars(continent)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Plotting for data exploration recap

We learned a lot in this lesson! Let’s go over the key points:

ggplot is a powerful way to make plots.
ggplot is all about layering - you can layer different geometries, aesthetics, labels, and other information onto your plots.
You can customize the color, size, shape, and theme of your plots.
ggplot allows you to easily save publication-quality plots.
There are lots of different plot types. Some of the most useful ones are:
1. Scatter plots (for two continuous variables).
2. Boxplots (for one discrete and one continuous variable).
3. Histograms (for one continuous variable).
4. Bar plots (for one discrete variable).
Faceting allows you to easily make the same plot separated by a discrete variable of interest.

With the skills you’ve learned here, you’re now ready to start doing your own data exploration!

Applying it to your own data

Now that we’ve learned how impactful effective data visualization can be, and how to create informative visuals in R, it’s time for you to start thinking about what you want to do with your own data!

Discuss with your group your data and what type of exploratory data analysis you would like to perform.

Questions to answer:

In 1-2 sentences, describe the information covered in your dataset.
How large is your dataset? How many rows? How many columns?
Write down 3 specific questions you could answer with your data set.
Think through 3 data visualizations that can answer the questions you have. Specifically, for each one: a) Write down your question or goal for the plot. b) Write down the variables needed to answer your question. c) Choose a geometry. d) Choose an aesthetic for each variable. e) Draw a draft plot with pen and paper to determine whether you think these choices will work.

Glossary of terms

Tibble: the way tabular data is stored in R when using the tidyverse. We may also call it a data frame.
Geometry (geom): this describes the things that are actually drawn on the plot (like points or lines)
Aesthetic: a visual property of the objects (geoms) drawn in your plot (like x position, y position, color, size, etc)
Aesthetic mapping (aes): This is how we connect a visual property of the plot to a column of our data
Labels (labs): Text labels that make your plot clearer to understand.
Facets: Dividing your data into non-overlapping groups and making a small plot for each subgroup
Layer: Each ggplot is made up of one or more layers. Each layer contains one geometry and may also contain custom aesthetic mappings and private data
Theme: Allows you to change and customize the look of your plot.

Key Points

Use read_csv() to read tabular data in R.

Geometries are the visual elements drawn on data visualizations (lines, points, etc.), and aesthetics are the visual properties of those geometries that are assigned to variables in the data (color, position, etc.).

Use ggplot() and geoms to create data visualizations, and save them using ggsave().

R for Data Cleaning

Overview

Teaching: 105 min
Exercises: 30 min

Questions

How can I clean my data in R?

How can I combine two datasets from different sources?

How can R help make my research more reproducible?

Objectives

To become familiar with the functions of the dplyr and tidyr packages.

To be able to clean and prepare datasets for analysis.

To be able to combine two different data sources using joins.

Cleaning up data
Day 1 review
Overview of the lesson
Narrow down rows with filter()
Subset columns using select()
Checking for missing values
Checking for duplicate rows
Grouping and counting rows using group_by()
Make new variables with mutate()
Joining dataframes

Cleaning up data

Researchers are often pulling data from several sources, and the process of making data compatible with one another and prepared for analysis can be a large undertaking. Luckily, there are many functions that allow us to do this in R. Yesterday, we worked with the Global Burden of Disease (GBD 2019) dataset, which contains population, smoking rates, and lung cancer rates by year (we only used 1990). Today, we will practice cleaning and preparing a second dataset containing ambient pollution data by location and year, also sourced from the GBD 2019.

It’s always good to go into data cleaning with a clear goal in mind. Here, we’d like to prepare the ambient pollution data to be compatible with our lung cancer data so we can directly compare lung cancer rates to ambient pollution levels (we will do this tomorrow). To make this work, we’d like a dataframe that contains columns with the country name, year, and median ambient pollution levels (in micrograms per cubic meter). We will make this comparison for the first year in these datasets, 1990.

Let’s start with reviewing how to read in the data.

Day 1 review

Opening your Rproject in RStudio.

First, navigate to the un-report directory however you’d like and open un-report.Rproj. This should open the un-report R project in RStudio. You can check this by seeing if the Files in the bottom right of RStudio are the ones in your un-report directory.

Creating a new R script.

Then create a new R Script file for our work. In RStudio, choose “File” > “New File” > “R Script”. Save this file as un_data_cleaning.R.

Loading your data.

Now, let’s import the pollution dataset into our fresh new R session. It’s not clean yet, so let’s call it ambient_pollution_dirty.

ambient_pollution_dirty <- read_csv("data/ambient_pollution.csv")

Error in read_csv("data/ambient_pollution.csv"): could not find function "read_csv"

Exercise: What error do you get and why?

Fix the code so you don’t get an error and read in the dataset. Hint: Packages…
Solution

If we look in the console now, we’ll see we’ve received an error message saying that R “could not find the function read_csv()”. What this means is that R cannot find the function we are trying to call. The reason for this usually is that we are trying to run a function from a package that we have not yet loaded. This is a very common error message that you will probably see a lot when using R. It’s important to remember that you will need to load any packages you want to use into R each time you start a new session. The read_csv function comes from the readr package which is included in the tidyverse package so we will just load the tidyverse package and run the import code again:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ambient_pollution_dirty <- read_csv("data/ambient_pollution.csv")
Rows: 9660 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): location_name
dbl (2): year_id, median

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
As we saw yesterday, the output in your console shows that by doing this, we attach several useful packages for data wrangling, including readr and dplyr. Check out these packages and their documentation at tidyverse.org.
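
If you only need a single function from a package, R also lets you call it with the package::function() form without loading the package first. The sketch below is a minimal illustration, assuming the readr package is installed. It writes a tiny made-up CSV to a temporary file (the file and its values are hypothetical) so that it can run without the workshop data.

```r
# Create a small, hypothetical CSV file in a temporary location
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("location_name,year_id,median",
    "Place A,1990,40.0",
    "Place B,1990,21.7"),
  toy_path
)

# Call read_csv() through its package name, without library(tidyverse)
toy_pollution <- readr::read_csv(toy_path, show_col_types = FALSE)
toy_pollution
```

For this workshop we use many tidyverse functions, so loading the whole package once with library(tidyverse) is the more convenient choice.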

Reminder: Many of these packages, including dplyr, come with “Cheatsheets” found under the Help RStudio menu tab.

Now, let’s take a look at what this data object contains:

ambient_pollution_dirty

# A tibble: 9,660 × 3
   location_name year_id median
   <chr>           <dbl>  <dbl>
 1 Global           1990   40.0
 2 Global           1995   38.9
 3 Global           2000   40.6
 4 Global           2005   40.6
 5 Global           2010   42.7
 6 Global           2011   44.4
 7 Global           2012   46.1
 8 Global           2013   47.1
 9 Global           2014   47.3
10 Global           2015   46.1
# ℹ 9,650 more rows

It looks like our data object has three columns: location_name, year_id, and median. Median here is the median ambient pollution in micrograms per cubic meter. Scroll through the data object to get an idea of what’s there.

Plotting review: median pollution levels

Let’s refresh our plotting skills. Make a histogram of pollution levels in the ambient_pollution_dirty data object. Feel free to look back at the content from yesterday if you want!

Bonus 1: Facet by year_id to look at histograms of ambient pollution levels for each year in the dataset.

Bonus 2: Make the plot prettier by changing the axis labels, theme, and anything else you want.
Solution
ggplot(ambient_pollution_dirty, aes(x = median)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Bonus 1:
ggplot(ambient_pollution_dirty) +
  aes(x = median) +
  geom_histogram() +
  facet_wrap(~year_id)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Bonus 2 example:
ggplot(ambient_pollution_dirty, aes(x = median)) +
  geom_histogram() +
  facet_wrap(~year_id) +
  labs(x = 'Median ambient pollution (micrograms per cubic meter)', y = 'Count') +
  theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Overview of the lesson

Great, now that we’ve read in the data and practiced plotting, we can start to think about cleaning the data. Remember that our goal is to prepare the ambient pollution data to be compatible with our lung cancer rates so we can directly compare lung cancer rates to ambient pollution levels (we will do this tomorrow). To make this work, we’d like a dataframe that contains columns with the country name, year, and median ambient pollution levels (in micrograms per cubic meter). We will make this comparison for the first year in these datasets, 1990.

What data cleaning do we have to do?

Look back at the three columns in our data object: location_name, year_id, and median. What might we need to take care of in order to merge these data with our lung cancer rates dataset?

Solution

It looks like the location_name column contains values other than countries, and our year_id column has many years, and we are only interested in 1990 for now.

Narrow down rows with `filter()`

Let’s start by narrowing the dataset to only the year 1990. To do this, we will use the filter() function. Here’s what that looks like:

filter(ambient_pollution_dirty, year_id = 1990)

Error in `filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `year_id == 1990`?

Oops! We got an error, but don’t panic. Error messages are often pretty useful. In this case, it says that we used = instead of ==. That’s because we use = (single equals) when naming arguments that we are passing to functions. So here R thinks we’re trying to assign 1990 to year_id, kind of like we do when we’re telling ggplot what we want our aesthetics to be. What we really want to do is find all of the rows where year_id is equal to 1990. To do this, we have to use == (double equals), which we use when testing if two values are equal to each other:

filter(ambient_pollution_dirty, year_id == 1990)

# A tibble: 690 × 3
   location_name                          year_id median
   <chr>                                    <dbl>  <dbl>
 1 Global                                    1990   40.0
 2 Southeast Asia, East Asia, and Oceania    1990   41.3
 3 East Asia                                 1990   45.8
 4 China                                     1990   46.4
 5 Democratic People's Republic of Korea     1990   39.3
 6 Taiwan                                    1990   21.7
 7 Southeast Asia                            1990   28.3
 8 Cambodia                                  1990   26.5
 9 Indonesia                                 1990   25.5
10 Lao People's Democratic Republic          1990   25.1
# ℹ 680 more rows

Okay, so it looks like we have 690 observations for the year 1990. Some of these are countries, some are regions of the world, and we have at least one global measurement.
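
To see why == is the right tool here, it can help to look at what the comparison returns on its own. The following is a minimal sketch using a small made-up tibble (toy_pollution and its values are hypothetical, not the workshop data):

```r
library(dplyr)

# Hypothetical toy data with the same column names as ambient_pollution_dirty
toy_pollution <- tibble(
  location_name = c("Place A", "Place A", "Place B", "Place B"),
  year_id = c(1990, 1995, 1990, 1995),
  median = c(40.0, 38.9, 21.7, 20.3)
)

# == compares every value with 1990 and returns TRUE or FALSE for each row
is_1990 <- toy_pollution$year_id == 1990
is_1990

# filter() keeps only the rows where the comparison is TRUE
toy_1990 <- filter(toy_pollution, year_id == 1990)
toy_1990
```

Other comparison operators such as !=, <, and >= work the same way inside filter().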

Before we move on, I want to show you a tool called a pipe operator that will be really helpful as we continue. Instead of including the data object as an argument, we can use the pipe operator %>% to pass the data value into the filter function. You can think of %>% as another way to type “and then.”

ambient_pollution_dirty %>% filter(year_id == 1990)

# A tibble: 690 × 3
   location_name                          year_id median
   <chr>                                    <dbl>  <dbl>
 1 Global                                    1990   40.0
 2 Southeast Asia, East Asia, and Oceania    1990   41.3
 3 East Asia                                 1990   45.8
 4 China                                     1990   46.4
 5 Democratic People's Republic of Korea     1990   39.3
 6 Taiwan                                    1990   21.7
 7 Southeast Asia                            1990   28.3
 8 Cambodia                                  1990   26.5
 9 Indonesia                                 1990   25.5
10 Lao People's Democratic Republic          1990   25.1
# ℹ 680 more rows

This line of code will do the exact same thing as our first filter() command, but the pipe operator tells R to use the ambient_pollution_dirty dataframe as the first argument in the next function.

This lets us “chain” together multiple functions, which will be helpful later. Note that the pipe (%>%) is a bit different from using the ggplot plus (+). Pipes take the output from the left side and use it as input to the right side. In other words, it tells R to do the function on the left and then the function on the right. In contrast, plusses layer on additional information (right side) to a preexisting plot (left side).

We can also add an Enter to make it look nicer:

ambient_pollution_dirty %>%
  filter(year_id == 1990)

# A tibble: 690 × 3
   location_name                          year_id median
   <chr>                                    <dbl>  <dbl>
 1 Global                                    1990   40.0
 2 Southeast Asia, East Asia, and Oceania    1990   41.3
 3 East Asia                                 1990   45.8
 4 China                                     1990   46.4
 5 Democratic People's Republic of Korea     1990   39.3
 6 Taiwan                                    1990   21.7
 7 Southeast Asia                            1990   28.3
 8 Cambodia                                  1990   26.5
 9 Indonesia                                 1990   25.5
10 Lao People's Democratic Republic          1990   25.1
# ℹ 680 more rows

Using the pipe operator %>% and Enter makes our code more readable. The pipe operator %>% also helps to avoid using nested functions and minimizes the need for new variables.
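
To make the point about nested functions concrete, here is a minimal sketch that writes the same two steps both ways. It assumes a small made-up tibble (toy_pollution, hypothetical) instead of the workshop data, and uses arrange() and desc() from dplyr to sort the result.

```r
library(dplyr)

# Hypothetical toy data with the same column names as ambient_pollution_dirty
toy_pollution <- tibble(
  location_name = c("Place A", "Place A", "Place B", "Place C"),
  year_id = c(1990, 1995, 1990, 1990),
  median = c(40.0, 38.9, 21.7, 55.2)
)

# Nested version: read from the inside out
nested <- arrange(filter(toy_pollution, year_id == 1990), desc(median))

# Piped version: read from top to bottom, one step per line
piped <- toy_pollution %>%
  filter(year_id == 1990) %>%
  arrange(desc(median))

# Both approaches give the same result
all.equal(nested, piped)
```

The two versions do the same work; the piped one simply lists the steps in the order they happen.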

Bonus: Pipe keyboard shortcut

Since we use the pipe operator so often, there is a keyboard shortcut for it in RStudio. You can press Ctrl+Shift+M on Windows or Cmd+Shift+M on a Mac.

Bonus Exercise: Viewing data

Sometimes it can be helpful to explore your data in the View tab. Filter ambient_pollution_dirty to only entries from 1990 and use the pipe operator and View() to explore the filtered data. Click on the column names to reorder the data however you’d like.
Solution:
ambient_pollution_dirty %>%
  filter(year_id == 1990) %>%
  View()
Once you’re done, close out of that window and go back to the window with your code in it.

Bonus Exercise: sorting columns

We just used the View tab to sort our data, but how could you use code to sort by the median column? Try to figure it out by searching on the Internet.
Solution:
ambient_pollution_dirty %>%
  filter(year_id == 1990) %>% 
  arrange(desc(median)) 
# A tibble: 690 × 3
   location_name year_id median
   <chr>           <dbl>  <dbl>
 1 Qatar            1990   78.2
 2 Niger            1990   70.6
 3 Nigeria          1990   69.4
 4 India            1990   68.4
 5 Egypt            1990   66.1
 6 South Asia       1990   65.2
 7 South Asia       1990   65.2
 8 Cameroon         1990   65.0
 9 Mauritania       1990   64.8
10 Nepal            1990   64.0
# ℹ 680 more rows
The arrange() function is very helpful for sorting data objects based on one or more columns. Notice we also included the function desc(), which tells arrange() to sort in descending order (largest to smallest).

Great! We’ve managed to reduce our dataset to only the rows corresponding to 1990. Now, the year_id column is obsolete, so let’s learn how to get rid of it.

Subset columns using `select()`

We use the filter() function to choose a subset of the rows from our data, but when we want to choose a subset of columns from our data we use select(). For example, if we only wanted to see the year (year_id) and median values, we can do:

ambient_pollution_dirty %>%
  select(year_id, median)

# A tibble: 9,660 × 2
   year_id median
     <dbl>  <dbl>
 1    1990   40.0
 2    1995   38.9
 3    2000   40.6
 4    2005   40.6
 5    2010   42.7
 6    2011   44.4
 7    2012   46.1
 8    2013   47.1
 9    2014   47.3
10    2015   46.1
# ℹ 9,650 more rows

We can also use select() to drop/remove particular columns by putting a minus sign (-) in front of the column name. For example, if we want everything but the year_id column, we can do:

ambient_pollution_dirty %>%
  select(-year_id)

# A tibble: 9,660 × 2
   location_name median
   <chr>          <dbl>
 1 Global          40.0
 2 Global          38.9
 3 Global          40.6
 4 Global          40.6
 5 Global          42.7
 6 Global          44.4
 7 Global          46.1
 8 Global          47.1
 9 Global          47.3
10 Global          46.1
# ℹ 9,650 more rows

Selecting columns

Create a dataframe with only the location_name and year_id columns.

Solution:

There are multiple ways to do this exercise. Here are two different possibilities.

ambient_pollution_dirty %>%
  select(location_name, year_id)

# A tibble: 9,660 × 2
   location_name year_id
   <chr>           <dbl>
 1 Global           1990
 2 Global           1995
 3 Global           2000
 4 Global           2005
 5 Global           2010
 6 Global           2011
 7 Global           2012
 8 Global           2013
 9 Global           2014
10 Global           2015
# ℹ 9,650 more rows

ambient_pollution_dirty %>%
  select(-median)

# A tibble: 9,660 × 2
   location_name year_id
   <chr>           <dbl>
 1 Global           1990
 2 Global           1995
 3 Global           2000
 4 Global           2005
 5 Global           2010
 6 Global           2011
 7 Global           2012
 8 Global           2013
 9 Global           2014
10 Global           2015
# ℹ 9,650 more rows

Exercise: use filter() and select() to narrow down our dataframe to only the location_name and median for 1990

Combine the two functions you have learned so far with the pipe operator to narrow down the dataset to the location names and ambient pollution in the year 1990. Save it to an object called pollution_1990_dirty.
Solution
pollution_1990_dirty <- ambient_pollution_dirty %>%
  filter(year_id == 1990) %>%
  select(-year_id)

Great! Now we have our dataset narrowed down and saved to an object.

Checking for duplicate rows

Let’s check to see if our dataset contains duplicate rows. We already know that we have some rows with identical location names, but are the rows identical? We can use the distinct() function, which removes any rows for which all values are duplicates of another row, followed by count() to find out.

Getting distinct rows

Find the number of distinct rows in pollution_1990_dirty by piping the data into distinct() and then count().
Solution:
pollution_1990_dirty %>%
  distinct() %>%
  count()
# A tibble: 1 × 1
      n
  <int>
1   688

You can see that after applying the distinct() function, our dataset only has 688 rows. This tells us that there are two rows that were exactly identical to other rows in our dataset.

Note that the distinct() function without any arguments checks to see if an entire row is duplicated. We can check to see if there are duplicates in a specific column by writing the column name in the distinct() function. This is helpful if you need to know whether there are multiple rows for some sample ids, for example. Let’s try it here with location name. Before we do, what do you expect to see?

pollution_1990_dirty %>%
    distinct(location_name) %>%
    count()

# A tibble: 1 × 1
      n
  <int>
1   685

All right, so we expected at least two rows would be eliminated, because we know there are two rows that are completely identical. But here, we can see that there are up to 5 location_name values with multiple rows, suggesting that some locations may have multiple entries with different values. It’s important to check these out because they might indicate issues with data entry or discordant data.
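
The difference between the two calls is easier to see on a dataset small enough to check by eye. Below is a minimal sketch, assuming a made-up tibble (toy_dirty, hypothetical) in which one location has two different values and another has an exact duplicate row:

```r
library(dplyr)

# Hypothetical toy data: Place A has two different values,
# Place B has two identical rows, Place C appears once
toy_dirty <- tibble(
  location_name = c("Place A", "Place A", "Place B", "Place B", "Place C"),
  median = c(10, 12, 20, 20, 30)
)

# distinct() with no arguments drops only rows that match in every column
n_distinct_rows <- toy_dirty %>%
  distinct() %>%
  nrow()

# distinct(location_name) keeps one row per location name
n_distinct_locations <- toy_dirty %>%
  distinct(location_name) %>%
  nrow()

n_distinct_rows
n_distinct_locations
```

Here only the exact duplicate for Place B is dropped by the first call, while the second call also collapses the two differing rows for Place A.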

Grouping and counting rows using `group_by()` and `count()`

The group_by() function allows us to treat rows in logical groups defined by categories in at least one column. This will allow us to get summary values for each group. The group_by() function expects you to pass in the name of a column (or multiple columns separated by commas) in your data. When we put it together with count(), we will be able to see how many rows are in each group.

Let’s do this for our pollution_1990_dirty dataset:

pollution_1990_dirty %>%
    group_by(location_name) %>%
    count()

# A tibble: 685 × 2
# Groups:   location_name [685]
   location_name      n
   <chr>          <int>
 1 Aceh               1
 2 Acre               1
 3 Afghanistan        1
 4 Africa             1
 5 African Region     1
 6 African Union      1
 7 Aguascalientes     1
 8 Aichi              1
 9 Akershus           1
10 Akita              1
# ℹ 675 more rows

It’s kind of hard to find the ones with two values. Let’s arrange the counts from highest to lowest using the arrange() function to make it easier to see:

pollution_1990_dirty %>%
    group_by(location_name) %>%
    count() %>%
    arrange(-n)

# A tibble: 685 × 2
# Groups:   location_name [685]
   location_name                    n
   <chr>                        <int>
 1 Georgia                          2
 2 North Africa and Middle East     2
 3 South Asia                       2
 4 Stockholm                        2
 5 Sweden except Stockholm          2
 6 Aceh                             1
 7 Acre                             1
 8 Afghanistan                      1
 9 Africa                           1
10 African Region                   1
# ℹ 675 more rows

Note that you might get a message about the summarize function regrouping the output by ‘location_name’. This simply indicates what the function is grouping by.

We can also group by multiple variables. We’ll do more with this later. Now, we know which locations have multiple entries - but what if we want to look at them?
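
As a possible shortcut, count() can take the column name directly and sort the result with its sort argument, and filter() can then keep only the repeated locations. This is a minimal sketch on a made-up tibble (toy_dirty, hypothetical), not the approach used in the rest of this lesson:

```r
library(dplyr)

# Hypothetical toy data with repeated location names
toy_dirty <- tibble(
  location_name = c("Place A", "Place A", "Place B", "Place B", "Place C"),
  median = c(10, 12, 20, 20, 30)
)

# Count rows per location, largest counts first,
# then keep only the locations that appear more than once
repeated <- toy_dirty %>%
  count(location_name, sort = TRUE) %>%
  filter(n > 1)

repeated
```

The result lists only the locations that need a closer look, which is handy when the full table has hundreds of rows.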

Review: Filtering to specific location names

Break into your groups of two and choose a location name that has multiple entries. Filter pollution_1990_dirty to look at those entries in the dataset.
Example solution:
pollution_1990_dirty %>%
  filter(location_name == "Georgia")
# A tibble: 2 × 2
  location_name median
  <chr>          <dbl>
1 Georgia         17.9
2 Georgia         15.1

Now, we want to clean these data up so there is only one row per location. To do that, we will need to add a new column with revised pollution levels.

Make new variables with `mutate()`

The function we use to create new columns is called mutate(). Let’s go ahead and take care of the location_names which have two different median pollution values by making a new column called pollution that is the mean of median. We can then remove the median column and store the resulting data object as pollution_1990.

pollution_1990 <- pollution_1990_dirty %>%
  group_by(location_name) %>%
  mutate(pollution = mean(median)) %>%
  select(-median) %>%
  distinct()

You can see that pollution_1990 has 685 rows, as we expect, since we took care of the duplicated location_names.

Note: here, we took the mean to take care of duplicates and multiple entries, but this is not always the best way to do so. When working with your own data, make sure to think carefully about your dataset, what these multiple entries really mean, and whether you want to leave them as they are or take care of them in some different way.
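
One alternative way to collapse multiple entries, shown here only as a sketch and not as the method used in this lesson, is summarize() from dplyr, which returns one row per group directly. The tibble below (toy_dirty) and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with repeated location names
toy_dirty <- tibble(
  location_name = c("Place A", "Place A", "Place B", "Place B", "Place C"),
  median = c(10, 12, 20, 20, 30)
)

# summarize() computes one value per group and returns one row per group
toy_clean <- toy_dirty %>%
  group_by(location_name) %>%
  summarize(pollution = mean(median))

toy_clean
```

With a single grouping column, the result of summarize() is no longer grouped, so later steps such as count() behave as they would on an ordinary dataframe.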

Bonus Exercise: Check to see if all rows are distinct

Do we have any duplicated rows in our pollution_1990 dataset now? HINT: You might get an unexpected result. Look at the code we used to make pollution_1990 to try to figure out why.
Example solution:
pollution_1990 %>%
  distinct() %>%
  count()
# A tibble: 685 × 2
# Groups:   location_name [685]
   location_name      n
   <chr>          <int>
 1 Aceh               1
 2 Acre               1
 3 Afghanistan        1
 4 Africa             1
 5 African Region     1
 6 African Union      1
 7 Aguascalientes     1
 8 Aichi              1
 9 Akershus           1
10 Akita              1
# ℹ 675 more rows
Hmm, that’s not like the counts we’ve gotten before. That’s because our dataframe is still grouped by location_name. Here, we actually took distinct rows, and counted them, within each group. What we want is distinct rows for the entire dataset (which should be the same thing since each group is unique). To get the output we want, we can use the ungroup() function before calling distinct():
pollution_1990 %>%
  ungroup() %>%
  distinct() %>%
  count()
# A tibble: 1 × 1
      n
  <int>
1   685
Since the number of rows in pollution_1990 is equal to the number of rows after calling distinct(), this means we no longer have any duplicated rows in our dataset.
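
If you are ever unsure whether a dataframe is still grouped, dplyr's group_vars() function lists the grouping columns. The following minimal sketch uses a made-up tibble (toy_grouped, hypothetical) to show how grouping changes what count() returns:

```r
library(dplyr)

# Hypothetical toy data, left grouped on purpose
toy_grouped <- tibble(
  location_name = c("Place A", "Place B", "Place C"),
  pollution = c(11, 20, 30)
) %>%
  group_by(location_name)

# group_vars() lists the columns a dataframe is currently grouped by
group_vars(toy_grouped)

# While grouped, count() reports one row per group
grouped_counts <- toy_grouped %>%
  count()

# After ungroup(), count() reports a single total
total_count <- toy_grouped %>%
  ungroup() %>%
  count()

grouped_counts
total_count
```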

Joining dataframes

Now we’re almost ready to join our pollution data to the smoking and lung cancer data. Let’s read in our smoking_cancer_1990.csv and save it to an object called smoking_1990.

smoking_1990 <- read_csv("data/smoking_cancer_1990.csv")

Rows: 191 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, pop, smoke_pct, lung_cancer_pct

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Look at the data in pollution_1990 and smoking_1990. If you had to merge these two dataframes together, which columns would you use to merge them together? If you said location_name and country, you’re right! But before we join the datasets, we need to make sure these columns are named the same thing.

Re-naming columns

Rename the location_name column to country in the pollution_1990 dataset. Store the result in an object called pollution_1990_clean. HINT: The function you want is part of the dplyr package. Try to guess what the name of the function is, and if you’re having trouble, try searching for it on the Internet.
Solution:
pollution_1990_clean <- pollution_1990 %>%
  rename(country = location_name) 
Note that the column is labeled country even though it has values beyond the names of countries. We will take care of this later when we join datasets.

Because the country column is now present in both datasets, we’ll call country our “key”. We want to match the rows in each dataframe together based on this key. Note that the values within the country column have to be exactly identical for them to match (including the same case).

What problems might we run into with merging?

Solution

We might not have pollution data for all of the countries in the smoking_1990 dataset and vice versa. Also, a country might be represented in both dataframes but not by the same name in both places.
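To make the second problem concrete, here is a minimal sketch using small made-up tibbles (smoking_toy and pollution_toy are hypothetical and are not the workshop files). It shows how a key that differs only in spelling or case fails to match in a left join.

```r
library(dplyr)

# Hypothetical toy data for illustration only
smoking_toy <- tibble(
  country   = c("Kenya", "Vietnam", "Peru"),
  smoke_pct = c(10, 29, 12)
)
pollution_toy <- tibble(
  country   = c("Kenya", "Viet Nam", "peru"),
  pollution = c(25, 26, 30)
)

# "Kenya" matches exactly; "Vietnam" and "Peru" find no identical key,
# so their pollution values are filled with NA
joined_toy <- left_join(smoking_toy, pollution_toy, by = "country")
joined_toy
```

Only the row for Kenya receives a pollution value here, because "Viet Nam" and "peru" are not identical to "Vietnam" and "Peru".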

The dplyr package has a number of tools for joining dataframes together, which differ in what they do with the rows for countries that are not represented in both dataframes. Here we’ll be using left_join(). In a “left join”, the new dataframe keeps every row of the first dataframe listed and adds the matching data from the second; rows whose key is found only in the second dataframe are dropped. This is a very commonly used join.

Bonus: Other dplyr join functions

Other joins can be performed using inner_join(), right_join(), full_join(), and anti_join(). In a “left join”, if the key is present in the left hand dataframe, it will appear in the output, even if it is not found in the right hand dataframe. For a right join, the opposite is true. For an inner join, only keys found in both dataframes are included. For a full join, all possible keys are included in the output dataframe. For an anti join, only rows whose key is found in the left dataframe but not in the right one are included.
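The differences between these joins are easiest to see on a tiny example. The sketch below uses two made-up tibbles (left_toy and right_toy are hypothetical names) that share the keys "B" and "C".

```r
library(dplyr)

# Hypothetical toy data: keys A, B, C on the left and B, C, D on the right
left_toy  <- tibble(country = c("A", "B", "C"), smoke_pct = c(10, 20, 30))
right_toy <- tibble(country = c("B", "C", "D"), pollution = c(5, 6, 7))

inner_toy <- inner_join(left_toy, right_toy, by = "country")  # keys in both: B, C
left_j    <- left_join(left_toy, right_toy, by = "country")   # all left keys: A, B, C
right_j   <- right_join(left_toy, right_toy, by = "country")  # all right keys: B, C, D
full_toy  <- full_join(left_toy, right_toy, by = "country")   # every key: A, B, C, D
anti_toy  <- anti_join(left_toy, right_toy, by = "country")   # left keys with no match: A
```

Note that anti_join() returns only the columns of the left dataframe, which makes it handy for listing the keys that failed to match.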

Let’s give the left_join() function a try. We will put our smoking_1990 dataset on the left so that we maintain all of the rows we had in that dataset.

left_join(smoking_1990, pollution_1990_clean)

Joining with `by = join_by(country)`

# A tibble: 191 × 7
    year country            continent    pop smoke_pct lung_cancer_pct pollution
   <dbl> <chr>              <chr>      <dbl>     <dbl>           <dbl>     <dbl>
 1  1990 Afghanistan        Asia      1.24e7      3.12          0.0127     46.2 
 2  1990 Albania            Europe    3.29e6     24.2           0.0327     23.8 
 3  1990 Algeria            Africa    2.58e7     18.9           0.0118     29.4 
 4  1990 Andorra            Europe    5.45e4     36.6           0.0609     13.4 
 5  1990 Angola             Africa    1.18e7     12.5           0.0139     27.7 
 6  1990 Antigua and Barbu… North Am… 6.25e4      6.80          0.0105     16.2 
 7  1990 Argentina          South Am… 3.26e7     30.4           0.0344     14.8 
 8  1990 Armenia            Europe    3.54e6     30.5           0.0441     30.0 
 9  1990 Australia          Oceania   1.71e7     29.3           0.0599      7.13
10  1990 Austria            Europe    7.68e6     35.4           0.0439     20.3 
# ℹ 181 more rows

We now have data from both datasets joined together in the same dataframe. Notice that the number of rows here, 191, is the same as the number of rows in the smoking_1990 dataset. Also note that left_join() tells us that it joined by “country”.
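If you prefer not to rely on that automatic choice, the key can be named explicitly with the by argument, which also silences the message. The sketch below uses made-up tibbles (hypothetical names and values) and shows the named-vector form, which joins columns that have different names in the two dataframes. This is an alternative to renaming the column first, not what the lesson itself does.

```r
library(dplyr)

# Hypothetical toy data: the key is called "country" on the left
# and "location_name" on the right
smoking_toy   <- tibble(country = c("Kenya", "Peru"), smoke_pct = c(10, 12))
pollution_toy <- tibble(location_name = c("Kenya", "Peru"), pollution = c(25, 30))

# by = c("left name" = "right name"); the output keeps the left-hand name
joined_toy <- left_join(smoking_toy, pollution_toy,
                        by = c("country" = "location_name"))
joined_toy
```

When the key has the same name in both dataframes, by = "country" is enough.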

Alright, let’s explore this joined data a little bit.

Checking for missing values


First, let’s check for any missing values. We will start by using the drop_na() function, which is a tidyverse function that removes any rows that have missing values. Then we will check the number of rows in our dataset using count() and compare to the original to see if we lost any rows with missing data.

left_join(smoking_1990, pollution_1990_clean) %>%
  drop_na() %>%
  distinct() %>% 
  count() 

Joining with `by = join_by(country)`

# A tibble: 1 × 1
      n
  <int>
1   189

It looks like the dataframe has 189 rows after we drop any observations with missing values. This means there are two rows with missing values.

Note that since we used left_join(), we expect all the data from the smoking_1990 dataset to be there, so if we have missing values, they will be in the pollution column. We will look for rows with missing values in the pollution column using the filter() function and is.na(), which is helpful for identifying missing data.

left_join(smoking_1990, pollution_1990_clean) %>%
  filter(is.na(pollution))

Joining with `by = join_by(country)`

# A tibble: 2 × 7
   year country         continent      pop smoke_pct lung_cancer_pct pollution
  <dbl> <chr>           <chr>        <dbl>     <dbl>           <dbl>     <dbl>
1  1990 Slovak Republic Europe     5299187      33.7          0.0699        NA
2  1990 Vietnam         Asia      67988855      29.4          0.0216        NA

We can see that we’re missing pollution data for Vietnam and Slovak Republic. Note that we were expecting two rows with missing values, and we found both of them! That’s great news.
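Here we already knew which column to check. When you are less sure, one option is to count the missing values in every column at once. This is a minimal sketch on a made-up tibble (joined_toy is hypothetical); it combines summarise() with dplyr's across() helper, which the lesson has not introduced.

```r
library(dplyr)

# Hypothetical toy data standing in for a joined dataframe
joined_toy <- tibble(
  country   = c("Kenya", "Vietnam", "Slovak Republic"),
  smoke_pct = c(10, 29, 34),
  pollution = c(25, NA, NA)
)

# For every column, count how many values are NA
na_counts <- joined_toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts
```

In this toy example only the pollution column has a non-zero count.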

If we look at the pollution_1990_clean data by clicking on it in the environment and sort by country, we can see that Vietnam and Slovak Republic are called different things in the pollution_1990_clean dataframe. They’re called “Viet Nam” and “Slovakia,” respectively. Using mutate() and case_when(), we can update the pollution_1990_clean data so that the country names for Vietnam and Slovak Republic match those in the smoking_1990 data. case_when() is a super useful function that uses information from a column (or columns) in your dataset to update or create new columns.

Let’s use case_when() to change “Viet Nam” to “Vietnam”.

pollution_1990_clean %>%
  mutate(country = case_when(country == "Viet Nam" ~ "Vietnam", 
                             TRUE ~ country))

# A tibble: 685 × 2
# Groups:   country [685]
   country                                pollution
   <chr>                                      <dbl>
 1 Global                                      40.0
 2 Southeast Asia, East Asia, and Oceania      41.3
 3 East Asia                                   45.8
 4 China                                       46.4
 5 Democratic People's Republic of Korea       39.3
 6 Taiwan                                      21.7
 7 Southeast Asia                              28.3
 8 Cambodia                                    26.5
 9 Indonesia                                   25.5
10 Lao People's Democratic Republic            25.1
# ℹ 675 more rows

Practicing `case_when()`

Starting with the code we wrote above, add to it to change “Slovakia” to “Slovak Republic”.

One possible solution:

pollution_1990_clean %>%
  mutate(country = case_when(country == "Viet Nam" ~ "Vietnam", 
                             country == "Slovakia" ~ "Slovak Republic",
                             TRUE ~ country))

# A tibble: 685 × 2
# Groups:   country [685]
   country                                pollution
   <chr>                                      <dbl>
 1 Global                                      40.0
 2 Southeast Asia, East Asia, and Oceania      41.3
 3 East Asia                                   45.8
 4 China                                       46.4
 5 Democratic People's Republic of Korea       39.3
 6 Taiwan                                      21.7
 7 Southeast Asia                              28.3
 8 Cambodia                                    26.5
 9 Indonesia                                   25.5
10 Lao People's Democratic Republic            25.1
# ℹ 675 more rows

Checking to see if our code worked

Starting with the code we wrote above, add to or modify it to see if it worked the way we want it to - did we change “Viet Nam” to “Vietnam” and “Slovakia” to “Slovak Republic” while keeping everything else the same?
One possible solution:
pollution_1990_clean %>%
  mutate(country_new = case_when(country == "Viet Nam" ~ "Vietnam",
                                 country == "Slovakia" ~ "Slovak Republic",
                                 TRUE ~ country)) %>%
  filter(country != country_new)
# A tibble: 2 × 3
# Groups:   country [2]
  country  pollution country_new    
  <chr>        <dbl> <chr>          
1 Viet Nam      25.6 Vietnam        
2 Slovakia      26.1 Slovak Republic

Once we’re sure that our code is working correctly, let’s save this to pollution_1990_clean.

pollution_1990_clean <- pollution_1990_clean %>%
  mutate(country = case_when(country == "Viet Nam" ~ "Vietnam", 
                             country == "Slovakia" ~ "Slovak Republic",
                             TRUE ~ country))

IMPORTANT: Here, we overwrote our pollution_1990_clean dataframe. In other words, we replaced the existing data object with a new one. This is generally NOT recommended practice, but is often needed when first performing exploratory data analysis as we are here. After you finish exploratory analysis, it’s always a good idea to go back and clean up your code to avoid overwriting objects.

Bonus Exercise: Cleaning up code

How would you clean up your code to avoid overwriting pollution_1990_clean as we did above? Hint: start with the pollution_1990 dataframe. Challenge: Start at the very beginning, from reading in your data, and clean it all in one big step (this is what we do once we’ve figured out how we want to clean our data - we then clean up our code).

Solution:

pollution_1990_clean <- pollution_1990 %>%
  rename(country = location_name) %>%
  mutate(country = case_when(country == "Viet Nam" ~ "Vietnam", 
                             country == "Slovakia" ~ "Slovak Republic",
                             TRUE ~ country))

Challenge solution:

pollution_1990_clean <- read_csv("data/ambient_pollution.csv") %>%
  filter(year_id == 1990) %>%
  select(-year_id) %>%
  group_by(location_name) %>%
  mutate(pollution = mean(median)) %>%
  ungroup() %>%
  select(-median) %>%
  distinct() %>%
  rename(country = location_name) %>%
  mutate(country = case_when(country == "Viet Nam" ~ "Vietnam", 
                             country == "Slovakia" ~ "Slovak Republic",
                             TRUE ~ country))

Rows: 9660 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): location_name
dbl (2): year_id, median

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Alright, now let’s left_join() our dataframes again and filter for missing values to see how it looks.

left_join(smoking_1990, pollution_1990_clean) %>%
  filter(is.na(pollution))

Joining with `by = join_by(country)`

# A tibble: 0 × 7
# ℹ 7 variables: year <dbl>, country <chr>, continent <chr>, pop <dbl>,
#   smoke_pct <dbl>, lung_cancer_pct <dbl>, pollution <dbl>

Now you can see that we have an empty dataframe! That’s great news; it means that we do not have any rows with missing pollution data.

Finally, let’s use left_join() to create a new dataframe:

smoking_pollution <- left_join(smoking_1990, pollution_1990_clean)

Joining with `by = join_by(country)`

We have reached our data cleaning goal! One of the best aspects of doing all of these steps coded in R is that our efforts are reproducible, and the raw data is maintained. With good documentation of data cleaning and analysis steps, we could easily share our work with another researcher who would be able to repeat what we’ve done. However, it’s also nice to have a saved csv copy of our clean data. That way we can access it later without needing to redo our data cleaning, and we can also share the cleaned data with collaborators. To save our dataframe, we’ll use write_csv().

write_csv(smoking_pollution, "data/smoking_pollution.csv")
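To check that a saved file reads back the way we expect, we can do a quick round trip. The sketch below is an illustration only: it writes a made-up tibble (clean_toy is hypothetical) to a temporary file rather than to the data folder, then reads it back and compares the columns.

```r
library(readr)
library(tibble)

# Hypothetical toy data standing in for a cleaned dataframe
clean_toy <- tibble(country = c("Kenya", "Peru"), pollution = c(25.5, 30.25))

# Write to a temporary file so nothing in the project folder is touched
out_path <- tempfile(fileext = ".csv")
write_csv(clean_toy, out_path)

# Read the file back in and compare it to what we wrote
reread_toy <- read_csv(out_path, show_col_types = FALSE)
identical(reread_toy$country, clean_toy$country)
all.equal(reread_toy$pollution, clean_toy$pollution)
```

Both comparisons should return TRUE, which tells us the file on disk holds the same values as the object in R.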

Great! Now our data is ready to analyze tomorrow.

Applying it to your own data


Now that we’ve learned how to clean data, it’s time to read in, clean, and make plots with your own data! Use your ideas from your brainstorming session yesterday to help you get started, but feel free to branch out and explore other things as well. Let us know if you have questions; we’re here to help.

Make sure you have your R project opened in RStudio.
Open a new file in R and save it with an informative name.
Read in your data.
Explore your data and clean as needed.
- What did you identify that you have to address before you can start analyzing the data?
  - e.g. missing data, column names with spaces, columns with both numbers and characters
Create at least 3 plots of your data that help answer the questions you posed yesterday.

Answer the questions below as you go through these steps.

What did you learn as you explored your data? Did you have to modify your questions, and if so, why and how?
What did you have to do to clean your data?
What plots did you work on that relate to your questions of interest?

Glossary of terms


Pipe (%>%): takes input (before pipe) and then performs next step (after pipe).
filter(): keeps only certain rows.
select(): keeps only certain columns.
group_by(): groups rows by a certain column.
mutate(): makes new columns.
count(): counts rows; if grouped, counts within groups.
drop_na(): removes any rows with NA values.
distinct(): removes any rows that are entirely duplicated.
left_join(): joins two dataframes by common column names, keeps all rows in left dataframe.
case_when(): uses information from columns to update/create a column.
write_csv(): saves dataframe to a csv file.

Key Points

Package loading is an important first step in preparing an R environment.

Assessing data source and structure is an important first step in analysis.

There are many useful functions in the tidyverse packages that can aid in data analysis.

Preparing data for analysis can take significant effort and planning.

R for Data Analysis

Overview

Teaching: 75 min
Exercises: 15 min

Questions

How can I summarise my data in R?

How can R help make my research more reproducible?

Objectives

To become familiar with the functions of the dplyr and tidyr packages.

To be able to create plots and summary tables to answer analysis questions.

Day 2 review
Overview of the lesson
Get stats fast with summarise()
Plotting for exploratory data analysis
Bonus content
Calculating percentages
Changing the shape of the data
Plotting wide data
Additional practice
Applying it to your own data

Day 2 review

Yesterday we learned all about data cleaning, but we didn’t cover everything.

Make a list of what we learned yesterday related to data cleaning.
Make a list of things we didn’t learn yesterday that you sometimes have to do while data cleaning.
Choose one item from the list of things we didn’t learn and use the Internet to search for a way to do it using the tidyverse.

Overview of the lesson

So far, you’ve learned how to load, plot, merge, and clean data. In the process, you’ve learned a lot of new functions which are useful for transforming data. Today, we are going to put all those new skills you learned together and learn a few new functions that will be really helpful for exploratory data analysis. We’ll start with a few examples of how plotting can be a really useful tool for exploratory data analysis. Then, we’ll learn a function that will help us get summary statistics for our data and compare those summary statistics to plots. We’ll also learn how to calculate proportions and percentages. Finally, we’ll learn how to change the shape of data to make certain analyses more straightforward. First, let’s read in the data on smoking, lung cancer rates, and pollution that we generated yesterday:

library(tidyverse) 

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

smoking_pollution <- read_csv("data/smoking_pollution.csv")

Rows: 191 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (5): year, pop, smoke_pct, lung_cancer_pct, pollution

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Get stats fast with `summarise()`


Let’s say we would like to know the mean (average) smoking rate in the dataset. R has a built-in function called mean() that will calculate this value for us. We can apply that function to our smoke_pct column using the summarise() function. Here’s what that looks like:

smoking_pollution %>%
    summarise(mean_smoke_pct = mean(smoke_pct))

# A tibble: 1 × 1
  mean_smoke_pct
           <dbl>
1           23.9

When we call summarise(), we can use any of the column names of our data object as values to pass to other functions. summarise() will return a new data object and our value will be returned as a column.

Note: The summarise() and summarize() functions are identical; you can use either spelling.

The mean_smoke_pct= part tells summarise() to use “mean_smoke_pct” as the name of the new column. Note that you don’t have to have quotes around this new name as long as it starts with a letter and doesn’t include a space.

When you call summarise(), you can also create more than one new column. To do so, you must separate your columns with a comma. Building on the code from above, let’s add new columns that calculate the minimum and maximum percent of smokers.

smoking_pollution %>%
  summarise(mean_smoke_pct=mean(smoke_pct),
            min_smoke_pct=min(smoke_pct),
            max_smoke_pct=max(smoke_pct))

# A tibble: 1 × 3
  mean_smoke_pct min_smoke_pct max_smoke_pct
           <dbl>         <dbl>         <dbl>
1           23.9          3.12          46.9
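One thing to watch for with your own data: summary functions such as mean(), min(), and max() return NA if the column contains any missing values. The smoking data used here has none, so the sketch below uses a made-up column (toy is hypothetical) to show the na.rm = TRUE argument, which tells these functions to skip missing values.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(smoke_pct = c(10, 20, NA))

toy_summary <- toy %>%
  summarise(mean_default = mean(smoke_pct),                # NA, because of the missing value
            mean_na_rm   = mean(smoke_pct, na.rm = TRUE))  # 15, the mean of 10 and 20
toy_summary
```

Whether dropping missing values is appropriate depends on why they are missing, so it is worth checking before adding na.rm = TRUE everywhere.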

Perhaps one of the most powerful ways to use summarise() is to combine it with group_by(). This enables you to calculate summary statistics for specific groups. For example, suppose we wanted to calculate the mean, min, and max smoke_pct for each continent. How would you modify the code above?

smoking_pollution %>%
    group_by(continent) %>%
    summarise(mean_smoke_pct=mean(smoke_pct),
            min_smoke_pct=min(smoke_pct),
            max_smoke_pct=max(smoke_pct))

# A tibble: 6 × 4
  continent     mean_smoke_pct min_smoke_pct max_smoke_pct
  <chr>                  <dbl>         <dbl>         <dbl>
1 Africa                  15.0          3.81          31.9
2 Asia                    25.1          3.12          39.7
3 Europe                  34.2         21.8           46.1
4 North America           16.1          6.55          33.1
5 Oceania                 33.3         21.2           46.9
6 South America           24.4          8.33          39.5
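When comparing groups, it often helps to know how many rows each summary is based on. A minimal sketch, assuming a made-up tibble (toy is hypothetical): dplyr's n() function, used inside summarise(), counts the rows in each group.

```r
library(dplyr)

# Hypothetical toy data: two countries in Africa, one in Asia
toy <- tibble(
  continent = c("Africa", "Africa", "Asia"),
  smoke_pct = c(10, 20, 30)
)

toy_summary <- toy %>%
  group_by(continent) %>%
  summarise(n_countries    = n(),              # number of rows in each group
            mean_smoke_pct = mean(smoke_pct))  # mean within each group
toy_summary
```

A mean based on one or two countries deserves more caution than one based on fifty.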

Exercise: Summary stats and boxplots

Part 1: Use group_by() and summarise() to find the median, min, max, and interquartile range of pollution for each continent.

Part 2: Make a box plot of pollution on the y axis and continent on the x axis. Compare your plot to your table. What do you notice? Which do you think is easier to interpret?
Solution

Part 1: Use group_by() and summarise() to find the median, min, max, and interquartile range of pollution for each continent.
smoking_pollution %>%
  group_by(continent) %>%
  summarise(med_pollution = median(pollution),
            min_pollution = min(pollution),
            max_pollution = max(pollution),
            iqr_pollution = IQR(pollution))
# A tibble: 6 × 5
  continent     med_pollution min_pollution max_pollution iqr_pollution
  <chr>                 <dbl>         <dbl>         <dbl>         <dbl>
1 Africa                35.7          14.2           70.6         24.1 
2 Asia                  29.4           7.63          78.2         22.4 
3 Europe                19.6           6.81          43.7          9.69
4 North America         18.5           8.14          38.1          4.64
5 Oceania                8.05          4.69          14.3          2.45
6 South America         20.2           9.97          45.3         12.2 
Part 2: Make a box plot of pollution on the y axis and continent on the x axis. Compare your plot to your table. What do you notice?
smoking_pollution %>%
    ggplot(aes(x = continent, y = pollution)) +
    geom_boxplot()
When comparing your table to your plot, you’ll notice that the dark horizontal lines represent median values. The boxes have lengths equal to the interquartile range (IQR). And the highest and lowest values for each continent match the table as well. The plot makes it easier to see differences between continents. The table provides finer details for comparison. What you choose to report will depend on whether you want to bring attention to those finer details or whether you want to discuss overall trends.
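To see where the box in a boxplot comes from, the sketch below computes the quartiles of a small made-up vector (x is hypothetical) with base R's quantile() and checks that the distance between the 25th and 75th percentiles equals IQR(). One caveat when comparing the plot to the min and max in the table: by default, geom_boxplot() draws whiskers that extend at most 1.5 times the IQR from the box and plots more extreme values as individual points.

```r
# Hypothetical toy values
x <- c(2, 4, 6, 8, 10, 12, 14)

# 25th percentile, median, and 75th percentile
q <- quantile(x, c(0.25, 0.5, 0.75))
q

# The height of the box is the distance between the outer two values
box_height <- unname(q[3] - q[1])
all.equal(box_height, IQR(x))
```

For these values the quartiles are 5, 8, and 11, so the box would span 6 units.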

Plotting for exploratory data analysis


For our analysis, we have three questions we’d like to answer:

Is there a relationship between population and ambient pollution levels (in micrograms per cubic meter)?
Which continent has the highest pollution levels per capita?
Is there a relationship between ambient pollution levels per capita and lung cancer rates?

1) Is there a relationship between population and ambient pollution levels (in micrograms per cubic meter)?

To answer this question, we’ll plot ambient pollution levels against population using a scatter plot. It will help to put the x axis (population) on a log10 scale.

smoking_pollution %>%
  ggplot(aes(x = pop, y = pollution)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "Population", y = "Ambient pollution levels (micrograms/cubic meter)") +
  theme_bw()

We observe a positive association between ambient pollution levels and population.

To help clarify the association, we can add a fit line through the data using geom_smooth(method = "lm"). Notice we added the method = "lm" argument. This tells geom_smooth() that we would like a linear model (lm) fit to the data.

smoking_pollution %>%
  ggplot(aes(x = pop, y = pollution)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10() +
  labs(x = "Population", y = "Ambient pollution levels (micrograms/cubic meter)") +
  theme_bw() 

`geom_smooth()` using formula = 'y ~ x'

To answer our first question, we observe a positive association between population and ambient pollution. In other words, countries with higher populations tend to have higher ambient pollution levels. It is very important to remember that associations are not indicative of causality and there could be confounding variables that may be playing into this apparent relationship. Can you think of any confounding factors we haven’t accounted for?
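The fit line drawn by geom_smooth(method = "lm") can also be examined numerically with base R's lm() function. Because the x axis is on a log10 scale, the line in the plot corresponds to a model of pollution against log10(pop). The sketch below uses a made-up data frame (toy is hypothetical, with values chosen so the fit is exact) rather than the workshop data.

```r
# Hypothetical toy data: pollution rises by 10 for every tenfold increase in population
toy <- data.frame(
  pop       = c(1e4, 1e5, 1e6, 1e7),
  pollution = c(10, 20, 30, 40)
)

# Fit a straight line to pollution as a function of log10(population)
fit <- lm(pollution ~ log10(pop), data = toy)

# Intercept and slope of the fitted line
coef(fit)
```

The slope tells us how much pollution changes, on average, for each tenfold increase in population. In this toy example it is 10, with an intercept of -30.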

Challenge: 2) Which continent has the highest pollution levels per capita?

To answer this question, we need to calculate the pollution levels per capita for each country using mutate(). Then make a boxplot to look at these levels by continent. Hint: it may help to use a log10 scale for the y axis.
Solution:
smoking_pollution %>%
  mutate(pollution_capita = pollution/pop) %>%
  ggplot(aes(x = continent, y = pollution_capita)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(y = "Pollution (micrograms/cubic meter) per capita")+
  theme_bw()
Which continent has the highest pollution levels per capita? What other factors do you think could be driving this observation?

Challenge: 3) Is there a relationship between ambient pollution levels per capita and lung cancer rates?

To answer this question, let’s make a scatter plot with ambient pollution levels per capita on the x axis and lung cancer rates on the y axis. Hint: Make sure to use a log10 scale for the x axis.
Solution:
smoking_pollution %>%
  mutate(pollution_capita = pollution/pop) %>%
  ggplot(aes(x = pollution_capita, y = lung_cancer_pct)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "Pollution (micrograms/cubic meter) per capita", y = "Percent of people with lung cancer")+
  theme_bw()
There does not appear to be a clear relationship between pollution levels per capita and lung cancer rates.

Bonus content


Calculating percentages


Finding percentages using dplyr can be a little bit complicated. However, it’s a very useful skill! We’ve included an exercise here that provides an example of how to calculate percentages.

Percentages

What percentage of the global population in 1990 did Africa make up? What percentage of the population in Africa did Kenya make up?

Solution

Create a new variable using group_by() and mutate() that calculates percentages for the pop variable.

smoking_pollution %>%
  mutate(total_pop = sum(pop)) %>% #total_pop is the global population
  group_by(continent) %>%  #grouping by continent allows us to calculate the population on each continent
  mutate(cont_pop = sum(pop), #cont_pop is the continental population
         cont_percent = cont_pop/total_pop * 100, #cont_percent is the percent of the global population for the continent
         country_cont_pct = pop/cont_pop * 100) %>% #country_cont_pct is the percent of the continent population for a given country
  select(country, continent, cont_percent, country_cont_pct) %>%
  filter(country == "Kenya")

# A tibble: 1 × 4
# Groups:   continent [1]
  country continent cont_percent country_cont_pct
  <chr>   <chr>            <dbl>            <dbl>
1 Kenya   Africa            12.1             3.77

This table shows that Kenya makes up about 4% of the population of Africa, and Africa makes up about 12% of the global population (here, the total population of the countries in our dataset).
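The same pattern is easier to check by hand on a tiny example. The sketch below assumes a made-up tibble (toy is hypothetical) with three countries on two continents and a total population of 100, so each percentage can be verified mentally.

```r
library(dplyr)

# Hypothetical toy data: total population is 100
toy <- tibble(
  country   = c("A", "B", "C"),
  continent = c("X", "X", "Y"),
  pop       = c(30, 10, 60)
)

toy_pct <- toy %>%
  mutate(total_pop = sum(pop)) %>%                  # 100 for every row
  group_by(continent) %>%
  mutate(cont_pop = sum(pop),                       # X: 40, Y: 60
         cont_percent = cont_pop / total_pop * 100, # share of the total held by the continent
         country_cont_pct = pop / cont_pop * 100) %>% # share of the continent held by the country
  ungroup()
toy_pct
```

Country A holds 75% of continent X, and continent X holds 40% of the total. The order of the steps matters: total_pop is computed before group_by(), so it covers all rows, while cont_pop is computed after, so it covers only the rows in each group.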

Changing the shape of the data


Data comes in many shapes and sizes, and one way we classify data is either “wide” or “long.” Data that is “long” has one row per observation. The smoking data is in a long format. We have one row for each country for each year and each different measurement for that country is in a different column. We might describe this data as “tidy” because it makes it easy to work with ggplot2 and dplyr functions (this is where the “tidy” in “tidyverse” comes from). As tidy as it may be, sometimes we may want our data in a “wide” format. Typically in “wide” format each row represents a group of observations and each value is placed in a different column rather than a different row. For example, let’s read in a smoking and lung cancer data set that covers many years and take a look at it:

smoking_cancer <- read_csv("data/smoking_cancer.csv")

Rows: 5719 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, pop, smoke_pct, lung_cancer_pct

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

It has one row for each country for each year the data were collected. But maybe we want only one row per country and want to spread the percent of smokers values into different columns (one for each year).

The tidyr package contains the functions pivot_wider() and pivot_longer() that make it easy to switch between the two formats. The tidyr package is included in the tidyverse package so we don’t need to do anything to load it.

Let’s create a wide version of our data using pivot_wider():

smoking_cancer %>%
  group_by(country, continent, year) %>% 
  summarise(smoke_pct = mean(smoke_pct)) %>%
  pivot_wider(names_from = year, values_from = smoke_pct)

`summarise()` has grouped output by 'country', 'continent'. You can override
using the `.groups` argument.

# A tibble: 191 × 32
# Groups:   country, continent [191]
   country     continent `1990` `1991` `1992` `1993` `1994` `1995` `1996` `1997`
   <chr>       <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghanistan Asia        3.12   3.29   3.53   3.77   4.00   4.25   4.53   4.82
 2 Albania     Europe     24.2   24.1   24.0   24.0   23.9   23.8   23.8   23.9 
 3 Algeria     Africa     18.9   18.7   18.6   18.4   18.2   18.0   17.8   17.6 
 4 Andorra     Europe     36.6   36.5   36.3   36.2   35.9   35.5   35.2   34.9 
 5 Angola      Africa     12.5   12.4   12.2   12.0   11.8   11.5   11.3   11.0 
 6 Antigua an… North Am…   6.80   6.94   7.06   7.17   7.27   7.36   7.43   7.47
 7 Argentina   South Am…  30.4   30.1   29.9   29.7   29.5   29.4   29.2   29.1 
 8 Armenia     Europe     30.5   30.4   30.2   30.0   29.9   29.7   29.6   29.5 
 9 Australia   Oceania    29.3   28.8   28.4   27.9   27.5   27.0   26.4   25.9 
10 Austria     Europe     35.4   35.8   36.2   36.6   37.0   37.4   37.8   38.1 
# ℹ 181 more rows
# ℹ 22 more variables: `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>,
#   `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>,
#   `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
#   `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
#   `2017` <dbl>, `2018` <dbl>, `2019` <dbl>

Notice here that we tell pivot_wider() to take the names of the new columns from the year variable (names_from) and the values to populate those columns from the smoke_pct variable (values_from). (Again, neither of these has to be in quotes in the code when there are no special characters or spaces - certainly an incentive not to use special characters or spaces!) The resulting table has one new column per year, filled with the smoke_pct values, while the remaining variables (country and continent) define the rows.
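Going in the other direction, from wide back to long, uses pivot_longer(). This is a minimal sketch on a made-up wide tibble (wide_toy is hypothetical) rather than the workshop data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in wide format: one column per year
wide_toy <- tibble(
  country = c("A", "B"),
  `1990`  = c(10, 20),
  `1991`  = c(11, 19)
)

long_toy <- wide_toy %>%
  pivot_longer(cols = -country,           # pivot every column except country
               names_to = "year",         # old column names go into "year"
               values_to = "smoke_pct") %>% # old cell values go into "smoke_pct"
  mutate(year = as.numeric(year))         # column names come back as text
long_toy
```

The result has one row per country per year, which is the long format we started from.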

Plotting wide data


Let’s make a plot with our wide data comparing percent of smokers in 1990 to percent of smokers in 2010 to see how it has changed for each country.

smoking_cancer %>%
  group_by(country, continent, year) %>% 
  summarise(smoke_pct = mean(smoke_pct)) %>%
  pivot_wider(names_from = year, values_from = smoke_pct) %>% 
  ggplot(aes(x = 1990, y = 2010)) +
  geom_point()

`summarise()` has grouped output by 'country', 'continent'. You can override
using the `.groups` argument.

Hmm that’s not what we want. ggplot just plotted the numbers 1990 and 2010 instead of the data from the years. That’s because it evaluates those as numbers instead of column names. To fix this, we can add a prefix to the years in pivot_wider():

smoking_cancer %>%
  group_by(country, continent, year) %>% 
  summarise(smoke_pct = mean(smoke_pct)) %>%
  pivot_wider(names_from = year, values_from = smoke_pct, names_prefix = 'y') %>% 
  ggplot(aes(x = y1990, y = y2010)) +
  geom_point()

`summarise()` has grouped output by 'country', 'continent'. You can override
using the `.groups` argument.
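As a side note, adding a prefix is not the only option. Column names that start with a number can also be referred to by wrapping them in backticks, which tells R to treat them as names rather than numbers. The sketch below uses a made-up wide tibble (wide_toy is hypothetical).

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with year columns that have no prefix
wide_toy <- tibble(
  country = c("A", "B"),
  `1990`  = c(10, 20),
  `2010`  = c(8, 25)
)

# Backticks let us use the year columns in calculations...
wide_toy <- wide_toy %>%
  mutate(diff = `2010` - `1990`)

# ...and in aes()
p <- ggplot(wide_toy, aes(x = `1990`, y = `2010`)) +
  geom_point()
```

Both approaches work. The prefix is used in the rest of this lesson because names like y1990 are easier to type and harder to get wrong.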

Alright, now we have a plot with the mean percent of smokers in 1990 on the x axis and the mean percent of smokers in 2010 on the y axis, and each point represents a country. However, the different ranges on the x and y axis make it hard to compare the points. Let’s fix that by adding a line at y=x.

smoking_cancer %>%
  group_by(country, continent, year) %>% 
  summarise(smoke_pct = mean(smoke_pct)) %>%
  pivot_wider(names_from = year, values_from = smoke_pct, names_prefix = 'y') %>% 
  ggplot(aes(x = y1990, y = y2010)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1)

`summarise()` has grouped output by 'country', 'continent'. You can override
using the `.groups` argument.

It seems like in most countries the percent of smokers has decreased from 1990 to 2010, since most of the points fall below the line y = x. However, there are some countries where smoking has increased (i.e. the points are above the line y = x). Let’s figure out which those are!

Bonus: Identifying countries with more smokers in 2010 than 1990

Use what you’ve learned from today to figure out which countries had higher smoking percentage in 2010 than 1990.

Bonus: Order the data frame from greatest to smallest difference. HINT: The arrange() function can help you do this.

Solution

smoking_cancer %>%
  group_by(country, continent, year) %>% # group by the columns you want to keep
  summarise(smoke_pct = mean(smoke_pct)) %>% # summarise to get one value per country per year
  pivot_wider(names_from = year, values_from = smoke_pct, names_prefix = 'y') %>% # pivot wider
  mutate(diff = y2010 - y1990) %>% # find the difference between the years of interest
  select(country, continent, diff) %>% # select the columns of interest
  filter(diff > 0) %>% # filter to ones where the difference is greater than zero
  arrange(-diff) # bonus - arrange by diff, highest to lowest

`summarise()` has grouped output by 'country', 'continent'. You can override
using the `.groups` argument.

# A tibble: 42 × 3
# Groups:   country, continent [42]
   country                continent  diff
   <chr>                  <chr>     <dbl>
 1 Bosnia and Herzegovina Europe     9.33
 2 Lebanon                Asia       7.91
 3 Afghanistan            Asia       6.08
 4 Albania                Europe     5.23
 5 Indonesia              Asia       5.07
 6 Saudi Arabia           Asia       4.21
 7 Uzbekistan             Asia       4.04
 8 Kiribati               Oceania    3.68
 9 Mali                   Africa     3.03
10 Djibouti               Africa     2.83
# ℹ 32 more rows
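
A note on the summarise() message that keeps appearing: it is informational, not an error. If you would rather not see it, one option is to tell summarise() what to do with the grouping yourself. The sketch below uses made-up data (the tibble toy is hypothetical) and assumes you want no grouping left on the result.

```r
library(dplyr)
library(tibble)

# Made-up data: two measurements per country in the same year
toy <- tibble(
  country   = c("A", "A", "B", "B"),
  continent = c("X", "X", "Y", "Y"),
  year      = c(1990, 1990, 1990, 1990),
  smoke_pct = c(30, 32, 20, 22)
)

# .groups = "drop" returns an ungrouped tibble and silences the message
toy_summary <- toy %>%
  group_by(country, continent, year) %>%
  summarise(smoke_pct = mean(smoke_pct), .groups = "drop")

toy_summary
```

With the grouping dropped, the "# Groups:" line also disappears from the printed tibble.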

Additional practice

Let’s go back to the very first question we talked about today: Is there a relationship between population and ambient pollution levels (in micrograms per cubic meter)? In addition to making a scatterplot, another way to get at this question is by calculating a correlation coefficient. We will cover two correlation coefficients here: Pearson’s (which assumes a linear relationship) and Spearman’s (which doesn’t assume a linear relationship).

There is a function in base R that calculates correlation coefficients (cor()), but it is kind of hard to use with the tidy way that we’re used to doing things. So we’re going to install another package, called corrr (that’s not a typo - there are actually 3 r’s at the end), that is designed to work with the tidyverse (it belongs to the related tidymodels collection) but is not one of the core packages we installed originally. This package has a function called correlate() that makes it easy to find correlations between variables.
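
To get a feel for how the two coefficients differ before trying the exercise, here is a small sketch using base R’s cor() on made-up vectors (x and y are invented for illustration). The relationship is perfectly increasing but not a straight line.

```r
# Made-up data: y always increases with x, but along a curve
x <- 1:10
y <- x^3

# Pearson measures how well a straight line describes the relationship
pearson <- cor(x, y, method = "pearson")

# Spearman works on ranks, so any consistently increasing relationship scores 1
spearman <- cor(x, y, method = "spearman")

pearson
spearman
```

Here Spearman’s coefficient is 1 while Pearson’s is below 1, because the points do not fall on a straight line.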

Take the following steps to calculate the Pearson and Spearman correlations between population and ambient pollution levels:

Install and load the corrr package.
Subset smoking_pollution to the population and ambient pollution level columns.
Find the Pearson correlation between population and pollution using the correlate() function from the corrr package.
Find the Spearman correlation between population and pollution using the correlate() function from the corrr package.

Solution

# install.packages('corrr') # only run this once
library(corrr)
smoking_pollution %>%
  select(pop, pollution) %>%
  correlate(method = 'pearson')

Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'

# A tibble: 2 × 3
  term         pop pollution
  <chr>      <dbl>     <dbl>
1 pop       NA         0.183
2 pollution  0.183    NA    

smoking_pollution %>%
  select(pop, pollution) %>%
  correlate(method = 'spearman')

Correlation computed with
• Method: 'spearman'
• Missing treated using: 'pairwise.complete.obs'

# A tibble: 2 × 3
  term         pop pollution
  <chr>      <dbl>     <dbl>
1 pop       NA         0.310
2 pollution  0.310    NA    

Additional practice

Remember that we made a scatter plot of year vs. population, separated into a plot for each continent, and that it had 2 outliers:

library(tidyverse)
smoking <- read_csv('data/smoking_cancer.csv')

smoking %>% 
  ggplot(aes(x=year,y=pop)) +
  geom_point() +
  facet_wrap(vars(continent))

Write some code to figure out which countries these are (even if you already know!).

Solution
smoking %>% filter(pop > 5e8) %>% select(country) %>% distinct()
# A tibble: 2 × 1
  country
  <chr>  
1 China  
2 India  
Here we used the distinct() function, which we first saw yesterday. This function is not required to find the answer to this question, but it helps us get the answer a bit more quickly.

Next, plot year vs. population separated into a plot for each continent but excluding the 2 outlier countries. Note that usually you don’t want to exclude certain data points from a plot because it is misleading (see Bonus 2 for an alternative).

Solution
smoking %>%
  filter(country != 'China') %>%
  filter(country != 'India') %>%
  ggplot(aes(x=year,y=pop)) +
  geom_point() +
  facet_wrap(vars(continent))
Another solution is to use only one filter command and separate the two true/false statements with an ampersand (&) or comma (,), which means that you want to exclude both China and India:
smoking %>%
  filter(country != 'China' & country != 'India') %>%
  ggplot(aes(x=year,y=pop)) +
  geom_point() +
  facet_wrap(vars(continent))
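
If the list of countries to exclude grows, chaining != comparisons gets tedious. One alternative, not covered in this lesson, is the base R %in% operator. The sketch below uses made-up data; the tibble toy, its population values, and the vector name outliers are all hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up data for illustration only
toy <- tibble(
  country = c("China", "India", "Kenya", "Peru"),
  pop     = c(1.1e9, 8.7e8, 2.3e7, 2.2e7)
)

# Countries to leave out, stored once so the list is easy to extend
outliers <- c("China", "India")

# Keep the rows whose country is NOT in the outliers vector
toy_filtered <- toy %>%
  filter(!country %in% outliers)

toy_filtered
```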

Bonus 1: Instead of hard-coding the two countries to remove them, remove the two outliers by combining your solutions to the first two questions.

Solution

smoking %>%
  filter(pop < 5e8) %>%
  ggplot(aes(x=year,y=pop)) +
  geom_point() +
  facet_wrap(vars(continent))

Bonus 2: How can you make the differences between countries more visible on the plot without excluding the two countries you identified above?

Solution

You can scale the y axis using a log10 scale to make the differences more visible:
smoking %>%
  ggplot(aes(x=year,y=pop)) +
  geom_point() +
  scale_y_log10() +
  facet_wrap(vars(continent))

Applying it to your own data

Continue working on your project. Now you can generate some summary statistics as well!

Key Points

Package loading is an important first step in preparing an R environment.

There are many useful functions in the tidyverse packages that can aid in data analysis.

Writing Reports with R Markdown

Overview

Teaching: 75 min
Exercises: 15 min

Questions

How can I make reproducible reports using R Markdown?

How do I format text using Markdown?

Objectives

To create a report in R Markdown that combines text, code, and figures.

To use Markdown to format our report.

To understand how to use R code chunks to include or hide code, figures, and messages.

To be aware of the various report formats that can be rendered using R Markdown.

R for data analysis review
What is R Markdown and why use it?
Creating an R Markdown file
Basic components of R Markdown
- Header
- Code chunks
- Text
Starting the report
Customizing the report
- Table format
- Messages
- Code
- Bonus
Formatting
- Headers
- Lists
Applying it to your own data

R for data analysis review

Review: Creating summaries

Read in the 1990 smoking dataset and find the mean, median, min, and max population for each continent.

Solution

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

smoking <- read_csv('data/smoking_cancer_1990.csv')

Rows: 191 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, pop, smoke_pct, lung_cancer_pct

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

smoking %>% 
  group_by(continent) %>% 
  summarise(mean_pop = mean(pop),
            median_pop = median(pop),
            min_pop = min(pop),
            max_pop = max(pop))

# A tibble: 6 × 5
  continent      mean_pop median_pop min_pop    max_pop
  <chr>             <dbl>      <dbl>   <dbl>      <dbl>
1 Africa        11655968.   6788686.   69507   95212454
2 Asia          75684706.  12446168   223159 1135185000
3 Europe        13056650.   5140939    24124   79433029
4 North America 18257642.   2470946    40260  249623000
5 Oceania        1907505     121440.    8910   17065100
6 South America 24606701.  11752774   405169  149003225

What is R Markdown and why use it?

Recall that our goal is to generate a report on how a country’s smoking rate is related to its lung cancer rate.

Discussion

How do you usually share data analyses with your collaborators?

Solution

Many people share them through a Word or PDF document, a spreadsheet, slides, a graphic, etc.

In R Markdown, you can incorporate ordinary text (e.g., experimental methods, analysis and discussion of results) alongside code and figures! (Some people write entire manuscripts in R Markdown.) This is useful for writing reproducible reports and publications, sharing work with collaborators, writing up homework, and keeping a bioinformatics notebook. Because the code is embedded in the document, the tables and figures are reproducible: anyone can run the code and get the same results. If you find an error or want to add more to the report, you can just re-run the document and you’ll have updated tables and figures! This concept of combining text and code is called “literate programming”. To do this we use R Markdown, which combines Markdown (a simple syntax for formatting plain text) with R. You can output an HTML, PDF, or Word document that you can share with others. In fact, this webpage is an example of a rendered R Markdown file!

(If you are familiar with Jupyter notebooks in the Python programming environment, R Markdown is R’s equivalent of a Jupyter notebook.)

Creating an R Markdown file

Now that we have a better understanding of what we can use R Markdown files for, let’s start writing a report!

To create an R Markdown file:

Open RStudio
Go to File → New File → R Markdown
Give your document a title, something like “The relationship between smoking and lung cancer rates” (Note: this is not the same as the file name - it’s just a title that will appear at the top of your report)
Keep the default output format as HTML
Note: R Markdown files always end in .Rmd

R Markdown Outputs

The default output for an R Markdown report is HTML, but you can also use R Markdown to output other report formats. For example, you can generate PDF reports using R Markdown, but you must install TeX to do this.

Basic components of R Markdown

The first part of every R markdown file is a header at the top of the file between the lines of ---. This contains instructions for R to specify the type of document to be created and options to choose (ex., title, author, date). These are in the form of key-value pairs (key: value; YAML).

Here’s an example:

---
title: 'The relationship between smoking and lung cancer rates'
author: "Zena Lapp"
date: "July 14, 2022"
output: html_document
---

Code chunks

The next section is a code chunk, or embedded R code, that sets up options for all code chunks. Here is the default when you create a new R Markdown file:

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

All code chunks have this format:

```{r}
# Your code here
```

All of the code is enclosed in 3 backticks (```), and the {r} part indicates that it’s a chunk of R code.

You can also include other information within the curly brackets to indicate different information about that code chunk. For instance, the first code block is named “setup”, and include=FALSE prevents code and results from showing up in the output file.

Inside the code chunk, you can put any R code that you want to run, and you can have as many code chunks as you want in your file.

As we mentioned above, in the first code chunk you set options for the entire file. echo = TRUE means that you want your code to be shown in the output file. If you change this to echo = FALSE, then the code will be hidden and only the output of the code chunks will be seen in the output file. There are also many other options that you can change, but we won’t go into those details in this workshop.
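
As a rough sketch of what changing these defaults looks like, the lines below set two options at once and then read them back. In a real report this would sit inside the setup chunk; the choice of message = FALSE as a second option is only an example.

```r
library(knitr)

# Would normally live in the setup chunk at the top of the .Rmd file:
# hide code and messages in every chunk unless a chunk says otherwise
opts_chunk$set(echo = FALSE, message = FALSE)

# Read the defaults back to confirm what is currently set
opts_chunk$get("echo")
opts_chunk$get("message")
```

Options written inside the curly brackets of an individual chunk still take precedence over these defaults for that chunk.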

Text

Finally, you can include text in your R Markdown file. This is any text or explanation you want to include, and it’s formatted with Markdown. We’ll learn more about Markdown formatting soon!

Starting the report

Let’s return to the new R Markdown file you created and delete everything below the setup code chunk. (That stuff is just examples and reminders of how to use R Markdown.)

Next, let’s save our R markdown file to the reports directory. You can do this by clicking the save icon in the top left or using control + s (command + s on a Mac).

Change knit directory

There’s one other thing that we need to do before we get started with our report. To render our documents into HTML format, we can “knit” them in RStudio. Usually, R Markdown renders documents from the directory where the document is saved (the location of the .Rmd file), but we want it to render from the main project directory where our .Rproj file is. This is because that’s where all of our relative paths start from, and it’s good practice to have all of your relative paths start from the main project directory. To change this default, click on the down arrow next to the “Knit” button at the top left of RStudio, go to “Knit Directory” and click “Project Directory”. Now it will assume all of your relative paths for reading and writing files are from the main project directory, rather than the reports directory.
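
The menu setting above lives in RStudio. If you ever knit outside RStudio, a similar effect can be sketched in code with knitr’s root.dir option, placed in the setup chunk. This is an alternative we do not use in this workshop. It assumes the .Rmd file sits in a folder such as reports, one level below the project directory, and project_dir is a hypothetical variable name.

```r
library(knitr)

# Assumption: the .Rmd file is one level below the project directory,
# so ".." points at the project directory
project_dir <- normalizePath("..")

# Ask knitr to evaluate the code chunks that follow from that directory
opts_knit$set(root.dir = project_dir)
```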

Now that we have that set up, let’s start on the report!

Add code

We’re going to use the code you generated yesterday to plot smoking rates vs. lung cancer rates to include in the report. Recall that we needed a couple of R packages to generate these plots. We can create a new code chunk to load the needed packages. You could also include this in the previous setup chunk; it’s up to your personal preference. To create a new code chunk, you have several options: type it out yourself, click the button with the green c and + in the top right, next to Run, or use the keyboard shortcut Ctrl + Alt + i (Cmd + Option + i on a Mac).

```{r packages}
library(tidyverse)
```

Now, in a real report this is when we would type out the background and purpose of our analysis to provide context to our readers. However, since writing is not a focus of this workshop we will avoid lengthy prose and stick to short descriptions. You can copy the following text into your own report below the package code chunk.

This report was prepared to analyze the relationship between a country's lung cancer rate, smoking rate, and air pollution. Our goal is to determine to what degree the percent of people who smoke and the amount of air pollution per capita may be related to its lung cancer rate. We hypothesize that lung cancer rates increase with both percent of people who smoke and the amount of air pollution per capita.

Now, since we want to show our results comparing smoking rate and lung cancer rate by country, we need to read in this data so we can generate our plot. We will add another code chunk to prepare the data.

```{r data}
smoking <- read_csv("data/smoking_cancer.csv")
```

Plot

Now that we have the data, we need to produce the plot. Let’s create it using the most recent year in our dataset:

```{r smoking_cancer}
smoking %>%
  filter(year == max(year)) %>% 
  ggplot() + 
  aes(x = smoke_pct, y = lung_cancer_pct, color=continent, size=pop/1000000) +
  geom_point() +
  labs(x = "Percent of people who smoke", y = "Percent of people with lung cancer",
       title= "Are lung cancer rates associated with smoking rates?", size="Population (in millions)")
```

Table

Let’s say we also want to include a table in our report that summarizes the minimum, median, and maximum smoker percent.

```{r}
smoking %>% 
  summarize(min_smoke = min(smoke_pct),
            median_smoke = median(smoke_pct),
            max_smoke = max(smoke_pct))
```

Knitting

Now we can knit our document to see how our report looks! Use the knit button in the top left of the screen.

Amazing! We’ve created a report!

It’s looking pretty good, but there seem to be a few extra bits that we might not need in the report. For example, what if we want to make a report that doesn’t print out all of the messages from the tidyverse? Or a report that doesn’t show the code? And the table is a bit ugly. Let’s make things a bit prettier.

Customizing the report

Table format

We can make the table prettier using the R function kable(). We can give the kable() function a tibble and it will format it as a nice-looking table in the report. The kable() function comes from the knitr package, so what do we have to do before using the function? We have to load the knitr library.

# load library
library(knitr)

# print kable
smoking %>% 
   summarize(min_smoke = min(smoke_pct),
             median_smoke = median(smoke_pct),
             max_smoke = max(smoke_pct)) %>%
  kable()

min_smoke	median_smoke	max_smoke
3.118639	24.36522	46.91547
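
The table still shows many decimal places and the raw column names. As a sketch of how it could be tidied further, kable() accepts digits and col.names arguments. The values below are copied from the table above into a small data frame so the example runs on its own; the name toy_summary and the new column labels are only for illustration.

```r
library(knitr)

# Summary values copied from the table above
toy_summary <- data.frame(
  min_smoke    = 3.118639,
  median_smoke = 24.36522,
  max_smoke    = 46.91547
)

# Round to one decimal place and use reader-friendly column names
toy_table <- kable(
  toy_summary,
  digits = 1,
  col.names = c("Minimum", "Median", "Maximum")
)

toy_table
```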

Messages

How do we get rid of the tidyverse messages? One way to do this is by saying include = FALSE in the curly brackets for that code chunk:

```{r packages, include=FALSE}
library(tidyverse)
```

This will get rid of the code and the corresponding messages. But what if we want to include the code so that people know we loaded the tidyverse? In this case, we can say message=FALSE inside the curly brackets:

```{r packages, message=FALSE}
library(tidyverse)
```

Code

We can also see the code that was used to generate the plot. Depending on the purpose and audience for your report, you may or may not want to include the code. If you don’t want the code to appear, how can you prevent it?

Code chunk options

Which of the following would lead to a report with no code chunks? For the ones that would not work, what would happen instead?

1. Add echo = FALSE inside the curly brackets of the plotting code chunk.

2. Add include = FALSE inside the curly brackets of the plotting code chunk.

3. Add echo = FALSE inside the curly brackets of the setup code chunk.

4. Change echo = TRUE to echo = FALSE in the knitr::opts_chunk$set() function in the first code chunk.

Solution

1. This would only remove the code for the plotting code chunk, but not the packages code chunk.

2. This would remove the code but also the plot from the report.

3. This would not change anything because that code chunk is already excluded because include = FALSE.

4. This would remove all code from the output file (what we want).

Formatting

We now know how to create a report with R Markdown. Maybe we also want to format the report a little bit to structure our thought process in a useful way (e.g., sections) and make it visually appealing? Markdown is like a simple programming language when it comes to syntax. Let’s try to figure out some syntax together. Suppose we wanted to create sections in our report.

Headers

To create different sections by using headers and sub-headers, you can use the # (pound/hash) sign. Our main headers have one # (e.g. # Main Header Here) and to create subheaders we add additional #s (e.g. ## First subheader and ### Second subheader).

Lists

To create a bulleted list in R Markdown, you can use the - (dash) or the * (asterisk). Create a bulleted list with three items:

* The name of your currently favorite programming language 
* The name of a function you have so far found most useful 
* One thing you want to learn next on your programming journey

Bold and italics

Use the Internet or the R Markdown reference guide to figure out how to: 1) bold the text of the first bullet point, 2) italicize the text of the second bullet point, 3) bold and italicize the text of the third bullet point.

Solution

Italics can be generated by enclosing the text in _ (single underscores) or * (single asterisks), and bold in __ (double underscores) or ** (double asterisks). To use both, use three underscores (___) or three asterisks (***) instead.

Okay, now how do you think we’d turn our bulleted list into a numbered list?

You can change the bullets to numbers with dots after them:

1. The name of your currently favorite programming language
2. The name of a function you have so far found most useful
3. One thing you want to learn next on your programming journey

Or, even better, you can just make them all 1. and markdown will be smart enough to number them in order. This is super useful if you end up wanting to add in an item in the middle of the list:

1. The name of your currently favorite programming language
1. The name of a function you have so far found most useful
1. One thing you want to learn next on your programming journey

Add to our report: association between air pollution and lung cancer

We have a pretty nice looking report, but we still haven’t included anything about the association between lung cancer and air pollution per capita. Let’s add a section to our markdown document in the following steps:

Make a new header and write a 1-2 sentence description of what you will be plotting.

Create a new code chunk

Make a plot with air pollution per capita on the x axis and lung cancer on the y axis

Make a table with summary statistics including the minimum, median, and maximum air pollution values.

BONUS: Merge the table we created earlier with the table you created here with rows for smoking and air pollution, and a column for each of the summary statistics.
Solution

One option to create a code chunk is to type it out. You can also see other options above. Then you have to read in the data and create the plot and table:
smoking_pollution <- read_csv('data/smoking_pollution.csv')
Rows: 191 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (5): year, pop, smoke_pct, lung_cancer_pct, pollution

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smoking_pollution %>%
  ggplot(aes(x = pollution/pop, y = lung_cancer_pct)) + 
  geom_point() +
  labs(x = "Ambient pollution (micrograms/cubic meter) per capita", y = "Percent of people with lung cancer",
       title = "Are lung cancer rates associated with pollution per capita?")
smoking_pollution %>%
  summarize(min_pol = min(pollution),
            median_pol = median(pollution),
            max_pol = max(pollution))
# A tibble: 1 × 3
  min_pol median_pol max_pol
    <dbl>      <dbl>   <dbl>
1    4.69       24.5    78.2
Bonus: you can use pivot_longer() and group_by() followed by summarize():
smoking_pollution %>%
  pivot_longer(c(smoke_pct, pollution)) %>% 
  group_by(name) %>% 
  summarize(min = min(value),
            median = median(value),
            max = max(value))
# A tibble: 2 × 4
  name        min median   max
  <chr>     <dbl>  <dbl> <dbl>
1 pollution  4.69   24.5  78.2
2 smoke_pct  3.12   24.4  46.9
Notice that we used c() to provide pivot_longer() with the two column names that we wanted to pivot. c() stands for “combine”; this function combines the two values into what we call a vector.
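
To make the idea of a vector a little more concrete, the sketch below builds one with c(), checks its length, and passes it to pivot_longer(). The tibble toy and the name cols_to_pivot are made up for illustration. The all_of() helper is not used in the lesson; it is included here as one way to pass a stored vector of column names.

```r
library(dplyr)
library(tidyr)
library(tibble)

# c() combines individual values into a single vector
cols_to_pivot <- c("smoke_pct", "pollution")
length(cols_to_pivot)

# Made-up data for illustration only
toy <- tibble(
  country   = c("A", "B"),
  smoke_pct = c(30, 20),
  pollution = c(12, 40)
)

# all_of() tells pivot_longer() to use the column names stored in the vector
toy_long <- toy %>%
  pivot_longer(all_of(cols_to_pivot))

toy_long
```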

Bonus: Dynamically updating text using mini R chunks

Sometimes, you want to describe your data or results (like our plot) to the audience in text but the data and results may still change as you work things out. R Markdown offers an easy way to do this dynamically, so that the text updates as your data or results change!

Say we want to get the number of countries present in our dataset. Previously, we learned about the distinct() function that returns distinct values. There’s a very similar function called n_distinct() that returns instead the number of distinct values:
n_countries <- smoking %>%
  select(country) %>%
  n_distinct()
Now, all we need to do is reference the values we just computed to describe our plot. To do this, we enclose each value in one set of backticks (`r some_R_variable_name `), where the r part once again indicates that it’s a chunk of R code. When we knit our report, R will automatically fill in the values we just created in the above code chunk. Note that R will automatically update these values every time our data changes (if we were to decide to drop or add countries to this analysis, for example).
There are `r n_countries ` countries in our dataset. 
Try knitting the document and see what happens!
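
If the difference between distinct() and n_distinct() is still fuzzy, this small sketch on made-up data may help (the tibble toy is hypothetical). The first call returns a table of the unique values, while the second returns just a number, which is what we want to drop into a sentence.

```r
library(dplyr)
library(tibble)

# Made-up data: country A appears twice
toy <- tibble(
  country = c("A", "A", "B"),
  year    = c(1990, 2010, 1990)
)

# distinct() keeps one row per unique country
unique_countries <- toy %>%
  select(country) %>%
  distinct()

# n_distinct() counts the unique countries instead
n_countries <- toy %>%
  select(country) %>%
  n_distinct()

unique_countries
n_countries
```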

Applying it to your own data

Now it’s time to merge all of your analyses with your own data together into a report. Fill out the worksheet to outline your report and then start making it!

Key Points

R Markdown is a useful way to create a report that integrates text, code, and figures.

Options such as include and echo determine what parts of an R code chunk are included in the R Markdown report.

R Markdown can render HTML, PDF, and Microsoft Word outputs.

Conclusion

Overview

Teaching: 15 min
Exercises: min

Questions

What do I do after the workshop to apply what I learned and keep learning more?

How do I deal with coding errors (i.e. debug)?

Objectives

Learn how to get help with code via the Internet and reaching out to others.

Workshop summary & moving forward
Learning more and getting help

Workshop summary & moving forward

Congratulations on completing the workshop! You learned some basic procedures for importing, managing, visualizing and reporting your data. The absolute best way to continue improving your skills is to use R in your own work, e.g. to automate a task, to analyze data, or to create reports.

Brainstorm session: how to use R as much as possible

How can you use R in your work to be able to keep improving your skills?

Solution:

Is there a task you can automate, data you wish to analyze or visualize, or reports that you need to make?

As you continue on your coding journey, you will want to learn new data processing and analysis techniques.

As we complete the course, we want to share with you some tips and tricks that have helped us on our own programming journeys.

Learning more and getting help

The Internet

The Internet is your best friend.

If you get stuck, use your favorite search engine to look for an answer to your question. An example is “how to import an excel spreadsheet into R.” Typically, you will find step-by-step online documentation that you can adapt for your own purposes.

If you want to learn something new, use your favorite search engine to look for tutorials and other resources related to the topic.

Additional coding topics

There are some coding concepts that we did not have time to cover in this workshop, but are important to learn as you continue on your journey and begin to perform more sophisticated data analysis projects. We’ve provided some links below, but feel free to search for other explanations and tutorials as well.

Lists

Functions

Conditionals

Loops and apply statements

Here is a nice tutorial on conditionals, loops, and functions all together.

Domain-specific analyses

We encourage you to investigate domain-specific packages and software that will help you perform specific tasks related to your own research. You can find these packages through:

Other people in the field.

Internet searches with keywords related to the topic of interest, including R (the programming language you’re interested in using; e.g. “find pairwise distances for DNA sequences in R”).

Reach out to others

We want to be a resource for you after the workshop ends, and we also want you all to be a resource for each other.

You can email us whenever you want with questions! If it’s a quick thing, we can figure it out over email; otherwise, we can set up a time to chat.

Here are our emails:

Zena: zena.lapp@duke.edu
Christine: christine.markwalter@duke.edu

What to include when asking for help

A brief summary of what you are trying to accomplish (your ultimate goal, distilled into one specific question).

Brief description of what you’ve tried so far.

Description of the problem you’re having - the exact code you used, the expected output, the actual error/output.

A minimal, reproducible example. Include the code and data (or made-up data) you need to reproduce the error.
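
As an illustration of the last point, here is a sketch of what a minimal, reproducible example might look like. Everything in it is made up: the tibble toy, its values, and the problem itself (a population column that was accidentally read in as text).

```r
library(tibble)

# Made-up data, just big enough to show the problem
toy <- tibble(
  country = c("A", "B"),
  pop     = c("9", "10")
)

# Expected output: 10 (the largest population)
# Actual output: "9", because pop is stored as text and compared alphabetically
actual <- max(toy$pop)
actual

# dput() prints code that recreates the data, so others can copy and paste it
dput(as.data.frame(toy))
```

Sharing the data, the exact code, and the expected and actual output together lets someone else reproduce the problem in seconds. Adding the output of sessionInfo() also tells them which versions of R and your packages you are running.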

Code club

For additional consistent support, we will be hosting a monthly virtual code club where we will discuss different coding topics and troubleshoot issues you may be having with your own data. Please let us know if you would like to participate.

Key Points

Using R regularly is the best way to improve your skills.

When it comes to trying to figure out how to code something, and debugging, Internet searching is your best friend.

Don’t be afraid to reach out to others and ask for help.

DiscovR Workshop Curriculum

Introduction to The DiscovR Workshop

Overview

Who is this workshop for?

Introducing the instructors and facilitators

Introducing participants

What will the workshop cover?

Asking questions and getting help

Other miscellaneous things

Key Points

Getting Started with R

Overview

Contents

Why learn to program?

Solution:

Introduction to R and RStudio

Bonus: But why R and not Python?

Comments

Foundational topics

Functions

Pro-tip

Position of the arguments in functions

Solution

Bonus Exercise: taking logarithms

Solution

Objects

Predicting object contents

Solution

Guidelines on naming objects

Bonus Exercise: Bad names for objects

Solution

Getting unstuck

Quotes vs. No Quotes

Glossary of terms

Key Points

R for Plotting

Overview

Contents

The “goal” of the workshop

Overview of the lesson

Exercise: Create a new R Script file

Solution

Directory structure

Exercise: File organization

Solution

Loading and reviewing data

The tidyverse vs Base R

Bonus: What’s with all those messages???

Bonus: Characters vs. factors

Data objects

Bonus Exercise: Reading in an excel file

Solution

Our first plot

Creating our first plot

What do we want to plot?

Solution

Quotes vs. No Quotes (refresher)

Mapping lung cancer rates to the y axis

Solution

Color the points by continent

Solution

Bonus Exercise: Changing colors

Solution

Changing point sizes

Solution

Bonus Exercise: Changing shapes

Solution

Storing our plot

Solution

Saving our first plot

Debugging code

Understanding common bugs

Solution

Recap of what we’ve learned so far

Pro-tip

Bonus Exercise: Make your own scatter plot

Solution

Plotting for data exploration

Discrete Plots

Plotting and interpreting box plots

Inside `aes()` or `geom`?

Bonus: Aesthetics inside the `ggplot()` function

Narrow down rows with `filter()`

Subset columns using `select()`

Exercise: use `filter()` and `select()` to narrow down our dataframe to only the `location_name` and `median` for 1990

Grouping and counting rows using `group_by()` and `count()`

Make new variables with `mutate()`