This lesson is in the early stages of development (Alpha version)

R for Data Analysis

Overview

Teaching: 75 min
Exercises: 15 min
Questions
  • How can I summarise my data in R?

  • How can R help make my research more reproducible?

Objectives
  • To become familiar with the functions of the dplyr and tidyr packages.

  • To be able to create plots and summary tables to answer analysis questions.

Contents

  1. Day 2 review
  2. Overview of the lesson
  3. Get stats fast with summarise()
  4. Plotting for exploratory data analysis
  5. Bonus content (#bonus-content)
  6. Calculating percentages
  7. Changing the shape of the data
  8. Plotting wide data
  9. Additional practice
  10. Applying it to your own data

Day 2 review

Yesterday we learned all about data cleaning, but we didn’t cover everything.

  1. Make a list of what we learned yesterday related to data cleaning.
  2. Make a list of things we didn’t learn yesterday that you sometimes have to do while data cleaning.
  3. Choose one item from the list of things we didn’t learn and use the Internet to search for a way to do it using the tidyverse.

Overview of the lesson

So far, you’ve learned how to load, plot, merge, and clean data. In the process, you’ve learned a lot of new functions which are useful for transforming data. Today, we are going to put all those new skills you learned together and learn a few new functions that will be really helpful for exploratory data analysis. We’ll start with a few examples of how plotting can be a really useful tool for exploratory data analysis. Then, we’ll learn a function that will help us get summary statistics for our data and compare those summary statistics to plots. We’ll also learn how to calculate proportions and percentages. Finally, we’ll learn how to change the shape of data to make certain analyses more straight forward. First, let’s read in the data on smoking, lung cancer rates, and pollution that we generated yesterday:

library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
smoking_pollution <- read_csv("data/smoking_pollution.csv")
Rows: 191 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (5): year, pop, smoke_pct, lung_cancer_pct, pollution

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Get stats fast with summarise()

Back to top

Let’s say we would like to know the mean (average) smoking rate in the dataset. R has a built in function function called mean() that will calculate this value for us. We can apply that function to our smoke_pct column using the summarise() function. Here’s what that looks like:

smoking_pollution %>%
    summarise(mean_smoke_pct = mean(smoke_pct))
# A tibble: 1 × 1
  mean_smoke_pct
           <dbl>
1           23.9

When we call summarise(), we can use any of the column names of our data object as values to pass to other functions. summarise() will return a new data object and our value will be returned as a column.

Note: The summarise() and summarise() functions perform identical functions.

The mean_smoke_pct= part tells summarise() to use “mean_smoke_pct” as the name of the new column. Note that you don’t have to have quotes around this new name as long as it starts with a letter and doesn’t include a space.

When you call summarise(), you can also create more than one new column. To do so, you must separate your columns with a comma. Building on the code from above, let’s add a new column that calculates the minimum and maximum percent of smokers.

smoking_pollution %>%
  summarise(mean_smoke_pct=mean(smoke_pct),
            min_smoke_pct=min(smoke_pct),
            max_smoke_pct=max(smoke_pct))
# A tibble: 1 × 3
  mean_smoke_pct min_smoke_pct max_smoke_pct
           <dbl>         <dbl>         <dbl>
1           23.9          3.12          46.9

Perhaps one of the most powerful ways to use summarise() is to combine it with group_by(). This enables you to calculate summary statistics for specific groups. For example, suppose we wanted to calculate the the mean, min, and max smoke_pct for each continent. How would you modify the code above?

smoking_pollution %>%
    group_by(continent) %>%
    summarise(mean_smoke_pct=mean(smoke_pct),
            min_smoke_pct=min(smoke_pct),
            max_smoke_pct=max(smoke_pct))
# A tibble: 6 × 4
  continent     mean_smoke_pct min_smoke_pct max_smoke_pct
  <chr>                  <dbl>         <dbl>         <dbl>
1 Africa                  15.0          3.81          31.9
2 Asia                    25.1          3.12          39.7
3 Europe                  34.2         21.8           46.1
4 North America           16.1          6.55          33.1
5 Oceania                 33.3         21.2           46.9
6 South America           24.4          8.33          39.5

Exercise: Summary stats and boxplots

Part 1: Use group_by() and summarise() to find the median, min, max, and interquartile range of lung_cancer_pct for each continent.

Part 2: Make a box plot of pollution on the y axis and continent on the x axis. Compare your plot to your table. What do you notice? Which do you think is easier to interpret?

Solution

Part 1: Use group_by() and summarise() to find the median, min, max, and interquartile range of pollution for each continent.

smoking_pollution %>%
  group_by(continent) %>%
  summarise(med_pollution = median(pollution),
min_pollution = min(pollution),
max_pollution = max(pollution),
iqr_pollution = IQR(pollution)) 
# A tibble: 6 × 5
  continent     med_pollution min_pollution max_pollution iqr_pollution
  <chr>                 <dbl>         <dbl>         <dbl>         <dbl>
1 Africa                35.7          14.2           70.6         24.1 
2 Asia                  29.4           7.63          78.2         22.4 
3 Europe                19.6           6.81          43.7          9.69
4 North America         18.5           8.14          38.1          4.64
5 Oceania                8.05          4.69          14.3          2.45
6 South America         20.2           9.97          45.3         12.2 

Part 2: Make a box plot of pollution on the y axis and continent on the x axis. Compare your plot to your table. What do you notice?

smoking_pollution %>%
    ggplot(aes(x = continent, y = pollution)) +
    geom_boxplot()

When comparing your table to your plot, you’ll notice that the dark horizontal lines represent median values. The boxes have lengths equal to the interquartile range (IQR). And the highest and lowest values for each continent match the table as well. The plot makes it easier to see differences between continents. The table provides finer details for comparison. What you choose to report will depend on whether you want to bring attention to those finer details or whether you want to discuss overall trends.

Plotting for exploratory data analysis

Back to top

For our analysis, we have three questions we’d like to answer:

  1. Is there a relationship between population and ambient pollution levels (in micrograms per cubic meter)?
  2. Which continent has the highest pollution levels per capita?
  3. Is there a relationship between ambient pollution levels per capita and lung cancer rates?

1) Is there a relationship between population and ambient pollution levels (in micrograms per cubic meter)?

To answer this question, we’ll plot ambient pollution levels against population using a scatter plot. It will help to scale the x axis (population) log 10.

smoking_pollution %>%
  ggplot(aes(x = pop, y = pollution)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "Population", y = "Ambient pollution levels (micrograms/cubic meter)", 
       size = "Population\n(millions)") +
  theme_bw()

We observe a positive association between ambient pollution levels and population.

To help clarify the association, we can add a fit line through the data using geom_smooth(method = "lm"). Notice we added the method = "lm" argument. This tells geom_smooth() that we would like a linear model (lm) fit to the data.

smoking_pollution %>%
  ggplot(aes(x = pop, y = pollution)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10() +
  labs(x = "Population", y = "Ambient pollution levels (micrograms/cubic meter)", size = "Population\n(millions)") +
  theme_bw() 
`geom_smooth()` using formula = 'y ~ x'

To answer our first question, we observe a positive association between population and ambient pollution. In other words, countries with higher populations tend to have higher ambient pollution levels. It is very important to remember that associations are not indicative of causality and there could be confounding variables that may be playing into this apparent relationship. Can you think of any confounding factors we haven’t accoutned for?

Challenge: 2) which continent has the highest pollution levels per capita?

To answer this question, we need to calculate the pollution levels per capita for each country using mutate(). Then plot a boxplot to look at these levels by continent. Hint: it may help to scale the y axis log10

Solution:

smoking_pollution %>%
  mutate(pollution_capita = pollution/pop) %>%
  ggplot(aes(x = continent, y = pollution_capita)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(y = "Pollution (micrograms/cubic meter) per capita")+
  theme_bw()

Which continent has the highest pollution levels per capita? What other factors do you think could be driving this observation?

Challenge: 3) Is there a relationship between ambient pollution levels per capita and lung cancer rates?

To answer this question, let’s make a scatter plot with ambient pollution levels on the x axis and lung cancer rates on the y axis. Hint: Make sure to scale the x-axis log10.

Solution:

smoking_pollution %>%
  mutate(pollution_capita = pollution/pop) %>%
  ggplot(aes(x = pollution_capita, y = lung_cancer_pct)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "Pollution (micrograms/cubic meter) per capita", y = "Percent of people with lung cancer")+
  theme_bw()

There does not appear to be a direct relationship between pollution and lung cancer rates.

Bonus content

Back to top

Calculating percentages

Back to top

Finding percentages using dplyr can be a little bit complicated. However, it’s a very useful skill! We’ve included an exercise here that provides an example for how to caluclate percentages.

Percentages

What percentage of the global population in 1990 did Africa make up? What percentage of the population in Africa did Kenya make up?

Solution

Create a new variable using group_by() and mutate() that calculates percentages for the pop variable.

smoking_pollution %>%
  mutate(total_pop = sum(pop)) %>% #total_pop is the global population
  group_by(continent) %>%  #grouping by continent allows us to calculate the population on each continent
  mutate(cont_pop = sum(pop), #cont_pop is the continental population
         cont_percent = cont_pop/total_pop * 100, #cont_percent is the percent of the global population for the continent
         country_cont_pct = pop/cont_pop * 100) %>% #country_cont_pct is the percent of the continent population for a given country
  select(country, continent, cont_percent, country_cont_pct) %>%
  filter(country == "Kenya")
# A tibble: 1 × 4
# Groups:   continent [1]
  country continent cont_percent country_cont_pct
  <chr>   <chr>            <dbl>            <dbl>
1 Kenya   Africa            12.1             3.77

This table shows that Kenya makes up 4% of the population of Africa, and Africa makes up 12% of the global population.

Changing the shape of the data

Back to top

Data comes in many shapes and sizes, and one way we classify data is either “wide” or “long.” Data that is “long” has one row per observation. The smoking data is in a long format. We have one row for each country for each year and each different measurement for that country is in a different column. We might describe this data as “tidy” because it makes it easy to work with ggplot2 and dplyr functions (this is where the “tidy” in “tidyverse” comes from). As tidy as it may be, sometimes we may want our data in a “wide” format. Typically in “wide” format each row represents a group of observations and each value is placed in a different column rather than a different row. For example, let’s read in a smoking and lung cancer data set that covers many years and take a look at it:

smoking_cancer <- read_csv("data/smoking_cancer.csv")
Rows: 5719 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, pop, smoke_pct, lung_cancer_pct

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

It has one row for each country for each year the data were collected. But maybe we want only one row per country and want to spread the percent of smokers values into different columns (one for each year).

The tidyr package contains the functions pivot_wider() and pivot_longer() that make it easy to switch between the two formats. The tidyr package is included in the tidyverse package so we don’t need to do anything to load it.

Let’s create a wide version of our data using pivot_wider():

smoking_cancer %>%
  group_by(country, continent, year) %>% 
  summarise(smoke_pct = mean(smoke_pct)) %>%
  pivot_wider(names_from = year, values_from = smoke_pct)
`summarise()` has grouped output by 'country', 'continent'. You can override
using the `.groups` argument.
# A tibble: 191 × 32
# Groups:   country, continent [191]
   country     continent `1990` `1991` `1992` `1993` `1994` `1995` `1996` `1997`
   <chr>       <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghanistan Asia        3.12   3.29   3.53   3.77   4.00   4.25   4.53   4.82
 2 Albania     Europe     24.2   24.1   24.0   24.0   23.9   23.8   23.8   23.9 
 3 Algeria     Africa     18.9   18.7   18.6   18.4   18.2   18.0   17.8   17.6 
 4 Andorra     Europe     36.6   36.5   36.3   36.2   35.9   35.5   35.2   34.9 
 5 Angola      Africa     12.5   12.4   12.2   12.0   11.8   11.5   11.3   11.0 
 6 Antigua an… North Am…   6.80   6.94   7.06   7.17   7.27   7.36   7.43   7.47
 7 Argentina   South Am…  30.4   30.1   29.9   29.7   29.5   29.4   29.2   29.1 
 8 Armenia     Europe     30.5   30.4   30.2   30.0   29.9   29.7   29.6   29.5 
 9 Australia   Oceania    29.3   28.8   28.4   27.9   27.5   27.0   26.4   25.9 
10 Austria     Europe     35.4   35.8   36.2   36.6   37.0   37.4   37.8   38.1 
# ℹ 181 more rows
# ℹ 22 more variables: `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>,
#   `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>, `2006` <dbl>,
#   `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
#   `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
#   `2017` <dbl>, `2018` <dbl>, `2019` <dbl>

Notice here that we tell pivot_wider() which columns to pull the names we wish our new columns to be named from the year variable, and the values to populate those columns from the smoke_pct variable. (Again, neither of which have to be in quotes in the code when there are no special characters or spaces - certainly an incentive not to use special characters or spaces!) We see that the resulting table has new columns by year, and the values populate it with our remaining variables dictating the rows.

Plotting wide data

Back to top

Let’s make a plot with our wide data comparing percent of smokers in 1990 to percent of smokers in 2010 to see how it has changed for each country.

smoking_cancer %>%
  group_by(country, continent, year) %>% 
  summarise(smoke_pct = mean(smoke_pct)) %>%
  pivot_wider(names_from = year, values_from = smoke_pct) %>% 
  ggplot(aes(x = 1990, y = 2010)) +
  geom_point()
`summarise()` has grouped output by 'country', 'continent'. You can override
using the `.groups` argument.

Hmm that’s not what we want. ggplot just plotted the numbers 1990 and 2010 instead of the data from the years. That’s because it evaluates those as numbers instead of column names. To fix this, we can add a prefix to the years in pivot_wider():

smoking_cancer %>%
  group_by(country, continent, year) %>% 
  summarise(smoke_pct = mean(smoke_pct)) %>%
  pivot_wider(names_from = year, values_from = smoke_pct, names_prefix = 'y') %>% 
  ggplot(aes(x = y1990, y = y2010)) +
  geom_point()
`summarise()` has grouped output by 'country', 'continent'. You can override
using the `.groups` argument.

Alright, now we have a plot with the mean percent of smokers in in 1990 on the x axis and the mean percent of smokers in 2010 on the y axis, and each point represents a country. However, the different ranges on the x and y axis make it hard to compare the points. Let’s fix that by adding a line at y=x.

smoking_cancer %>%
  group_by(country, continent, year) %>% 
  summarise(smoke_pct = mean(smoke_pct)) %>%
  pivot_wider(names_from = year, values_from = smoke_pct, names_prefix = 'y') %>% 
  ggplot(aes(x = y1990, y = y2010)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1)
`summarise()` has grouped output by 'country', 'continent'. You can override
using the `.groups` argument.

It seems like in most countries the percent of smokers has decreased from 1990 to 2010, since most of the points fall below the line y = x. However, there are some countries where smoking has increased (i.e. the points are above the line y = x). Let’s figure out which those are!

Bonus: Identifying countries with more smokers in 2010 than 1990

Use what you’ve learned from today to figure out which countries had higher smoking percentage in 2010 than 1990.

Bonus: Order the data frame from greatest to smallest difference. HINT: The arrange() function can help you do this.

Solution

smoking_cancer %>%
  group_by(country, continent, year) %>% # group by the columns you want to keep
  summarise(smoke_pct = mean(smoke_pct)) %>% # summarise to get one value per country per year
  pivot_wider(names_from = year, values_from = smoke_pct, names_prefix = 'y') %>% # pivot wider
  mutate(diff = y2010 - y1990) %>% # find the difference between the years of interest
  select(country, continent, diff) %>% # select the columns of interest
  filter(diff > 0) %>% # filter to ones where the difference is greater than zero
  arrange(-diff) # bonus - arrange by diff, highest to lowest
`summarise()` has grouped output by 'country', 'continent'. You can override
using the `.groups` argument.
# A tibble: 42 × 3
# Groups:   country, continent [42]
   country                continent  diff
   <chr>                  <chr>     <dbl>
 1 Bosnia and Herzegovina Europe     9.33
 2 Lebanon                Asia       7.91
 3 Afghanistan            Asia       6.08
 4 Albania                Europe     5.23
 5 Indonesia              Asia       5.07
 6 Saudi Arabia           Asia       4.21
 7 Uzbekistan             Asia       4.04
 8 Kiribati               Oceania    3.68
 9 Mali                   Africa     3.03
10 Djibouti               Africa     2.83
# ℹ 32 more rows

Additional practice

Let’s go back to the very first question we talked about today: Is there a relationship between population and ambient pollution levels (in micrograms per cubic meter)? In addition to making a scatterplot, another way to get at this question is by calculating a correlation coefficient. We will cover two correlation coefficients here: Pearson’s (which assumes a linear relationship) and Spearman’s (which doesn’t assume a linear relationship).

There is a function in base R that calculates correlation coefficients (cor()), but is kind of hard to use with the tidy way that we’re used to doing things. So we’re going to download another package that’s part of the tidyverse, but not the core set of packages that we downloaded originally, called corrr (that’s not a typo - there are actually 3 r’s at the end). This package has a function called correlate() that makes it easy to find correlations between variables.

Take the following steps to calculate the Pearson and Spearman correlations between population and ambient pollution levels:

  1. Install and load the corrr package.
  2. Subset smoking_pollution to the population and ambient pollution level columns.
  3. Find the Pearson correlation between smoking and pollution using the correlate() function from the corrr package.
  4. Find the Spearman correlation between smoking and pollution using the correlate() function from the corrr package.

Solution

# install.packages('corrr') # only run this once
library(corrr)
smoking_pollution %>%
  select(pop, pollution) %>%
  correlate(method = 'pearson')
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
# A tibble: 2 × 3
  term         pop pollution
  <chr>      <dbl>     <dbl>
1 pop       NA         0.183
2 pollution  0.183    NA    
smoking_pollution %>%
  select(pop, pollution) %>%
  correlate(method = 'spearman')
Correlation computed with
• Method: 'spearman'
• Missing treated using: 'pairwise.complete.obs'
# A tibble: 2 × 3
  term         pop pollution
  <chr>      <dbl>     <dbl>
1 pop       NA         0.310
2 pollution  0.310    NA    

Additional practice

Remember that we made a scatter plot of year vs. population, separated into a plot for each contient, and that it had 2 outliers:

library(tidyverse)
smoking <- read_csv('data/smoking_cancer.csv')

smoking %>% 
  ggplot(aes(x=year,y=pop)) +
  geom_point() +
  facet_wrap(vars(continent))

Write some code to figure out which countries these are (even if you already know!).

Solution

smoking %>% filter(pop > 5e8) %>% select(country) %>% distinct()
# A tibble: 2 × 1
  country
  <chr>  
1 China  
2 India  

Here we used the distinct() function, which we first saw yesterday. This function is not required to find the answer to this question, but it helps us get the answer a bit more quickly.

Next, plot year vs. population separated into a plot for each continent but excluding the 2 outlier countries. Note that usually you don’t want to exclude certain data points from a plot because it is misleading (see Bonus 2 for an alternative).

Solution

smoking %>% 
filter(country != 'China') %>% 
filter(country != 'India') %>% 
ggplot(aes(x=year,y=pop)) +
geom_point() +
facet_wrap(vars(continent))

Another solution is to use only one filter command and separate the two true/false statements with an ampersand (&) or comma (,), which means that you want to exclude both China and India:

smoking %>% 
filter(country != 'China' & country != 'India') %>% 
ggplot(aes(x=year,y=pop)) +
geom_point() +
facet_wrap(vars(continent))

Bonus 1: Instead of hard-coding the two countries to remove them, remove the two outliers by combining your solutions to the first two questions.

Solution

smoking %>% 
filter(pop < 5e8) %>% 
ggplot(aes(x=year,y=pop)) +
geom_point() +
facet_wrap(vars(continent))

Bonus 2: How can you make the differences between countries more visible on the plot without excluding the two countries you identified above?

Solution

You can scale the y axis using a log10 scale to make the differences more visible:

smoking %>% 
ggplot(aes(x=year,y=pop)) +
geom_point() +
scale_y_log10() +
facet_wrap(vars(continent))

Applying it to your own data

Continue working on your project. Now you can generate some summary statistics as well!

Key Points

  • Package loading is an important first step in preparing an R environment.

  • There are many useful functions in the tidyverse packages that can aid in data analysis.