ggplot 101 recitation 🎃

Week 5

Author

Daniel Quiroz, Jessica Cooperstone

Intro

We are going to practice using ggplot today, focusing on the data, aesthetic, and geom layers. We are going to use data from the TidyTuesday project. For this recitation, we are going to use the Giant Pumpkins data, which comes from the Great Pumpkin Commonwealth.

At the end of this module you will create one of these descriptive plots.

Question: How can we replicate this plot?

Goals of this recitation

Work with real world data

  • Import data from GitHub
  • Modify variable types
  • Select observations with certain values
  • Wrangle some more
  • Practice plotting

Illustration taken from https://www.allisonhorst.com

Download data from GitHub

When you open the GitHub page you will see a file called pumpkins.csv. The page also describes the data (i.e., variables, variable types, descriptions) and shows how to import it.

First things first: we are going to import the data by reading the csv file from the GitHub link provided. You can also read the data in by downloading it manually, saving it, and then loading it.

# load libraries
library(tidyverse)

# Import giant pumpkins data
pumpkins_raw <- readr::read_csv('WHAT-GOES-HERE??')

Once we have imported our data, how can you check it out?

glimpse(pumpkins_raw)
Rows: 28,065
Columns: 14
$ id                <chr> "2013-F", "2013-F", "2013-F", "2013-F", "2013-F", "2…
$ place             <chr> "1", "2", "3", "4", "5", "5", "7", "8", "9", "10", "…
$ weight_lbs        <chr> "154.50", "146.50", "145.00", "140.80", "139.00", "1…
$ grower_name       <chr> "Ellenbecker, Todd & Sequoia", "Razo, Steve", "Ellen…
$ city              <chr> "Gleason", "New Middletown", "Glenson", "Combined Lo…
$ state_prov        <chr> "Wisconsin", "Ohio", "Wisconsin", "Wisconsin", "Wisc…
$ country           <chr> "United States", "United States", "United States", "…
$ gpc_site          <chr> "Nekoosa Giant Pumpkin Fest", "Ohio Valley Giant Pum…
$ seed_mother       <chr> "209 Werner", "150.5 Snyder", "209 Werner", "109 Mar…
$ pollinator_father <chr> "Self", NA, "103 Mackinnon", "209 Werner '12", "open…
$ ott               <chr> "184.0", "194.0", "177.0", "194.0", "0.0", "190.0", …
$ est_weight        <chr> "129.00", "151.00", "115.00", "151.00", "0.00", "141…
$ pct_chart         <chr> "20.0", "-3.0", "26.0", "-7.0", "0.0", "-1.0", "-4.0…
$ variety           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Do some of these variables contain more than one piece of information?

  • What is embedded within the variable id?
  • What type of info does id contain?
  • What types of variables are place and weight_lbs? Are there any limitations to plotting these variable types?
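One way to explore the last question is with a tiny made-up vector. This is only a sketch: the values below are invented and are not taken from the pumpkins data. It shows that numbers stored as characters are ordered as text, which is also how a plot axis would treat them.

```r
# Toy values (hypothetical), stored as character like place and weight_lbs
as_text <- c("9", "10", "100")
class(as_text)

# Sorted as text: compared character by character, not by numeric value
sorted_text <- sort(as_text)
sorted_text

# Sorted as numbers after conversion
sorted_number <- sort(as.numeric(as_text))
sorted_number
```

Compare the two orderings and think about what that would mean for an axis built from a character column.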

Wrangling

Turn one character column into two ✂️

From both looking at the data and reading about the variable id on the documentation page, you can see that it contains two types of information. To use them separately, we need to split this column into two columns, such as year and type.

Try doing this with the function separate() from the tidyr package. You will obtain the following data:

pumpkins_raw %>%
   separate(WHAT-GOES-HERE)
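If you have not used separate() before, here is a minimal sketch on a made-up tibble. The column names code, season, and group are invented for illustration and are not the ones from the pumpkins data, so you still need to work out the arguments for your own pipe.

```r
library(tibble)
library(dplyr) # loaded for the %>% pipe
library(tidyr)

# Toy data (hypothetical): two pieces of information joined by a dash
toy <- tibble(code = c("2020-A", "2020-B", "2021-A"))

# col: the column to split, into: the new column names, sep: the delimiter
toy_split <- toy %>%
  separate(col = code, into = c("season", "group"), sep = "-")

toy_split
```

Note that both new columns are still character columns, because separate() does not convert types unless you ask it to.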

Select observations by their values 🎃

Now that you separated the year and crop type, keep only the data for Giant Pumpkins. Hint, you can use the filter() function from the dplyr package.

Illustration taken from https://www.allisonhorst.com
pumpkins_raw %>%
   filter(...predicate/condition...) 

Now that you are familiar with filter(), retain only the winners, that is, the observations in first place.

pumpkins_raw %>%
   filter(...predicate/condition...) %>%
   filter(...predicate/condition...)
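As a reminder of how conditions work inside filter(), here is a small sketch on invented data. The columns group, rank, and score are hypothetical. Note that rank is stored as character here, so it is compared against the string "1" rather than the number 1.

```r
library(tibble)
library(dplyr)

# Toy data (hypothetical)
toy <- tibble(
  group = c("A", "A", "B", "B"),
  rank  = c("1", "2", "1", "2"),
  score = c(10, 8, 7, 5)
)

# Keep rows where one condition is TRUE
only_a <- toy %>%
  filter(group == "A")

# Two chained filter() calls
a_first_chained <- toy %>%
  filter(group == "A") %>%
  filter(rank == "1")

# The same result with both conditions in one call (combined with AND)
a_first_single <- toy %>%
  filter(group == "A", rank == "1")
```

Chaining two filter() calls and passing two conditions to one call give the same rows, so use whichever you find easier to read.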

Remove pesky strings 😑

If we were to try and plot our data as it is now we would not get our desired outcome. But try it anyway.

pumpkins_raw %>%
  code-to-separate %>%
  code-to-filter %>%
  code-to-plot

What is weird about this y-axis?

If you look at the values in the weight_lbs column, you will see that they use commas as thousands separators. R does not recognize a value like this as a number (it treats it as a character string), so the comma has to be removed before changing the column type.

To do this, we are going to remove this annoying character. You can use the str_remove() function from the stringr package. Here is an example of how it works.

wrong_number <- "700,057.58"
wrong_number
[1] "700,057.58"

Using str_remove

stringr::str_remove(string = wrong_number, pattern = ",")
[1] "700057.58"

Remember, we don’t want to just remove the thousands comma from one number; we want to remove it from every value in that column of the dataset.

In this case, you can embed str_remove() within the mutate() function, which can create new variables or modify existing ones. In our case, we want to modify the weight_lbs variable.

Illustration taken from https://www.allisonhorst.com
pumpkins_raw %>%
  code-to-separate %>%
  code-to-filter %>%
  mutate(variable = str_remove(arguments-here)) 

Commas, gone! 👏👏👏
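For reference, here is the same idea as a minimal sketch on a made-up tibble. The column name amount and its values are invented for illustration.

```r
library(tibble)
library(dplyr)
library(stringr)

# Toy data (hypothetical): numbers stored as text, some with a comma
toy <- tibble(amount = c("950.5", "1,200.0", "2,624.6"))

# Overwrite the column with a version that has the comma removed
toy_clean <- toy %>%
  mutate(amount = str_remove(amount, pattern = ","))

toy_clean$amount
```

Keep in mind that str_remove() removes only the first match in each string. If a value could contain more than one comma, str_remove_all() removes all of them.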

Convert character to numeric 🔢

Now that the comma is gone, you can change the variable weight_lbs from character to numeric so it can be plotted as a number. To change the column type, we are going to use the as.numeric() function. Here is an example of how to use as.numeric().

right_number_chr <- stringr::str_remove(string = wrong_number, pattern = ",")

right_number_number <- as.numeric(right_number_chr)
class(right_number_number)
[1] "numeric"

Let’s add this to our growing pipe.

pumpkins_raw %>%
  code-to-separate %>%
  code-to-filter %>%
  mutate(variable = str_remove(arguments-here)) %>%
  mutate(variable = as.numeric(arguments-here))
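To see why the order of these two steps matters, here is a small sketch on invented values. The column name amount is hypothetical. Converting a string that still contains a comma gives NA (with a warning), while removing the comma first gives the number you expect.

```r
library(tibble)
library(dplyr)
library(stringr)

# Toy data (hypothetical)
toy <- tibble(amount = c("950.5", "1,200.0", "2,624.6"))

# Converting directly: values that contain a comma become NA
# (the warning is suppressed here only to keep the output short)
direct <- suppressWarnings(as.numeric(toy$amount))
direct

# Removing the comma first, then converting
toy_numeric <- toy %>%
  mutate(amount = str_remove(amount, pattern = ",")) %>%
  mutate(amount = as.numeric(amount))

class(toy_numeric$amount)
```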

Plot

With the weight_lbs variable corrected, we can re-plot.

pumpkins_raw %>%
  code-to-separate %>%
  code-to-filter %>%
  mutate(variable = str_remove(arguments-here)) %>%
  mutate(variable = as.numeric(arguments-here)) %>%
  code-to-plot
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

Where are the lines?

Why do you think the lines aren’t showing up? Hint - look at what variable type year is.

How can you fix this? Hint, you can change year to either numeric or a date. Here are some packages that allow you to deal with dates specifically.

pumpkins_raw %>%
  code-to-separate %>%
  code-to-filter %>%
  mutate(variable = str_remove(arguments-here)) %>%
  mutate(variable = as.numeric(arguments-here)) %>%
  mutate(do-something-with-your-date) %>%
  code-to-plot
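Here is a minimal sketch of the numeric option on made-up data. The columns season and score are invented for illustration. With a character x variable, ggplot treats every value as its own group, so geom_line() has no pair of points to connect. After converting to numeric, the x-axis is continuous and the line appears.

```r
library(tibble)
library(dplyr)
library(ggplot2)

# Toy data (hypothetical): season stored as character
toy <- tibble(
  season = c("2019", "2020", "2021"),
  score  = c(10, 12, 15)
)

# Character x: each season is its own group, so no line is drawn
p_chr <- ggplot(toy, aes(x = season, y = score)) +
  geom_line()

# Numeric x: one continuous axis, so the points are connected
toy_num <- toy %>%
  mutate(season = as.numeric(season))

p_num <- ggplot(toy_num, aes(x = season, y = score)) +
  geom_line() +
  geom_point()
```

Print p_chr and p_num to compare them. Another option, not used above, is to keep the character column and add group = 1 inside aes() so that all observations belong to a single group.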

Playing around

Try using different geoms besides geom_point() and geom_line(). Which might make sense in this situation?

Can you color all the lines blue?

Can you color the data based on year?

Can you color and change shape based on country?

Can you make a plot showing the distribution of weights of all giant pumpkins entered in 2021?

Can you make a boxplot showing the distribution of weights of all giant pumpkins across all years? Also can you add all the datapoints on top of the boxplot? Is this a good idea? Might there be a better geom to use than a boxplot?
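If you get stuck on the coloring questions, the following sketch on invented data shows the difference between setting an aesthetic to a constant and mapping it to a variable. The columns season, score, and team are hypothetical, so you will need to adapt the idea to the pumpkins variables.

```r
library(tibble)
library(ggplot2)

# Toy data (hypothetical)
toy <- tibble(
  season = c(2019, 2020, 2021, 2019, 2020, 2021),
  score  = c(10, 12, 15, 8, 9, 13),
  team   = c("x", "x", "x", "y", "y", "y")
)

# Setting: a constant color goes outside aes()
p_set <- ggplot(toy, aes(x = season, y = score, group = team)) +
  geom_line(color = "blue")

# Mapping: a variable goes inside aes(), and ggplot picks the colors and shapes
p_map <- ggplot(toy, aes(x = season, y = score, color = team, shape = team)) +
  geom_line() +
  geom_point()
```

Print p_set and p_map to see the difference: only the mapped version gets a legend.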
