ggplot 101 recitation 🎃
Week 5
Intro
We are going to practice using ggplot today, focusing on the data, aesthetic, and geom layers. We are going to use data from the TidyTuesday project. For this recitation, we are going to use the Giant Pumpkins data which is collected from the Great Pumpkin Commonwealth.
At the end of of this module you will create of of this descriptive plots
Question: How can we replicate this plot?
Goals of this recitation
Work with real world data
- Import data from github
- Modify variables types
- Select observations with certain values
- Wrangle some more
- Practice plotting
Download data from Github
When you open the github page you will see a file called pumpkins.csv
. You also are introduced about the details of the data (i.e., variables, variable types, descriptions), as well as how to import the it.
First thing first, we are going to import the data by reading the csv file with the Github link provided. You can also read the data in by downloading it manually, saving it, and then loading it.
# load libraries
library(tidyverse)
# Import giant pumpkins data
<- readr::read_csv('WHAT-GOES-HERE??') pumpkins_raw
Once we have imported our data, how can you check it out?
glimpse(pumpkins_raw)
Rows: 28,065
Columns: 14
$ id <chr> "2013-F", "2013-F", "2013-F", "2013-F", "2013-F", "2…
$ place <chr> "1", "2", "3", "4", "5", "5", "7", "8", "9", "10", "…
$ weight_lbs <chr> "154.50", "146.50", "145.00", "140.80", "139.00", "1…
$ grower_name <chr> "Ellenbecker, Todd & Sequoia", "Razo, Steve", "Ellen…
$ city <chr> "Gleason", "New Middletown", "Glenson", "Combined Lo…
$ state_prov <chr> "Wisconsin", "Ohio", "Wisconsin", "Wisconsin", "Wisc…
$ country <chr> "United States", "United States", "United States", "…
$ gpc_site <chr> "Nekoosa Giant Pumpkin Fest", "Ohio Valley Giant Pum…
$ seed_mother <chr> "209 Werner", "150.5 Snyder", "209 Werner", "109 Mar…
$ pollinator_father <chr> "Self", NA, "103 Mackinnon", "209 Werner '12", "open…
$ ott <chr> "184.0", "194.0", "177.0", "194.0", "0.0", "190.0", …
$ est_weight <chr> "129.00", "151.00", "115.00", "151.00", "0.00", "141…
$ pct_chart <chr> "20.0", "-3.0", "26.0", "-7.0", "0.0", "-1.0", "-4.0…
$ variety <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Do some of these variables contain more than one piece of information?
- What is embedded within the variable
id
? - What type of info does
id
contain? - What types of variables are
place
andweight_lbs
? Are there any limitations to plotting these variable types?
Wrangling
Turn one character column into two ✂️
From both looking at the data, and reading about the variable id
on the documentation page, you see that it contains two type of observations. To use them separately, we need to separate this column into two columns such like year
and type
.
Try doing this with the function separate()
from the tidyr package to do this. And you will obtain the following data
%>%
pumpkins_raw separate(WHAT-GOES-HERE)
Select observations by their values 🎃
Now that you separated the year and crop type, keep only the data for Giant Pumpkins. Hint, you can use the filter()
function from the dplyr
package.
%>%
pumpkins_raw filter(...predicate/condition...)
Now that you are familiar with the filter()
, retain only the observations that were the winners or those in the first place.
%>%
pumpkins_raw filter(...predicate/condition...) %>%
filter(...predicate/condition...)
Remove pesky strings 😑
If we were to try and plot our data as it is now we would not get our desired outcome. But try it anyway.
%>%
pumpkins_raw -to-separate %>%
code-to-filter %>%
code-to-plot code
What is weird about this y-axis?
If you take a look at the variables of the weight_lbs
column, it contains commas as thousand separator. However, R does not recognize this as a number (and instead views it as a character) so and it has to be removed prior changing the column type.
For this purpose, we are going to remove this annoying character. You can use str_remove()
function from the base and stringr package respectively. Here is an example of how both functions work.
<- "700,057.58"
wrong_number wrong_number
[1] "700,057.58"
Using str_remove
::str_remove(string = wrong_number, pattern = ",") stringr
[1] "700057.58"
Remember, we don’t want to just remove the thousands place comma in one number, we want to edit the dataset to remove the comma.
In this case, you can embed str_remove()
within the mutate()
function, which can create new variables or modify existing ones. In our case, we want to modify the weight_lbs
variable.
%>%
pumpkins_raw -to-separate %>%
code-to-filter %>%
codemutate(variable = str_remove(arguments-here))
Commas, gone! 👏👏👏
Convert character to numeric 🔢
Now the comma is gone, you can simply change the variable weight_lbs
from a character to numeric, so it can be plotted like a number., to change the column type, we are going to use the as.numeric()
function. Here’s some example about how to use as.numeric()
.
<- stringr::str_remove(string = wrong_number, pattern = ",")
right_number_chr
<- as.numeric(right_number_chr)
right_number_number class(right_number_number)
[1] "numeric"
Let’s add this to our growing pipe.
%>%
pumpkins_raw -to-separate %>%
code-to-filter %>%
codemutate(variable = str_remove(arguments-here)) %>%
mutate(variable = as.numeric(arguments-here))
Plot
With the weight_lbs
variable corrected, we can re-plot.
%>%
pumpkins_raw -to-separate %>%
code-to-filter %>%
codemutate(variable = str_remove(arguments-here)) %>%
mutate(variable = as.numeric(arguments-here)) %>%
-to-plot code
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
Where are the lines?
Why do you think the lines aren’t showing up? Hint - look at what variable type year
is.
How can you fix this? Hint, you can change year
to either numeric or a date. Here are some packages that allow you to deal with dates specifically.
%>%
pumpkins_raw -to-separate %>%
code-to-filter %>%
codemutate(variable = str_remove(arguments-here)) %>%
mutate(variable = as.numeric(arguments-here)) %>%
mutate(do-something-with-your-date) %>%
-to-plot code
Playing around
Try using different geoms besides geom_point()
and geom_line()
. Which might make sense in this situation?
Can you color all the lines blue?
Can you color the data based on year?
Can you color and change shape based on country?
Can you make a plot showing the distribution of weights of all giant pumpkins entered in 2021?
Can you make a boxplot showing the distribution of weights of all giant pumpkins across all years? Also can you add all the datapoints on top of the boxplot? Is this a good idea? Might there be a better geom to use than a boxplot?