# Annotating Statistics onto Plots

Week 9

## Introduction

Now that weâ€™ve spent some time going through how to make plots, today we will focus on how to annotate statistics that youâ€™ve calculated to show statistical differences, embedded within your plot. I will go over a few different ways to do this.

The purpose of todayâ€™s session is more to give you practical experience with running and retrieving statistical analysis output, than teaching about the assumptions and background of the test itself. If you are looking for a good statistics class, I would recommend Dr. Kristin Mercerâ€™s HCS 8887 Experimental Design.

### Load libraries and data

Before we get started, letâ€™s load our libraries.

We are going to use data that was collection about body characteristics of penguins on Palmer Station in Antarctica. This data is in a dataframe called `penguins`

in the package `palmerpenguins`

which you can download from CRAN.

```
# install.packages(palmerpenguins)
library(tidyverse)
library(palmerpenguins) # for penguins data
library(rstatix) # for pipeable stats testing
library(agricolae) # for posthoc tests
library(ggpubr) # extension for adding stats to plots
library(glue) # for easy pasting
```

`::kable(head(penguins)) # kable to make a pretty table knitr`

species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|

Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |

Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |

Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |

Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |

Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |

Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |

## 2 group comparisons (t-tests or similar)

Our question: Is there a significant difference in the

`body_weight_g`

of male and female penguins?

Before we run the statistics, letâ€™s make a plot to see what this data looks like.

```
# what are the values for sex?
unique(penguins$sex)
```

```
[1] male female <NA>
Levels: female male
```

```
# plot
<- penguins %>%
(penguins_by_sex drop_na(body_mass_g, sex) %>%
ggplot(aes(x = sex, y = body_mass_g, color = sex)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(height = 0, width = 0.3) +
scale_x_discrete(labels = c("Female", "Male")) +
theme_minimal() +
theme(legend.position = "none") +
labs(x = "Sex",
y = "Body Mass (g)",
title = "Body mass of penguins by sex",
subtitle = "Collected from Palmer Station, Antarctica",
caption = "Data accessed from the R package palmerpenguins"))
```

It looks like there is a difference here. Before adding the statistics to our plot, letâ€™s:

- test that our data is suitable for running the text we want
- run the statistical test separately from the plot

### Testing assumptions

Briefly, in order to use parametric procedures (like a t-test), we need to be sure our data meets the assumptions for 1) normality and 2) constant variance. This is just one way to do these tests, there are others that I am not going to go over.

#### Normality

We will test normality by the Shapiro-Wilk test using the function `rstatix::shapiro_test()`

. This function is a pipe-friendly wrapper for the function `shapiro.test()`

, which just means you can use it with pipes.

```
%>%
penguins drop_na(body_mass_g, sex) %>% # remove NAs
group_by(sex) %>% # test by sex
shapiro_test(body_mass_g) # test for normality
```

```
# A tibble: 2 Ã— 4
sex variable statistic p
<fct> <chr> <dbl> <dbl>
1 female body_mass_g 0.919 0.0000000616
2 male body_mass_g 0.925 0.000000123
```

This data is not normal, which means we need to use non-parametric tests. Since we are not meeting the assumption for nornality, really you donâ€™t need to test for constant variance, but Iâ€™ll show you how to do it anyway.

#### Constant variance

We can test for equal variance using Leveneâ€™s test, `levene_test()`

which is part of the `rstatix`

package. Again, this is a pipe-friendly wrapper for the function `levene.test()`

.

```
%>%
penguins drop_na(body_mass_g, sex) %>% # remove NAs
levene_test(body_mass_g ~ sex) # test for constant variance
```

```
# A tibble: 1 Ã— 4
df1 df2 statistic p
<int> <int> <dbl> <dbl>
1 1 331 6.06 0.0143
```

No constant variance. Double Non-parametric.

Can we visualize normality another way?

```
%>%
penguins drop_na(body_mass_g, sex) %>%
ggplot(aes(x = body_mass_g, y = sex, fill = sex)) +
::geom_density_ridges(alpha = 0.7) +
ggridgesscale_y_discrete(labels = c("Female", "Male")) +
theme_classic() +
theme(legend.position = "none") +
labs(x = "Body Mass (g)",
y = "Sex",
title = "Distribution of body weights for male and female penguins")
```

`Picking joint bandwidth of 235`

Some of these distribution are bimodal (i.e., not normal). This is likely because we have 3 different species of penguins here. You can see below that actually each species looks reasonably normal.

```
%>%
penguins drop_na(body_mass_g, sex) %>%
ggplot(aes(x = body_mass_g, fill = sex)) +
geom_histogram() +
facet_grid(cols = vars(species), rows = vars(sex)) +
theme_classic() +
theme(legend.position = "none") +
labs(x = "Body Mass (g)",
y = "Count")
```

``stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

### Non-parametric t-test

This means if we want to test for different means, we can use the Wilcoxon rank sun test, or Mann Whitney test. If your data was normal, you could just change `wilcox_test()`

to `t_test()`

and the rest would be the same.

```
%>%
penguins drop_na(body_mass_g, sex) %>%
wilcox_test(body_mass_g ~ sex,
paired = FALSE)
```

```
# A tibble: 1 Ã— 7
.y. group1 group2 n1 n2 statistic p
* <chr> <chr> <chr> <int> <int> <dbl> <dbl>
1 body_mass_g female male 165 168 6874. 1.81e-15
```

This is not surprising, that there is a significant difference in body weight between male and female penguins. We can see this clearly in our plot.

How can we add the stats to our plot?

### Plot

#### Using `stat_compare_means()`

The function `stat_compare_means()`

allows mean comparison p-values to be easily added to a ggplot.

Note, the function should look at your data and test for normality and pick the statistical test accordingly. You can see that is working in the chunk below, but I would recommend that you always do your own statistical test and make sure you plot accordingly.

```
+
penguins_by_sex stat_compare_means()
```

```
+
penguins_by_sex stat_compare_means(method = "wilcox.test")
```

#### Manually with `geom_text()`

or `annotate()`

In general, plotting using `geom_text()`

is easier, and follows classic `geom_()`

syntax (e.g., includes `aes()`

) but for some reason these donâ€™t pass as vectorized objects so sometimes it yields low quality images. Using `annotate()`

passes as vectors and thus tends to be higher quality. You can decide which you want to use depending on your purpose.

If Iâ€™m being honest, the most common way that I would add statistics to a plot if I was trying to do just a few simple plots at once, would be with `annotate()`

. I like to use `annotate()`

over `geom_text()`

or `geom_label()`

because it is vectorized and donâ€™t become low quality down the road.

With `geom_text()`

```
+
penguins_by_sex geom_text(aes(x = 2, y = 6500, label = "*"), # x, y, and label within aes()
color = "black", size = 6)
```

With `annotate()`

```
+
penguins_by_sex annotate(geom = "text", # note no aes()
x = 2, y = 6500,
label = "*",
size = 6)
```

You can also add multiple annotation layers. Iâ€™m introducing a new function here, `glue()`

which is amazing for easy syntax pasting of strings with data.

The syntax for `glue()`

is like this:

```
<- 2 + 3
x
glue("2 + 3 = {x}")
```

`2 + 3 = 5`

```
# we did this already, just assigning to object
<- penguins %>%
by_sex_pval drop_na(body_mass_g, sex) %>%
wilcox_test(body_mass_g ~ sex,
paired = FALSE)
# plot
+
penguins_by_sex ylim(2500, 7500) + # adjust the y-axis so there's space for the label
annotate(geom = "text", x = 2, y = 6500, label = "*", size = 6) +
annotate(geom = "text", x = 2, y = 7000,
label = glue("Wilcoxon signed rank test \np-value = {by_sex_pval$p}"))
```

## >2 group comparisons (ANOVA or similar)

When we are comparing means between more than 2 samples, we will have to first run a statistical test to see if there are any significant differences among our groups, and then if there are, run a post-hoc test. Before we do that, letâ€™s plot.

Are there significant differences in body mass

```
<- penguins %>%
(penguins_f_massbyspecies drop_na(body_mass_g, species, sex) %>%
filter(sex == "female") %>%
ggplot(aes(x = species, y = body_mass_g, fill = species)) +
geom_violin(outlier.shape = NA,
draw_quantiles = 0.5) + # add the median by drawing 50% quantile
::geom_dots(side = "both", color = "black", alpha = 0.5) +
ggdisttheme_minimal() +
theme(legend.position = "none") +
labs(x = "Penguin Species",
y = "Body Mass (g)",
title = "Body mass of female penguins by species",
subtitle = "Collected from Palmer Station, Antarctica",
caption = "Data accessed from the R package palmerpenguins"))
```

```
Warning in geom_violin(outlier.shape = NA, draw_quantiles = 0.5): Ignoring
unknown parameters: `outlier.shape`
```

### Testing assumptions

#### Normality

```
# testing normality by group
%>%
penguins drop_na(body_mass_g, sex) %>% # remove NAs
filter(sex == "female") %>%
group_by(species) %>% # test by species
shapiro_test(body_mass_g) # test for normality
```

```
# A tibble: 3 Ã— 4
species variable statistic p
<fct> <chr> <dbl> <dbl>
1 Adelie body_mass_g 0.977 0.199
2 Chinstrap body_mass_g 0.963 0.306
3 Gentoo body_mass_g 0.981 0.511
```

```
# testing normality across all data
%>%
penguins drop_na(body_mass_g, sex) %>% # remove NAs
filter(sex == "female") %>%
shapiro_test(body_mass_g) # test for normality
```

```
# A tibble: 1 Ã— 3
variable statistic p
<chr> <dbl> <dbl>
1 body_mass_g 0.919 0.0000000616
```

Ok looks like we have normally distributed data among the different species of female penguins.

#### Constant variance

`levene_test()`

which is part of the `rstatix`

package. Again, this is a pipe-friendly wrapper for the function `levene.test()`

.

```
%>%
penguins drop_na(body_mass_g, sex, species) %>% # remove NAs
filter(sex == "female") %>%
levene_test(body_mass_g ~ species) # test for constant variance
```

```
# A tibble: 1 Ã— 4
df1 df2 statistic p
<int> <int> <dbl> <dbl>
1 2 162 0.0357 0.965
```

We have constant variance. Along with normally distributed data, this means that we can use parametric tests. In the case of >2 samples, that would be ANOVA.

### ANOVA

The most commonly used function to run ANOVA in R is called `aov()`

which is a part of the `stats`

package that is pre-loaded with base R. So no new packages need to be installed here.

If we want to learn more about the function `aov()`

we can do so using the code below. The help documentation will show up in the bottom right quadrant of your RStudio.

`aov() ?`

We can run an ANOVA by indicating our model, and here Iâ€™m also selecting to drop the NAs for our variables of interest, and filtering within the `data =`

argument.

```
<-
aov_female_massbyspecies aov(data = penguins %>%
filter(sex == "female") %>%
drop_na(body_mass_g, species),
~ species) body_mass_g
```

Now lets look at the aov object.

`summary(aov_female_massbyspecies)`

```
Df Sum Sq Mean Sq F value Pr(>F)
species 2 60350016 30175008 393.2 <2e-16 ***
Residuals 162 12430757 76733
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We can take the output of our ANOVA and use the function `tidy()`

within the `broom`

package to turn our output into a tidy table. Here, the notation `broom::tidy()`

means I want to use the function `tidy()`

that is a part of the `broom`

package. This works even though I havenâ€™t called `library(broom)`

at the beginning of my script.

```
<- broom::tidy(aov_female_massbyspecies)
tidy_anova
::kable(tidy_anova) knitr
```

term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|

species | 2 | 60350016 | 30175008.01 | 393.2465 | 0 |

Residuals | 162 | 12430757 | 76733.07 | NA | NA |

See how this is different from just saving the ANOVA summary? Open both `anova_summary`

and `tidy_anova`

and note the differences.

`<- summary(aov_female_massbyspecies) anova_summary `

### Posthoc group analysis

Now that we see we have a significant difference somewhere in the body mass of the 3 species of female penguins, we can do a posthoc test to see which groups are significantly different. We will do our post-hoc analysis using Tukeyâ€™s Honestly Significant Difference test and the function `HSD.test()`

which is a part of the useful package `agricolae`

.

```
<- HSD.test(aov_female_massbyspecies,
tukey_massbyspecies trt = "species",
console = TRUE) # prints the results to console
```

```
Study: aov_female_massbyspecies ~ "species"
HSD Test for body_mass_g
Mean Square Error: 76733.07
species, means
body_mass_g std r Min Max
Adelie 3368.836 269.3801 73 2850 3900
Chinstrap 3527.206 285.3339 34 2700 4150
Gentoo 4679.741 281.5783 58 3950 5200
Alpha: 0.05 ; DF Error: 162
Critical Value of Studentized Range: 3.345258
Groups according to probability of means differences and alpha level( 0.05 )
Treatments with the same letter are not significantly different.
body_mass_g groups
Gentoo 4679.741 a
Chinstrap 3527.206 b
Adelie 3368.836 c
```

Like we did with the aov object, you can also look at the resulting HSD.test object (here, `tukey_massbyspecies`

) in your environment pane.

Here, instead of using the `broom`

package, you can convert the part of the `tukey_bill_length`

object that contains the post-hoc groupings into a dataframe using `as.data.frame()`

.

```
<- as.data.frame(tukey_massbyspecies$groups)
tidy_tukey
tidy_tukey
```

```
body_mass_g groups
Gentoo 4679.741 a
Chinstrap 3527.206 b
Adelie 3368.836 c
```

### Plot

#### Using `stat_compare_means()`

```
+
penguins_f_massbyspecies stat_compare_means()
```

```
+
penguins_f_massbyspecies stat_compare_means(method = "anova")
```

#### Manually with `geom_text()`

or `annotate()`

In general, plotting using `geom_text()`

is easier, and follows classic `geom_()`

syntax (e.g., includes `aes()`

) but for some reason these donâ€™t pass as vectorized objects so sometimes it yields low quality images. Using `annotate()`

passes as vectors and thus tends to be higher quality. You can decide which you want to use depending on your purpose.

We want to add the letters to this plot, so we can tell which groups of penguin `species`

are significantly different.

Before we can do this, we will need to do some of everyoneâ€™s favorite task, wrangling. We are going to figure out what the maximum `body_mass_g`

for each species is, so it will help us determine where to put our letter labels. Then, we can add our labels to be higher than the largest data point. We will calculate this for each group, so that the letters are always right about our boxplot.

```
<- penguins %>%
body_mass_max filter(sex == "female") %>%
drop_na(body_mass_g, species) %>%
group_by(species) %>%
summarize(max_body_mass = max(body_mass_g))
body_mass_max
```

```
# A tibble: 3 Ã— 2
species max_body_mass
<fct> <int>
1 Adelie 3900
2 Chinstrap 4150
3 Gentoo 5200
```

Letâ€™s add our post-hoc group info to `body_mass_max`

, since those two dataframes are not in the same order. Instead of binding the two dataframes together, we are going to join them using one of the dplyr `_join()`

functions, which allows you to combine dataframes based on a specific common column. The join functions work like this:

`inner_join()`

: includes all rows in x and y.`left_join()`

: includes all rows in x.`right_join()`

: includes all rows in y.`full_join()`

: includes all rows in x or y.

In this case, it doesnâ€™t matter which `_join()`

we use because our dfs all have the exact same rows.

```
<- tidy_tukey %>%
tidier_tukey rownames_to_column() %>% # converts rownames to columns
rename(species = rowname) # renames the column now called rowname to species
# join
<- full_join(tidier_tukey, body_mass_max,
body_mass_for_plotting by = "species")
```

Letâ€™s plot. First using `geom_text()`

```
+
penguins_f_massbyspecies geom_text(data = body_mass_for_plotting,
aes(x = species,
y = 175 + max_body_mass,
label = groups))
```

Next using `annotate()`

.

```
+
penguins_f_massbyspecies annotate(geom = "text",
x = c(3,2,1),
y = 175 + body_mass_for_plotting$max_body_mass,
label = body_mass_for_plotting$groups)
```

## Useful resources

There have been previous Code Club sessions about adding statistics to plots: