Manhattan Plots
Week 11
Introduction
Today we are going to continue putting it together in Module 4. Todayβs material is on making Manhattan plots, which is a commonly used plot type for visualizing the result of genome wide association studies (GWAS). The name comes from its resemblance to the skyscrapers in Manhattan, poking above the background of the rest of the buildings.
The plot visualizes the relationship between a trait and genetic markers. The x-axis shows the position on each chromosome, and the y-axis shows the negative log (usually log10) p-value of the quantitative response of a trait to that specific marker. Negative log10 p-value is used because a significant p-value is always small, and this transformation converts low p-value to a number that can be seen easily among the background of non-significant associations.
If you work in genetics/genomics, it is likely you will create Manhattan plots. Even if you think youβll never make one of these types of plots, its a useful activity to see additional ways of customizing your plots.
library(tidyverse) # everything
library(glue) # easy pasting
library(ggrepel) # repelling labels
Read in data
Today we are going to continue to use some different real research data collected by Emma Bilbrey from my team where we conducted many GWAS in apple. This work was published in 2021 in New Phytologist and can be found here. This data is more complex than a typical GWAS so we are only going to use a small portion of it.
We will be reading in Table S16 which includes the -log10 p-values for the GWAS conducted across all apples for all features found in the LC-MS negative ionization mode metabolomics dataset.
The data is present in a .csv file, so we will use the function read_csv()
from the tidyverse. We want to import Supplemental Table 16.
This will take a second, its a big file.
<- read_csv("data/nph17693-sup-0007-tables16.csv") # be patient gwas
Rows: 11165 Columns: 4704
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
dbl (4704): Index, Linkage_Group, Genetic_Distance, X885.2037_2.98177, X525....
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
What are the dimensions of this dataframe? What kind of object is it?
dim(gwas)
[1] 11165 4704
class(gwas)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Because this dataframe is so big, if we use head(gwas)
we will get a print out of the first 6 rows, and all the columns. In thi case there are 4704 columns so that will be unwieldy.
Emma came up with a simple way to approach this when she was writing her code, she wrote herself a little function that she could use regularly to extract out the first 5 rows, and the first 5 columns, without having to index each time.
If we wanted to just see the first 5 rows, the first 5 columsn we could do this:
1:5,1:5] gwas[
# A tibble: 5 Γ 5
Index Linkage_Group Genetic_Distance X885.2037_2.98177 X525.1583_3.24969
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 0.176 0.117
2 3 1 0.002 0.0978 0.0166
3 4 1 0.003 0.169 0.0519
4 5 1 0.004 0.217 0.309
5 6 1 0.005 0.0548 0.110
<- function(x){
head_short 1:5,1:5] # this function shows the first 5 rows and columns of an object
x[ }
Now instead of indexing all the time, we can just run head_short()
which I think is easier. We will talk a little bit more about writing functions in the class on making many plots at once.
head_short(gwas)
# A tibble: 5 Γ 5
Index Linkage_Group Genetic_Distance X885.2037_2.98177 X525.1583_3.24969
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 0.176 0.117
2 3 1 0.002 0.0978 0.0166
3 4 1 0.003 0.169 0.0519
4 5 1 0.004 0.217 0.309
5 6 1 0.005 0.0548 0.110
Data investigating
How many markers are included here?
nrow(gwas)
[1] 11165
How many linkage groups do we have? (Each linkage group is a chromosome.)
unique(gwas$Linkage_Group)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
What is the range of Genetic_Distance
for each chromosome?
%>%
gwas group_by(Linkage_Group) %>%
summarize(min_genetic_distance = min(Genetic_Distance),
max_genetic_distance = max(Genetic_Distance))
# A tibble: 17 Γ 3
Linkage_Group min_genetic_distance max_genetic_distance
<dbl> <dbl> <dbl>
1 1 0 63.1
2 2 0 78.4
3 3 0 74.0
4 4 0.004 65.5
5 5 0 77.8
6 6 0 75.3
7 7 0.001 82.4
8 8 0 68.5
9 9 0.292 67.0
10 10 0 81.3
11 11 0 80.9
12 12 0.382 65.4
13 13 0 71.4
14 14 0 64.4
15 15 0 112.
16 16 0.001 67.5
17 17 0 71.8
How are the Index
distributed across Linkage_Group
?
%>%
gwas group_by(Linkage_Group) %>%
summarize(min_index = min(Index),
max_index = max(Index))
# A tibble: 17 Γ 3
Linkage_Group min_index max_index
<dbl> <dbl> <dbl>
1 1 1 663
2 2 664 1687
3 3 1688 2630
4 4 2635 3432
5 5 3433 4530
6 6 4533 5266
7 7 5270 5934
8 8 5936 6792
9 9 6793 7623
10 10 7624 8648
11 11 8652 9728
12 12 9731 10719
13 13 10721 11558
14 14 11560 12365
15 15 12366 13605
16 16 13610 14428
17 17 14431 15260
Ok here we can see Index
does not repeat, but Genetic_Distance
restarts with each chromosome.
Manhattan plot: chlorogenic acid
At its core, a Manhattan plot is a scatter plot. The data we are working with has 4701 traits, which here are relative metabolite abundance. We are going to pick one metabolite to start working with.
We will start with the feature that represents chlorogenic acid, a caffeoyl-quinic acid you find in apples. The column we want is X353.09194_2.23795
. The data is already present as the -log10 p-value for the relationship between allelic variation at that marker, and relative abundance of chlorogenic acid.
# rename X353.09194_2.23795 to chlorogenic_acid
<- gwas %>%
gwas rename(chlorogenic_acid = `X353.09194_2.23795`)
%>%
gwas ggplot(aes(x = Index, y = chlorogenic_acid, color = Linkage_Group)) +
geom_point()
See how color is plotted on a continuous scale? This is because Linkage_Group
is a continuous, numeric variable. Since each chromosome is actually discrete, letβs convert Linkage_Group
to a factor and then plot again.
Linkage_Group
as a factor
$Linkage_Group <- as.factor(gwas$Linkage_Group)
gwas
%>%
gwas ggplot(aes(x = Index, y = chlorogenic_acid, color = Linkage_Group)) +
geom_point()
Better but this really isnβt what we want. We want our x-axis to indicate the chromosome number in the middle of the block of that chromosome, not label by Index
which just is a key for linking back to each specific marker.
Set axis
If we want to label the x-axis with breaks for each chromosome, we have to do some wrangling first. Just like we did some calculations in the lesson on adding statistics, we will calculate some min, center, and max for each chromosome so we know where to put the labels.
<- gwas %>%
(set_axis group_by(Linkage_Group) %>%
summarize(min = min(Index),
max = max(Index),
center = (max - min)/2))
# A tibble: 17 Γ 4
Linkage_Group min max center
<fct> <dbl> <dbl> <dbl>
1 1 1 663 331
2 2 664 1687 512.
3 3 1688 2630 471
4 4 2635 3432 398.
5 5 3433 4530 548.
6 6 4533 5266 366.
7 7 5270 5934 332
8 8 5936 6792 428
9 9 6793 7623 415
10 10 7624 8648 512
11 11 8652 9728 538
12 12 9731 10719 494
13 13 10721 11558 418.
14 14 11560 12365 402.
15 15 12366 13605 620.
16 16 13610 14428 409
17 17 14431 15260 414.
%>%
gwas ggplot(aes(x = Index, y = chlorogenic_acid, color = Linkage_Group)) +
geom_point() +
scale_x_continuous(breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
theme_classic() +
theme(legend.position = "none") + # legend not really necessary
labs(x = "Chromosome",
y = expression("-log"[10]*"P-Value"),
title = "GWAS of chlorogenic acid in apple")
Alternate colors
Having a rainbow of colors is not really necessary here,a nd in fact telling exactly where chromosome 15 ends and 16 begins is difficult because the colors are so similar.
What you will see in a lot of papers is people simply alternate the colors of their points by chromosome so you can easily tell which points belong to which chromosome.
%>%
gwas ggplot(aes(x = Index, y = chlorogenic_acid, color = Linkage_Group)) +
geom_point() +
scale_x_continuous(breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "darkgray"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]*"P-Value"),
title = "Manhattan Plot after GWAS for Chlorogenic Acid in Apple")
Removing that annoying front gap
The gap between chromosome 1 and the y-axis of the plot sort of bothers me. Letβs remove it.
%>%
gwas ggplot(aes(x = Index, y = chlorogenic_acid, color = Linkage_Group)) +
geom_point() +
scale_x_continuous(expand = c(0,0), # remove gap between y-axis and chr1
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "grey52"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]*"P-Value"),
title = "Manhattan Plot after GWAS for Chlorogenic Acid in Apple")
Add p-value hline
# what would the pvalue cut off with a bonferroni correction be?
<- -log10(0.05/nrow(gwas))
bonferroni_pval
%>%
gwas ggplot(aes(x = Index, y = chlorogenic_acid, color = Linkage_Group)) +
geom_point() +
geom_hline(yintercept = bonferroni_pval, color = "grey", linetype = "dashed") +
scale_x_continuous(expand = c(0,0),
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "darkgray"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]*"P-Value"),
title = "Manhattan Plot after GWAS for Chlorogenic Acid in Apple")
Color sig points
# select all SNPs with -log10 pvalue > bonferroni cutoff for chlorogenic acid
<- gwas %>%
chlorogenic_acid_sig filter(chlorogenic_acid > bonferroni_pval) %>%
select(Index, Linkage_Group, Genetic_Distance, chlorogenic_acid)
%>%
gwas ggplot(aes(x = Index, y = chlorogenic_acid, color = Linkage_Group)) +
geom_point() +
geom_point(data = chlorogenic_acid_sig,
aes(x = Index, y = chlorogenic_acid), color = "red") +
geom_hline(yintercept = bonferroni_pval, color = "grey", linetype = "dashed") +
scale_x_continuous(expand = c(0,0),
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "darkgray"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]*"P-Value"),
title = "Manhattan Plot after GWAS for Chlorogenic Acid in Apple")
Label most sig marker
We might be interested to know the marker that has the most significant association with chlorogenic acid content, and label it on our plot.
<- chlorogenic_acid_sig %>%
biggest_pval filter(chlorogenic_acid == max(chlorogenic_acid))
%>%
gwas ggplot(aes(x = Index, y = chlorogenic_acid, color = Linkage_Group)) +
geom_point() +
geom_point(data = chlorogenic_acid_sig,
aes(x = Index, y = chlorogenic_acid), color = "red") +
geom_label_repel(data = biggest_pval,
aes(x = Index, y = chlorogenic_acid, label = glue("Index: {Index}"))) +
geom_hline(yintercept = bonferroni_pval, color = "grey", linetype = "dashed") +
scale_x_continuous(expand = c(0,0),
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "darkgray"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]*"P-Value"),
title = "Manhattan Plot after GWAS for Chlorogenic Acid in Apple")
Investigating other traits
In this study, we conducted a series of GWAS on thousands of metabolomic features in apple. What if we wanted to see Manhattan plots for certain features based on how important we could predict they would be? For example, what if we want to see the Manhattan plot for the feature with biggest -log10p-value? Or the feature that has a significant association with the largest number of markers?
To make this wrangling easier, we will convert our data, as we have many times before, from wide to long with pivot_longer()
.
Wide to long (again)
<- gwas %>%
gwas_tidy pivot_longer(cols = starts_with("X"),
names_to = "Feature",
values_to = "NegLog10P")
Set p-value cutoff
We can make another df that includes only the features that have at least one marker where there is a significant p-value.
# make df of associations that pass bonferroni correction
<- gwas_tidy %>%
gwas_tidy_bonferroni filter(NegLog10P > bonferroni_pval)
# how many unique features are there?
length(unique(gwas_tidy_bonferroni$Feature))
[1] 962
# how many unique markers are there?
length(unique(gwas_tidy_bonferroni$Index))
[1] 544
There are 962 unique features/metabolite that have a Bonferroni adjusted significant p-value with at least one marker. There are 544 unique markers that have a Bonferroni adjusted significant p-value with at least one feature/metabolite.
Data investigating
What features are associated with the largest number of markers?
%>%
gwas_tidy_bonferroni group_by(Feature) %>%
count() %>%
arrange(desc(n))
# A tibble: 962 Γ 2
# Groups: Feature [962]
Feature n
<chr> <int>
1 X417.13237_1.82968 46
2 X349.15073_1.79191 44
3 X601.13217_2.40546 34
4 X593.12835_2.53465 31
5 X291.0768_2.44657 30
6 X591.1485_2.86273 30
7 X637.09169_2.78692 30
8 X661.08791_2.10005 30
9 X137.02484_2.44808 29
10 X561.13983_2.53357 29
# βΉ 952 more rows
Wow, the marker X417.13237_1.82968
has significant associations with 46 markers. What would that Manhattan plot look like?
%>%
gwas_tidy filter(Feature == "X417.13237_1.82968") %>%
ggplot(aes(x = Index, y = NegLog10P, color = Linkage_Group)) +
geom_point() +
geom_hline(yintercept = bonferroni_pval, color = "grey", linetype = "dashed") +
scale_x_continuous(expand = c(0,0),
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "darkgray"), 17)) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
labs(x = "Chromosome",
y = expression("-log"[10]*"P-Value"),
title = "Manhattan Plot after GWAS for 417.13237 m/z at retention time 1.82968 in Apple")
Making many plots at once
What if we want to make Manhattan plots for the 50 features/metabolites that are associated with the most markers? This is probably too many plots to facet, so we can do some calculations, and then write a function to make plots, and apply it over our dataframe.
First, how many significant associations with a Bonferroni multiple testing correction are there?
# make df of associations that pass bonferroni correction
<- gwas_tidy %>%
gwas_tidy_bonferroni filter(NegLog10P > bonferroni_pval)
# how many unique features are this?
%>%
gwas_tidy_bonferroni count(Feature)
# A tibble: 962 Γ 2
Feature n
<chr> <int>
1 X1000.22158_2.71331 12
2 X1000.72569_2.72017 2
3 X1001.23392_2.70506 5
4 X1008.71962_2.91507 3
5 X1008.72084_2.64352 2
6 X1009.22592_2.91262 5
7 X1009.72353_2.64479 2
8 X1010.22369_2.64604 2
9 X1010.72865_2.64538 1
10 X1014.70219_2.12784 2
# βΉ 952 more rows
# how many unique markers are there?
%>%
gwas_tidy_bonferroni count(Index)
# A tibble: 544 Γ 2
Index n
<dbl> <int>
1 170 1
2 217 1
3 218 1
4 233 2
5 294 1
6 311 1
7 341 2
8 368 1
9 386 4
10 520 6
# βΉ 534 more rows
Which features are associated with the largest number of markers?
%>%
gwas_tidy_bonferroni count(Feature) %>%
arrange(desc(n))
# A tibble: 962 Γ 2
Feature n
<chr> <int>
1 X417.13237_1.82968 46
2 X349.15073_1.79191 44
3 X601.13217_2.40546 34
4 X593.12835_2.53465 31
5 X291.0768_2.44657 30
6 X591.1485_2.86273 30
7 X637.09169_2.78692 30
8 X661.08791_2.10005 30
9 X137.02484_2.44808 29
10 X561.13983_2.53357 29
# βΉ 952 more rows
Which markers are associated with the largest number of features?
%>%
gwas_tidy_bonferroni count(Index) %>%
arrange(desc(n))
# A tibble: 544 Γ 2
Index n
<dbl> <int>
1 13684 320
2 13685 318
3 13681 317
4 13715 298
5 13657 223
6 13660 223
7 13675 221
8 13630 219
9 13623 199
10 13617 198
# βΉ 534 more rows
We will make a new df that includes only the 50 features with the most makers associated with them.
# create a df with only the top 50 features with the most marker associations
<- gwas_tidy_bonferroni %>%
top50 count(Feature) %>%
arrange(desc(n)) %>%
slice_head(n = 50)
# filter the whole dataset to include only these features
<- gwas_tidy %>%
gwas_top50 filter(Feature %in% top50$Feature)
# go from long to wide
<- gwas_top50 %>%
gwas_top50_wide pivot_wider(names_from = Feature, values_from = NegLog10P)
# what are our new dimensions?
dim(gwas_top50_wide)
[1] 11165 54
head_short(gwas_top50_wide)
# A tibble: 5 Γ 5
Index Linkage_Group Genetic_Distance chlorogenic_acid X599.12186_2.10421
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 1 0 0.361 0.266
2 3 1 0.002 0.239 0.0346
3 4 1 0.003 1.03 0.120
4 5 1 0.004 0.188 0.497
5 6 1 0.005 1.23 0.00356
Writing a function to plot
Here, we are just modifying our plotting code slightly to allow it to be run across different features. The first thing we will do is to use our favorite function pivot_longer()
to create tidy data.
<- gwas_top50_wide %>%
gwas_top50_long pivot_longer(cols = starts_with("X"),
names_to = "feature",
values_to = "pvalue")
Then we can write a function to plot, where we will iterate across feature_of_interest
. Here, feature_of_interest
is just the name Iβve assigned here, but you could easily call it x
or i
or whatever.
# write a function to make your plots across the features of interest
<- function(feature_of_interest){
manhattan_plot %>% # our df with only the top 50, but long
gwas_top50_long filter(feature == feature_of_interest) %>% # pick the feature_of_interest only
ggplot(aes(x = Index, y = pvalue, color = Linkage_Group)) +
geom_point() +
geom_hline(yintercept = bonferroni_pval, color = "grey", linetype = "dashed") +
scale_x_continuous(expand = c(0,0),
breaks = (set_axis$center + set_axis$min),
labels = set_axis$Linkage_Group) +
scale_color_manual(values = rep(c("black", "gray"),17)) +
labs(x = "Chromosome",
y = expression("-log"[10]*"P-Value"),
title = glue("{feature_of_interest}")) + # here we glue the feature name in the title
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
}
Before trying to use our function on 50 features, letβs try it out on one. We can provide our feature of interest as a string.
manhattan_plot("X599.12186_2.10421")
We are going to iterate over the names features_to_plot
. Iβm creating a vector of the names of the features we want to iterate over.
<- unique(gwas_top50_long$feature) features_to_plot
Applying the function with lapply()
Once we have our function written, we can use a function in the apply()
family of functions (here, lapply()
which applies and creates a list). Here Iβm just printing the first 6 plots in the list.
# use lapply to run your function over the features of interest
# if you don't want your plots to print, you should assign them to something
<- lapply(X = features_to_plot, # what to iterate over
my_plots FUN = manhattan_plot) # what function to use
# print the first 6
1:6] my_plots[
[[1]]
[[2]]
[[3]]
[[4]]
[[5]]
[[6]]
Saving out plots
But you can print them all, save particular ones usingggsave()
, or do what we are going to do here, which is save each of them to a new folder, each as their own .svg
because why use raster when you can vectorize.
First we will create a vector of what we want our file names to look like. Then we will save.
# use str_c to combine two character vectors
# here, features_to_plot and adding .svg so the file name
# includes the extension type
# then set that as the names for my_plots
names(my_plots) <- str_c(features_to_plot, ".svg")
# use pwalk to "walk" across the different plots and save them
pwalk(list(names(my_plots), my_plots), # what to iterate over and output
# what the function is
ggsave, path = "img/") # where they should go
Now all of your plots are in your working directory. Remember, you need to add the directory img
if you want to save with the code Iβm using here.