Visualizing Correlations

Week 8

Author

Jessica Cooperstone

A cartoon about how correlation is not causation

Figure from XKCD

Introduction

We will will building on our lessons on ggplot101 and ggplot102 which focused on an overall understanding of the grammar of graphics, basic syntax, adding data, aesthetic mappings, geoms, facets, scales, labels, and themes. Today we are going to apply what we learned towards trying to better understanding and visualize correlations within our data. To do this we will also use some ggplot extension packages.

Load libraries and data

Before we get started, let’s load our libraries.

library(tidyverse)

Today we are going to continue to use the same real research data from my group from last week. We will be reading in the supplementary data from a paper written by Michael Dzakovich, and published in The Plant Genome. The data is present in a Excel worksheet, so we will use the function read_excel() from the tidyverse (but not core tidyverse) package readxl. We want to import Supplemental Table 1. You can indicate which sheet you want to import in the arguments to read_excel().

alkaloids <- readxl::read_excel("tpg220192-sup-0002-supmat.xlsx",
                                sheet = "S1 Raw Data Diversity Panel")
knitr::kable(head(alkaloids))
ID Year Environment Block Genotype Plot_Source Class Origin Provence Blanca_Cluster1 Blanca_Cluster2 Passport_Species Passport_Classification Sim_Grouping Latitude Longitude Dehydrotomatidine Tomatidine Dehydrotomatine1 Dehydrotomatine2 TotalDehydrotomatine Tomatine Hydroxytomatine1 Hydroxytomatine2 Hydroxytomatine3 Hydroxytomatine4 TotalHydroxytomatine Acetoxytomatine1 Acetoxytomatine2 Acetoxytomatine3 TotalAcetoxytomatine DehydrolycoperosideFGdehydroesculeosideA LycoperosideFGEsculeosideA1 LycoperosideFGEsculeosideA2 TotalLycoperosideFGEsculeosideA EsculeosideB1 EsculeosideB2 EsculeosideB3 TotalEsculeosideB Total
7805 2018 FreEarly18 1 CULBPT_05_11 2K17-7724 Cultivated Processing USA NY SLL_processing_2 SLL_processing_2 SLL SLL_processing_NY Arid 40.712800000000001 -74.006 0.000000 0.000000 5.726010 0.350331 6.076341 172.66244 1.079190 86.72742 17.831892 9.142607 114.78111 18.902399 56.307182 1.890053 77.099634 5.125904 10.277325 336.8893 347.1666 3.787979 0.924195 3.943230 8.655404 731.5675
7898 2017 Fre17 2 CULBPT_05_11 2K9-8584 Cultivated Processing USA NY SLL_processing_2 SLL_processing_2 SLL SLL_processing_NY Arid 40.712800000000001 -74.006 0.000000 0.169068 0.000000 0.000000 0.000000 55.47329 0.000000 53.32292 13.630697 4.841762 71.79538 3.557348 4.107289 0.000000 7.664637 2.905500 5.548102 199.6694 205.2175 8.978931 1.897850 6.794690 17.671471 360.8969
7523 2018 FreLate18 2 CULBPT_05_11 2K17-7724 Cultivated Processing USA NY SLL_processing_2 SLL_processing_2 SLL SLL_processing_NY Arid 40.712800000000001 -74.006 0.135675 0.680554 5.073552 0.000000 5.073552 123.85835 0.000000 50.90989 6.503939 1.368847 58.78268 3.931461 4.123222 0.623340 8.678023 2.185082 5.104115 259.0177 264.1218 4.049145 0.000000 6.749386 10.798531 474.3143
7724 2017 Fre17 1 CULBPT_05_11 2K9-8584 Cultivated Processing USA NY SLL_processing_2 SLL_processing_2 SLL SLL_processing_NY Arid 40.712800000000001 -74.006 0.054300 0.497261 19.419087 0.000000 19.419087 239.01264 0.000000 36.02318 8.557673 7.483933 52.06478 3.341048 16.415426 1.057100 20.813574 0.000000 0.000000 203.0061 203.0061 1.678210 0.000000 2.349633 4.027843 538.8955
7427 2018 FreLate18 1 CULBPT_05_11 2K17-7724 Cultivated Processing USA NY SLL_processing_2 SLL_processing_2 SLL SLL_processing_NY Arid 40.712800000000001 -74.006 0.139454 0.553801 0.000000 0.000000 0.000000 64.31783 0.879435 39.91027 7.228388 3.015298 51.03339 0.000000 3.131685 0.000000 3.131685 0.000000 4.054211 299.5687 303.6229 10.146857 0.000000 4.882339 15.029197 437.8283
7854 2018 FreEarly18 2 CULBPT_05_11 2K17-7724 Cultivated Processing USA NY SLL_processing_2 SLL_processing_2 SLL SLL_processing_NY Arid 40.712800000000001 -74.006 0.049700 0.262174 3.737579 0.000000 3.737579 68.44913 0.000000 23.86864 13.506299 1.456982 38.83192 4.657902 4.259007 0.605729 9.522638 9.832149 11.595595 459.5205 471.1161 6.839930 0.486236 5.595751 12.921917 614.7233

This dataset has 605 observations, with data about different steroidal alkaloids in the fruits of different tomato germplasm grown in 3 locations across 2 years. There is also some other metadata too.

For those who are chemistry minded, here is a little pathway context for the compounds we are investigating today.

Scatterplots

A very simple first pass way to understand if you have relationships within your data is to make scatterplots of the variables you think might be correlated. Let’s start by investigating how the different alkaloid concentrations are correlated to each other. First we will see how alpha-tomatine content (Tomatine) is related to total steroidal alkaloid content (Total).

alkaloids %>%
  ggplot(aes(x = Total, y = Tomatine)) +
  geom_point() +
  labs(x = "Total Steroidal Alkaloids (µg/100 g)",
       y = "Alpha-Tomatine (µg/100 g)")

It seems like there are two separate groups here - the points with a steeper slope, and the points with a less steep slope. We can color our points based on what Class of tomato the data comes from, maybe that will reveal something. In the meanwhile let’s make this plot look a bit nicer. The package scales has some nice functions that help you control the scaling of your plots, in this case, making each of the axes have numbers in comma_format(). I also am using the hex codes for a color-blind friendly qualitative color scheme developed by Paul Tol.

library(scales)

Attaching package: 'scales'
The following object is masked from 'package:purrr':

    discard
The following object is masked from 'package:readr':

    col_factor
alkaloids %>%
  ggplot(aes(x = Total, y = Tomatine, color = Class)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous(labels = comma_format(big.mark = ",")) + # requires the package scales
  scale_y_continuous(labels = comma_format(big.mark = ",")) + # requires the package scales
  scale_color_manual(values = c("#4477AA", "#EE6677", "#228833", "#CCBB44", "#66CCEE")) +
  theme_minimal() +
  labs(x = "Total Steroidal Alkaloids (µg/100 g fresh weight)",
       y = "Alpha-Tomatine (µg/100 g fresh weight)",
       title = "Relationship between Alpha-Tomatine and Total Steroidal Alkaloids \nAcross Different Germplasm in the Red Tomato Clade")

All of the tomatoes in the two extremes of this plot are from the Class Wild Cherry. What would this look like if we removed these fruits? Note, I adjusted the color scale to remove the hex code associated with Wild Cherry but keeping the other colors the same.

alkaloids %>%
  filter(Class != "Wild Cherry") %>%
  ggplot(aes(x = Total, y = Tomatine, color = Class)) +
  geom_point() +
  scale_x_continuous(labels = comma_format(big.mark = ",")) + # requires the package scales
  scale_y_continuous(labels = comma_format(big.mark = ",")) + # requires the package scales
  scale_color_manual(values = c("#4477AA", "#EE6677", "#228833", "#CCBB44")) +
  theme_minimal() +
  labs(x = "Total Steroidal Alkaloids (µg/100 g fresh weight)",
       y = "Alpha-Tomatine (µg/100 g fresh weight)",
       title = "Relationship between Alpha-Tomatine and Total Steroidal Alkaloids \nAcross Different Germplasm in the Red Tomato Clade")

Adding geom_smooth()

alkaloids %>%
  ggplot(aes(x = Total, y = Tomatine, color = Class)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "lm") +
  scale_x_continuous(labels = comma_format(big.mark = ",")) + # requires the package scales
  scale_y_continuous(labels = comma_format(big.mark = ",")) + # requires the package scales
  scale_color_manual(values = c("#4477AA", "#EE6677", "#228833", "#CCBB44", "#66CCEE")) +
  theme_minimal() +
  labs(x = "Total Steroidal Alkaloids (µg/100 g fresh weight)",
       y = "Alpha-Tomatine (µg/100 g fresh weight)",
       title = "Relationship between Alpha-Tomatine and Total Steroidal Alkaloids \nAcross Different Germplasm in the Red Tomato Clade")
`geom_smooth()` using formula = 'y ~ x'

Faceted scatterplots

We may be able to see trends by tomato class more easily if we facet our scatterplots. I also am demonstrating here how within the ggplot function you can make alter the aesthetics you plot - here I am turning data that is present as µg/100 g to mg/100 g by dividing by 1000 and changing the axis labels accordingly.

alkaloids %>%
  ggplot(aes(x = Total/1000, y = Tomatine/1000, color = Class)) +
  geom_point(alpha = 0.8) +
  scale_color_manual(values = c("#4477AA", "#EE6677", "#228833", "#CCBB44", "#66CCEE")) +
  facet_wrap(vars(Class), scales = "free") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "Total Steroidal Alkaloids (mg/100 g fresh weight)",
       y = "Alpha-Tomatine (mg/100 g fresh weight)",
       title = "Relationship between Alpha-Tomatine and Total Steroidal Alkaloids \nAcross Different Germplasm in the Red Tomato Clade")

Correlation matrix with cor()

cor() is a function from base R that will allow you to create a correlation matrix.

Before we use cor() we will clean up our dataset to include only the variables we want to correlate.

colnames(alkaloids)
 [1] "ID"                                      
 [2] "Year"                                    
 [3] "Environment"                             
 [4] "Block"                                   
 [5] "Genotype"                                
 [6] "Plot_Source"                             
 [7] "Class"                                   
 [8] "Origin"                                  
 [9] "Provence"                                
[10] "Blanca_Cluster1"                         
[11] "Blanca_Cluster2"                         
[12] "Passport_Species"                        
[13] "Passport_Classification"                 
[14] "Sim_Grouping"                            
[15] "Latitude"                                
[16] "Longitude"                               
[17] "Dehydrotomatidine"                       
[18] "Tomatidine"                              
[19] "Dehydrotomatine1"                        
[20] "Dehydrotomatine2"                        
[21] "TotalDehydrotomatine"                    
[22] "Tomatine"                                
[23] "Hydroxytomatine1"                        
[24] "Hydroxytomatine2"                        
[25] "Hydroxytomatine3"                        
[26] "Hydroxytomatine4"                        
[27] "TotalHydroxytomatine"                    
[28] "Acetoxytomatine1"                        
[29] "Acetoxytomatine2"                        
[30] "Acetoxytomatine3"                        
[31] "TotalAcetoxytomatine"                    
[32] "DehydrolycoperosideFGdehydroesculeosideA"
[33] "LycoperosideFGEsculeosideA1"             
[34] "LycoperosideFGEsculeosideA2"             
[35] "TotalLycoperosideFGEsculeosideA"         
[36] "EsculeosideB1"                           
[37] "EsculeosideB2"                           
[38] "EsculeosideB3"                           
[39] "TotalEsculeosideB"                       
[40] "Total"                                   

From looking at the colnames and reading the supplemental information, we can see that some columns are composites of others. For example, the column TotalAcetoxytomatine = Acetoxytomatine1 + Acetoxytomatine2 + Acetoxytomatine3. So we want to pull only the columns that represent the total for any given alkaloids. There should be 10 columns.

# create a vector of the names we want to keep
alkaloid_total_names <- c("Dehydrotomatidine",
                          "Tomatidine",
                          "TotalDehydrotomatine",
                          "Tomatine",
                          "TotalHydroxytomatine",
                          "TotalAcetoxytomatine",
                          "DehydrolycoperosideFGdehydroesculeosideA",
                          "TotalLycoperosideFGEsculeosideA",
                          "TotalEsculeosideB",
                          "Total")

# make a new df including some metadata and the alkaloid_total_names
alkaloids_totals <- alkaloids %>%
  select(ID, Year, Environment, Block, Genotype, Class, all_of(alkaloid_total_names))

# did it work? look at colnames()
colnames(alkaloids_totals)
 [1] "ID"                                      
 [2] "Year"                                    
 [3] "Environment"                             
 [4] "Block"                                   
 [5] "Genotype"                                
 [6] "Class"                                   
 [7] "Dehydrotomatidine"                       
 [8] "Tomatidine"                              
 [9] "TotalDehydrotomatine"                    
[10] "Tomatine"                                
[11] "TotalHydroxytomatine"                    
[12] "TotalAcetoxytomatine"                    
[13] "DehydrolycoperosideFGdehydroesculeosideA"
[14] "TotalLycoperosideFGEsculeosideA"         
[15] "TotalEsculeosideB"                       
[16] "Total"                                   

Now we can create a correlation matrix to see how each of our 10 alkaloids is correlated to the concentration of each other alkaloid (including the compile metric of Total which sums all the alkaloids). The default for cor() is to use Pearson’s correlation coefficient, but you can set to use Spearman method = "spearman" or Kendall method = "kendall" if you prefer. Check the documentation for cor() for more information.

alkaloids_cor <- alkaloids_totals %>%
  select(all_of(alkaloid_total_names)) %>%
  cor()

# look at our correlation matrix
knitr::kable(alkaloids_cor) # kable makes a nicely formatted table
Dehydrotomatidine Tomatidine TotalDehydrotomatine Tomatine TotalHydroxytomatine TotalAcetoxytomatine DehydrolycoperosideFGdehydroesculeosideA TotalLycoperosideFGEsculeosideA TotalEsculeosideB Total
Dehydrotomatidine 1.0000000 0.2974462 0.0324918 0.0238230 0.0099126 0.0322029 0.0305049 0.0761907 0.0282219 0.0708252
Tomatidine 0.2974462 1.0000000 0.3744672 0.3736949 0.1003558 0.0382981 -0.0059964 0.0373649 0.0126724 0.2044979
TotalDehydrotomatine 0.0324918 0.3744672 1.0000000 0.9214859 0.2290192 0.4011257 -0.0820469 -0.1149682 -0.1217560 0.5636969
Tomatine 0.0238230 0.3736949 0.9214859 1.0000000 0.0995212 0.1220596 -0.1140360 -0.1357819 -0.1260377 0.3756155
TotalHydroxytomatine 0.0099126 0.1003558 0.2290192 0.0995212 1.0000000 0.3563506 0.0330078 0.0284887 0.0134806 0.4774036
TotalAcetoxytomatine 0.0322029 0.0382981 0.4011257 0.1220596 0.3563506 1.0000000 -0.0865506 -0.1106212 -0.0947254 0.6782337
DehydrolycoperosideFGdehydroesculeosideA 0.0305049 -0.0059964 -0.0820469 -0.1140360 0.0330078 -0.0865506 1.0000000 0.8862982 0.7401116 0.4792108
TotalLycoperosideFGEsculeosideA 0.0761907 0.0373649 -0.1149682 -0.1357819 0.0284887 -0.1106212 0.8862982 1.0000000 0.7779405 0.5222751
TotalEsculeosideB 0.0282219 0.0126724 -0.1217560 -0.1260377 0.0134806 -0.0947254 0.7401116 0.7779405 1.0000000 0.4059824
Total 0.0708252 0.2044979 0.5636969 0.3756155 0.4774036 0.6782337 0.4792108 0.5222751 0.4059824 1.0000000

Note the diagonal is all composed of 1s. This makes sense because the correlation of each alkaloid with itself is 1.

Using ggcorrplot() from ggcorrplot

Use the function ggcorrplot() without any additional arguments besides the correlation matrix alkaloids_cor. In general, I think if you want to make a bunch of correlation plots quickly, and don’t intend to publish them, `ggcorrplot() works well, but the visuals of the plot are quite difficult to customize.

library(ggcorrplot)

ggcorrplot(alkaloids_cor)

This is not a perfect plot but its a good starting point. Correlation matrices are inherently symmetric, meaning if we display only the top or bottom triangle, we do not lose any information. We will work on editing this plot in different ways to show more information and make it more beautiful.

We could also make the plot circles instead of squares at the same time.

ggcorrplot(alkaloids_cor, 
           method = "circle",
           type = "lower")

In general, I think if you want to make a bunch of correlation plots quickly, and don’t intend to publish them, `ggcorrplot() works well, but the visuals of the plot are quite difficult to customize.

Using corrplot() from corrplot

Similarly, you can use a base R plotting based package corrplot() to make correlation plots. The customization syntax here is quite different from what we’ve been working with in ggplot, but I wanted you to feel familiar with some base R tools.

library(corrplot)
corrplot 0.92 loaded
corrplot(alkaloids_cor, type = "lower")

I have used corrplot() in publications before and felt like I couldn’t customize the plots as much as I wanted. In the process of putting together this content, I learned some news ways to customize these plots that are actually very nice. Here are some parameters you can modify in R. You can also order your variables by hierarchical clustering.

First we will start (as we always do) by wrangling.

# create matrix for correlation
alkaloids_to_cor <- alkaloids_totals %>%
  select(all_of(alkaloid_total_names)) %>%
  as.matrix() # rcorr() needs a matrix

library(Hmisc) # does cor() but also computes significance levels

Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':

    src, summarize
The following objects are masked from 'package:base':

    format.pval, units
# create a matrix of pvalues for the correlations
alkaloids_rcorr = rcorr(alkaloids_to_cor, type = "pearson")

# create a vector of the alkaloid names for labeling
alkaloid_labels <- c("Dehydrotomatidine",
                     "Tomatidine",
                     "Dehydrotomatine",
                     "Alpha-Tomatine",
                     "Hydroxytomatine",
                     "Acetoxytomatine",
                     "Dehydrlycoperoside F, G, \nor Dehydroescueloside A",
                     "Lycoperoside F, G, \nor Escueloside A",
                     "Escueloside B",
                     "Total Steroidal Alkaloids")

# change row and column names of the correlation matrix
# so they are how we want them to be plotted
colnames(alkaloids_rcorr$r) <- alkaloid_labels
rownames(alkaloids_rcorr$r) <- alkaloid_labels

# change row and column names of the pvalue matrix
# so they are how we want them to be plotted
colnames(alkaloids_rcorr$P) <- alkaloid_labels
rownames(alkaloids_rcorr$P) <- alkaloid_labels

Now we are ready to plot

corrplot(alkaloids_rcorr$r, # the correlation matrix
         type = "lower", # lower triangle
         tl.col = "black", # axis labels are black
         p.mat  = alkaloids_rcorr$P, # pvalue matrix
         sig.level = 0.05, # how sig does a cor need to be to be included
         insig = "blank", # do not display insignificant correlations
         addCoef.col = "black", # display correlations in black
         diag = FALSE, # don't show the diagonal (because this is all 1)
         number.cex = 0.6) # size of correlation font

GGally

ggcorr()

Another ggplot extension package ggally has the function ggcorr() which also allows the creation of correlation plots, but ones that are more easily customizable. ggcorr() objects are moderately customizable. They make work for some of you so I’m sharing how to make them.

Note, GGally::ggcorr() does not take a correlation matrix, but instead takes the data you want to make a correlation matrix for. You can specific the method of correlation in the arguments. The default is Pearson’s correlation.

library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
to_corr <- alkaloids_totals %>%
  select(all_of(alkaloid_total_names)) 

ggcorr(to_corr)

There is only one labeled axis - this is because there is no diagonal in these plots, like we saw with ggcorrplot() and corrplot().

We can now spend some time improving the aesthetics of our plot.

ggcorr(to_corr, # data for correlation
       low = "#f1a340", # -1 correlation color
       mid = "#f7f7f7", # 0 correlation color
       high = "#998ec3") # 1 correlation color

ggcorr(to_corr,
       low = "#f1a340", mid = "#f7f7f7", high = "#998ec3",
       geom = "circle",
       label = TRUE, 
       label_size = 2, 
       label_round = 2,
       layout.exp = 3)

For this example, we have very long label names which are really difficult to wrap, but if your labels are more reasonable this may work well for you.

ggpairs()

We can also use the function GGally::ggpairs() to make a matrix of correlation related plots.

alkaloids_totals %>%
  ggpairs(columns = c("Tomatine", "TotalLycoperosideFGEsculeosideA", "Total"), # pick variables
          aes(color = Class))

Let’s customized a bit.

# remove zeroes since they don't log transform
# make log transformed columns
alkaloids_totals_log <- alkaloids_totals %>%
  filter(Tomatine != 0, 
         TotalLycoperosideFGEsculeosideA != 0,
         Total != 0) %>%
  mutate(log10_tomatine = log10(Tomatine),
         log10_FGA = log10(TotalLycoperosideFGEsculeosideA),
         log10_total = log10(Total))

alkaloids_totals_log %>%
    ggpairs(columns = c("log10_tomatine", "log10_FGA", "log10_total"),
          aes(color = Class, alpha = 0.5), # note alpha inside aes which is weird idk why
          columnLabels = c("Alpha-Tomatine", "Lycoperoside F/G\n Escueloside A", "Total Alkaloids"))

Manually making correlation plots with reshape::melt() and ggplot

Because some of the correlation specific packages are hard to customize, I am going to show you how to make your own plots by reshaping your data with reshape2::melt() and some base R functions, and plotting using the standard ggplot syntax.

library(reshape2) # contains melt()

Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':

    smiths
# take cor matrix and convert to df with 3 columns: Var1, Var2, and value
melted_alkaloids_cor <- melt(alkaloids_cor)

# what does it look like?
head(melted_alkaloids_cor)
                  Var1              Var2       value
1    Dehydrotomatidine Dehydrotomatidine 1.000000000
2           Tomatidine Dehydrotomatidine 0.297446153
3 TotalDehydrotomatine Dehydrotomatidine 0.032491778
4             Tomatine Dehydrotomatidine 0.023823011
5 TotalHydroxytomatine Dehydrotomatidine 0.009912624
6 TotalAcetoxytomatine Dehydrotomatidine 0.032202892

First pass minimalist plotting

melted_alkaloids_cor %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) +
  geom_tile()

Lots to fix! What if we want only the upper or lower triangle, again since this plot is symmetric.

Upper triangle

Keep only the upper triangle.

# "save as"
alkaloids_upper <- alkaloids_cor

# use function lower.tri() and set the lower triangle all to NA
# then we can keep only the upper triangle
alkaloids_upper[lower.tri(alkaloids_upper)] <- NA

# melt to go back to long format
melted_alkaloids_upper <- melt(alkaloids_upper, na.rm = TRUE)

# did it work?
head(melted_alkaloids_upper) # yup
                   Var1                 Var2      value
1     Dehydrotomatidine    Dehydrotomatidine 1.00000000
11    Dehydrotomatidine           Tomatidine 0.29744615
12           Tomatidine           Tomatidine 1.00000000
21    Dehydrotomatidine TotalDehydrotomatine 0.03249178
22           Tomatidine TotalDehydrotomatine 0.37446722
23 TotalDehydrotomatine TotalDehydrotomatine 1.00000000

Lower triangle

Create a lower triangle object to plot.

# "save as"
alkaloids_lower <- alkaloids_cor

# use function upper.tri() and set the upper triangle all to NA
# then we can keep only the lower triangle
alkaloids_lower[upper.tri(alkaloids_lower)] <- NA

# melt to go back to long format
melted_alkaloids_lower <- melt(alkaloids_lower, na.rm = TRUE)

# did it work?
head(melted_alkaloids_lower) # yup
                  Var1              Var2       value
1    Dehydrotomatidine Dehydrotomatidine 1.000000000
2           Tomatidine Dehydrotomatidine 0.297446153
3 TotalDehydrotomatine Dehydrotomatidine 0.032491778
4             Tomatine Dehydrotomatidine 0.023823011
5 TotalHydroxytomatine Dehydrotomatidine 0.009912624
6 TotalAcetoxytomatine Dehydrotomatidine 0.032202892

Plot

# remember we made alkaloid_labels
print(alkaloid_labels)
 [1] "Dehydrotomatidine"                                 
 [2] "Tomatidine"                                        
 [3] "Dehydrotomatine"                                   
 [4] "Alpha-Tomatine"                                    
 [5] "Hydroxytomatine"                                   
 [6] "Acetoxytomatine"                                   
 [7] "Dehydrlycoperoside F, G, \nor Dehydroescueloside A"
 [8] "Lycoperoside F, G, \nor Escueloside A"             
 [9] "Escueloside B"                                     
[10] "Total Steroidal Alkaloids"                         
melted_alkaloids_lower %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black") +
  scale_fill_gradient2(low = "#f1a340",
                       mid = "#f7f7f7",
                       high = "#998ec3",
                       limits = c(-1, 1)) +
  scale_x_discrete(labels = alkaloid_labels) +
  scale_y_discrete(labels = alkaloid_labels) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.justification = c(1, 0),
        legend.position = c(0.5, 0.7),
        legend.direction = "horizontal") +
  labs(fill = "Correlation \ncoefficient",
       x = "",
       y ="",
       title = "Correlation between steroidal alkaloids using \nPearson's correlation coefficient")

Useful resources

Back to top