# Principal Components Analysis Recitation Solutions π

Week 10

## Introduction

Today is the first recitation for Module 4 where we put together a lot of the material weβve learned in the first 3 modules of this course. Todayβs material is on conducting principal components analysis (PCA) using R, and visualizing the results with some tools weβve already learned to use, and some new wrangling and viz tips along the way.

```
library(tidyverse) # everything
library(readxl) # reading in excel sheets
library(factoextra) # easy PCA plotting
library(glue) # easy pasting
library(ggrepel) # repelling labels away from their points
library(patchwork) # for combining and arranging plots
```

#### Read in data

We will be using data about pizza, which includes data collected about the nutritional information of 300 different grocery store pizzas, from 10 brands.

`<- read_csv(file = "https://raw.githubusercontent.com/f-imp/Principal-Component-Analysis-PCA-over-3-datasets/master/datasets/Pizza.csv") pizza `

```
Rows: 300 Columns: 9
ββ Column specification ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delimiter: ","
chr (1): brand
dbl (8): id, mois, prot, fat, ash, sodium, carb, cal
βΉ Use `spec()` to retrieve the full column specification for this data.
βΉ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

How different are each of the different brands of pizzas analyzed overall?

## 1. Run a PCA

Letβs take a look at this new dataset

`::kable(head(pizza)) knitr`

brand | id | mois | prot | fat | ash | sodium | carb | cal |
---|---|---|---|---|---|---|---|---|

A | 14069 | 27.82 | 21.43 | 44.87 | 5.11 | 1.77 | 0.77 | 4.93 |

A | 14053 | 28.49 | 21.26 | 43.89 | 5.34 | 1.79 | 1.02 | 4.84 |

A | 14025 | 28.35 | 19.99 | 45.78 | 5.08 | 1.63 | 0.80 | 4.95 |

A | 14016 | 30.55 | 20.15 | 43.13 | 4.79 | 1.61 | 1.38 | 4.74 |

A | 14005 | 30.49 | 21.28 | 41.65 | 4.82 | 1.64 | 1.76 | 4.67 |

A | 14075 | 31.14 | 20.23 | 42.31 | 4.92 | 1.65 | 1.40 | 4.67 |

```
<- prcomp(pizza[,-c(1:2)],
pizza_pca center = TRUE,
scale = TRUE)
```

We can also look at the output of our PCA in a different way using the function `summary()`

.

`summary(pizza_pca) `

```
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.042 1.5134 0.64387 0.3085 0.16636 0.01837 0.003085
Proportion of Variance 0.596 0.3272 0.05922 0.0136 0.00395 0.00005 0.000000
Cumulative Proportion 0.596 0.9232 0.98240 0.9960 0.99995 1.00000 1.000000
```

We can convert this summary into something later usable by extraction the element `importance`

from `summary(alkaloids_pca)`

and converting it to a dataframe.

```
<- summary(pizza_pca)$importance %>%
importance as.data.frame()
::kable(head(importance)) knitr
```

PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | |
---|---|---|---|---|---|---|---|

Standard deviation | 2.042494 | 1.513426 | 0.6438652 | 0.3085032 | 0.1663641 | 0.0183741 | 0.0030853 |

Proportion of Variance | 0.595970 | 0.327210 | 0.0592200 | 0.0136000 | 0.0039500 | 0.0000500 | 0.0000000 |

Cumulative Proportion | 0.595970 | 0.923180 | 0.9824000 | 0.9960000 | 0.9999500 | 1.0000000 | 1.0000000 |

By looking at the summary we can see, for example, that the first two PCs explain 92.32% of variance.

## 2. Make a scree plot of the percent variance explained by each component

### Using `fviz_eig()`

We can do this quickly using `fviz_eig()`

.

`fviz_eig(pizza_pca)`

### Manually

If you wanted to make a scree plot manually, you could by plotting using a wrangled version of the `importance`

dataframe we made earlier.

```
# pivot longer
<- importance %>%
importance_tidy rownames_to_column(var = "measure") %>%
pivot_longer(cols = PC1:PC7,
names_to = "PC",
values_to = "value")
# plot
%>%
importance_tidy filter(measure == "Proportion of Variance") %>%
ggplot(aes(x = PC, y = value)) +
geom_col(alpha = 0.1, color = "black") +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
labs(x = "Principal component",
y = "Percent variance explained",
title = "Pizza scree plot")
```

## 3. Make a scores plot of samples, coloring each sample by its brand

### Using `fviz_pca_ind()`

We can also look at a scores plot using `fviz_pca_ind()`

where ind means individuals. Here, each point is a sample.

`fviz_pca_ind(pizza_pca)`

#### Manually

We want to plot the scores, which are in provided in `pizza_pca$x`

.

We can convert the list into a dataframe of scores values by using `as.data.frame()`

. Then we can bind back our relevant metadata so theyβre all together. Note, to use `bind_cols()`

both datasets need to be in the same order. In this case they are so we are good.

```
# create a df of alkaloids_pca$x
<- as.data.frame(pizza_pca$x)
scores_raw
# bind meta-data
<- bind_cols(pizza[,1], # first columns
scores scores_raw)
```

```
# create objects indicating percent variance explained by PC1 and PC2
<- round((importance[2,1])*100, # index 2nd row, 1st column, times 100
PC1_percent 1) # round to 1 decimal
<- round((importance[2,2])*100, 1)
PC2_percent
# plot
<- scores %>%
(scores_plot ggplot(aes(x = PC1, y = PC2, fill = brand)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point(shape = 21, color = "black") +
scale_fill_brewer(palette = "Set3") +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Scores Plot of Pizza Proximate Analysis Across 10 Brands",
fill = "Brand"))
```

## 4. Make a loadings plot of samples

### Using `fviz_pca_var()`

We can also look at a loadings plot using `fviz_pca_var()`

where var means variables. Here, each point is a variable.

`fviz_pca_var(pizza_pca)`

### Manually

We can also make a more customized loadings plot manually using ggplot and using the dataframe `alkaloids_pca$rotation`

.

```
# grab raw loadings, without any metadata
<- as.data.frame(pizza_pca$rotation)
loadings_raw
# convert rownames to column
<- loadings_raw %>%
loadings rownames_to_column(var = "analysis_type")
# create vector of labels as we want them to appear
<- c("Moisture",
analysis_type_labels "Protein",
"Fat",
"Ash",
"Sodium",
"Carbo-\nhydrates",
"Calories")
```

We can then plot with ggplot like normal.

```
<- loadings %>%
(loadings_plot ggplot(aes(x = PC1, y = PC2, label = analysis_type_labels)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point() +
geom_label_repel() +
scale_fill_brewer() +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Loadings Plot for Pizza"))
```

## 5. Create either a biplot, or a visualization that shows both your scores and loadings plot together.

## With `patchwork`

`+ loadings_plot scores_plot `

## Biplot

### Using `fviz_pca()`

.

You can make a biplot quickly with `fviz_pca()`

. Note, `fviz_pca_biplot()`

and `fviz_pca()`

are the same.

`fviz_pca(pizza_pca)`

Instead of making this plot manually, letβs go through how to alter the existing plot made with `fviz_pca()`

. We can do this because `factoextra`

creates ggplot objects. To start off, we need to be using a dataframe that includes our metadata.

```
# save as a new df
<- pizza_pca
pizza_pca_labelled
# assign alkaloid_labels to rownames
rownames(pizza_pca_labelled$rotation) <- analysis_type_labels
# plot
fviz_pca(pizza_pca_labelled, # pca object
label = "var",
repel = TRUE,
geom.var = c("text", "point", "arrow"),
col.var = "black") +
geom_point(aes(fill = pizza$brand), shape = 21) +
scale_fill_brewer(palette = "Set3") +
theme_minimal() +
theme(legend.position = "none") +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Pizza Biplot",
fill = "Brand")
```

### Manually

Or we could do this manually. First we need to scale our data so that the scores and loadings are on the same scale.

I can write a quick function to allow normalization.

`<- function(x) return((x - min(x))/(max(x) - min(x))) normalize `

Then I can nornalize the scores using the scale function, since the loadings are already normalized.

```
<- scores %>%
scores_normalized mutate(PC1_norm = scale(normalize(PC1), center = TRUE, scale = FALSE)) %>%
mutate(PC2_norm = scale(normalize(PC2), center = TRUE, scale = FALSE)) %>%
select(brand, PC1_norm, PC2_norm, everything()) # reorder
```

How did it go? PC1_norm and PC2_norm should all now be between -1 and 1

`head(scores_normalized) # looks good`

```
# A tibble: 6 Γ 10
brand PC1_norm[,1] PC2_norm[,1] PC1 PC2 PC3 PC4 PC5 PC6
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.653 0.475 5.00 2.67 0.0393 -0.144 0.284 -0.00233
2 A 0.654 0.448 5.02 2.53 0.0969 -0.353 0.215 0.00295
3 A 0.626 0.474 4.80 2.67 0.0753 0.108 -0.0350 0.00541
4 A 0.582 0.405 4.46 2.28 0.120 0.0539 0.174 0.00562
5 A 0.583 0.383 4.46 2.16 0.000736 -0.117 0.313 0.00169
6 A 0.587 0.384 4.50 2.16 0.175 -0.115 0.200 0.00518
# βΉ 1 more variable: PC7 <dbl>
```

Now we can plot together the scores and loadings in one plot.

```
%>%
scores_normalized ggplot() +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_point(aes(x = PC1_norm, y = PC2_norm, fill = brand), shape = 21) +
geom_point(data = loadings, aes(x = PC1, y = PC2)) +
geom_text_repel(data = loadings,
aes(x = PC1, y = PC2, label = analysis_type_labels)) +
scale_fill_brewer(palette = "Set3") +
theme_minimal() +
labs(x = glue("PC1: {PC1_percent}%"),
y = glue("PC2: {PC2_percent}%"),
title = "PCA Biplot of Pizza Proximate Analysis Across 10 Brands",
fill = "Brand",
caption = "Proximate analyses in black are loadings")
```