These are my notes for the third and final tutorial of useR2018, and the
tutorial I was looking forward to the most. I *struggle* with missing value
imputation. It’s one of those things which I kind of get the theory of, but
fall over when trying to *do*. So I was keen to hear
Julie Joss and
Nick Tierney talk about their techniques and
associated R packages.

Your dataset with missing values after mean imputation. https://t.co/uGbderDUPk

— Andreas Brandmaier (@brandmaier) July 6, 2018

I ran into a wall pretty early on here in that I wasn’t very comfortable with Principal Component Analysis (PCA). I took the opportunity to learn a bit more about this common technique, and try to understand the intuition behind it.

The data and slides for Julie’s and Nick’s tutorial are available on Nick’s GitHub.

Required packages:

`install.packages(c("tidyverse", "ggplot2", "naniar", "visdat", "missMDA"))`

`Ozone`

data set

The data in use today is the Ozone data set from Airbreizh, a French association
that monitors air quality. We’re only going to focus on the quantitative
variables here, so we will drop the `WindDirection`

variable.

```
ozone <- read_csv("ozoneNA.csv") %>%
select(-X1, -WindDirection) # unnecessary row index
```

```
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## maxO3 = col_integer(),
## T9 = col_double(),
## T12 = col_double(),
## T15 = col_double(),
## Ne9 = col_integer(),
## Ne12 = col_integer(),
## Ne15 = col_integer(),
## Vx9 = col_double(),
## Vx12 = col_double(),
## Vx15 = col_double(),
## maxO3v = col_integer(),
## WindDirection = col_character()
## )
```

## Patterns of missing data

The easiest way to deal with missing data is to delete it. But this ignores any pattern in the missing data, as well as any mechanism that leads to missing values. Some data sets can also contain a majority of values that have at least one missing value, and so deleting missing values would delete most of the data!

Missing values occur in three main patterns based on their relationship with the observed and even the unobserved variables of the data:

- Missing Completely at Random (
**MCAR**): probability of data being missing is independent of the observed and unobserved variables. - Missing at Random (
**MAR**): probability is not independent of the observed values (ie. is not random) but the observed variables do not fully account for the pattern. In this case, there may be unobserved variables affecting the probability that the data will be missing. - Missing not at random (
**MNAR**): probability of data being missing depends on the observed values of the data.

## Visualising missing data

*Multiple correspondence analysis* visualises data in such a way that
relationships between missing values are often made apparent. One implementation
of this is the `naniar`

package:

`vis_miss(ozone)`

We can repeat the visualisation with an option to `cluster`

the missing values,
making it easier to spot patterns, if any.

`vis_miss(ozone, cluster = TRUE)`

## Dealing with missing values

Suppose we *don’t* want to delete our missing data and pretend that everything
is fine. Then we might look to *impute* the missing values. That is to say, we
might want to look at the data we do have to determine what the missing values
might have been.

One of the easiest methods of imputation is that of *mean imputation*. In this
case, we replace all of the missing values by the mean of the present values of
the given variable.

We could also define a model on the present values of the data to predict what the missing value might have been. A simple linear regression might suffice.

A more sophisticated method involves the use of Principal Component Analysis.

## A refresher on PCA

I don’t understand Principal Component Analysis (PCA). Believe me, I’ve tried.

The goal of PCA is to find the subspace that best represents the data. The
outcome of PCA is a set of uncorrelated variables called *principal
components*.

This is a wonderful (but long) article on PCA (pdf). It’s gentle without being patronising. I’ll highlight some points in the article here, but it’s well worth finding the time to read the article in full.

First of all, recall the definition of an *eigenvector*. You probably
encountered this in your first or second year of university, and then promptly
forgot it. In the example below, the column vector \((6, 4)^T\) is an eigenvector
of the square matrix, because the result of the matrix multiplication is a scale
multiple of \((6, 4)^T\). In other words, the square matrix makes the eigenvector
*bigger*, but it doesn’t change its *direction*. The scalar multiple is \(4\), and
that’s the *eigenvalue* associated with the eigenvector.

\[
\left(\begin{array}{cc}
2 & 3 \\
2 & 1
\end{array}\right) \times
\left(\begin{array}{c}
6 \\
4
\end{array}\right) =
\left(\begin{array}{c}
24 \\
16
\end{array}\right) =
4 \left(\begin{array}{c}
6 \\
4
\end{array}\right)
\]
But how does this relate to stats? Well a data set gives rise to a *covariance
matrix*, in which the \(ij\)th cell of the matrix is the covariance between the
\(i\)th variable and the \(j\)th variable (the variance of each variable sits along
the diagonal). This is a square matrix, so we can find some eigenvectors.

I don’t remember a whole lot of the linear algebra I learnt a long time ago, so I had to be reminded of two quick theorems in play here. A covariance matrix is symmetric, so we can make use of the following:

- Eigenvectors of real symmetric matrices are real
- Eigenvectors of real symmetric matrices are orthogonal

doing math:

— Aleksandra Sobieska (@combinatola) May 17, 2018

step 1: relief when your problem reduces to linear algebra

step 2: panic when you realize you somehow don't actually know any linear algebra

Let’s go through
an example. Take the classic `mtcars`

data set, and draw a scatterplot between
`wt`

(weight) and `mpg`

(miles per gallon). We’re going to work on data that’s
been centered and rescaled so as to make the eigenvectors look right, as well
as for other reasons that will become apparent soon.

```
mtcars_scaled <- mtcars %>%
select(wt, mpg) %>%
mutate_all(scale)
mtcars_scaled %>% ggplot(aes(x = wt, y = mpg)) +
geom_point() +
coord_fixed() # makes the plot square
```

Let’s calculate the PCA eigenvectors. These are eigenvectors of the covariance
matrix. Because we have two variables we have two eigenvectors (another property
of symmetric matrices), which we’ll call `PC1`

and `PC2`

. You don’t need a
install a package to calculate these, as we can use the preinstalled `prcomp`

function.

```
PC <- prcomp(~ wt + mpg, data = mtcars_scaled)
PC
```

```
## Standard deviations (1, .., p=2):
## [1] 1.3666233 0.3637865
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## wt 0.7071068 0.7071068
## mpg -0.7071068 0.7071068
```

```
PC1 <- PC$rotation[,1]
PC2 <- PC$rotation[,2]
mtcars_scaled %>% ggplot(aes(x = wt, y = mpg)) +
geom_point() +
geom_segment(aes(x = 0, xend = PC1[["wt"]], y = 0, yend = PC1[["mpg"]]), arrow = arrow(length = unit(0.5, "cm")), col = "red") +
geom_segment(aes(x = 0, xend = PC2[["wt"]], y = 0, yend = PC2[["mpg"]]), arrow = arrow(length = unit(0.5, "cm")), col = "red") +
coord_fixed() # makes the plot square
```

Look at those eigenvectors! The eigenvectors used in PCA are always normalised,
so that the magnitude of the vectors are all \(1\). That is, we have some
mutually orthogonal unit vectors…it’s a **change of basis**! Any point in the
data set can be described by the original \(x\) and \(y\) axes, but it can also be
described by a linear combination of our new PC eigenvectors!

So we haven’t done anything to the data yet, apart from some basic scaling and
centering. But every eigenvector has an associated eigenvalue. In this case, the
higher the eigenvalue, the more the eigenvector is stretched by the (modified)
covariance matrix. That is to say, the higher the eigenvalue, **the more variance
explained by the eigenvector**. This is why we had to rescale the data so that
each variable starts with a variance of \(1\)—if we didn’t, the variables with
higher variance, such as `mpg`

, would appear artifically more important.

`summary(PC)`

```
## Importance of components:
## PC1 PC2
## Standard deviation 1.3666 0.36379
## Proportion of Variance 0.9338 0.06617
## Cumulative Proportion 0.9338 1.00000
```

PCA’s primary use is for dimension reduction. You can use the ranking of the eigenvalues to determine what eigenvectors contribute the most to the variance of the data. You’ll drop a dimension by losing some of your information, but it will be the least valuable information.

## PCA imputation

At this point I had re-learnt enough PCA to get by, Time to revisit the matter of missing data!

Here’s how this works. First we need to choose a number of dimensions \(S\) to keep at each step (more about this later).

- Start with mean imputation, which assigns the variable mean to every missing value.
- Apply PCA and keep \(S\) dimensions.
- Re-impute the missing values with the new mean.
- Repeat steps 2 and 3 until the values converge.

The `missMDA`

package handles this for us. First we need to work out how many
dimensions we want to keep when doing our PCA. The `missMDA`

package contains
an `estim_ncpPCA`

function that helps us determine the value of \(S\) that
minimises the **mean squared error of prediction (MSEP)**.

`nb <- estim_ncpPCA(ozone, method.cv = "Kfold")`

```
ggplot(data = NULL, aes(x = 0:5, y = nb$criterion)) +
geom_point() +
geom_line() +
xlab("nb dim") +
ylab("MSEP")
```

We can see that the MSEP is minimised when the \(S\) is 2, a
number we can also access with `nb$ncp`

. We can now perform the actual PCA
imputation. This doesn’t work with tibbles, so I convert the data to a data
frame before piping it into the impute function.

`ozone_comp <- ozone %>% as.data.frame %>% imputePCA(ncp = nb$ncp)`

Now let’s compare the original data to the complete data. The `imputePCA`

function returns a list, which we convert to a tibble. There are also some
new columns containing fitted values, which we’ll set aside for an easier
comparison.

We’ll also remove the `completedObs.`

prefix of the variable
names. The regex pattern used here, `".*(?<=\\.)"`

, matches everything up to
and including the last dot in the column names. I use this pattern * a lot*.
Note that you’ll need `perl = TRUE`

to use the lookahead.

`ozone %>% head`

```
## # A tibble: 6 x 11
## maxO3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9 Vx12 Vx15 maxO3v
## <int> <dbl> <dbl> <dbl> <int> <int> <int> <dbl> <dbl> <dbl> <int>
## 1 87 15.6 18.5 NA 4 4 8 0.695 -1.71 -0.695 84
## 2 82 NA NA NA 5 5 7 -4.33 -4 -3 87
## 3 92 15.3 17.6 19.5 2 NA NA 2.95 NA 0.521 82
## 4 114 16.2 19.7 NA 1 1 0 NA 0.347 -0.174 92
## 5 94 NA 20.5 20.4 NA NA NA -0.5 -2.95 -4.33 114
## 6 80 17.7 19.8 18.3 6 NA 7 -5.64 -5 -6 94
```

```
ozone_comp %>%
as.data.frame %>% # seems to need this intermediate step?
as_tibble %>%
select(contains("complete")) %>%
rename_all(gsub, pattern = ".*(?<=\\.)", replacement = '', perl = TRUE) %>%
head
```

```
## # A tibble: 6 x 11
## maxO3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9 Vx12 Vx15 maxO3v
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 87 15.6 18.5 20.5 4 4 8 0.695 -1.71 -0.695 84
## 2 82 18.5 20.9 21.8 5 5 7 -4.33 -4 -3 87
## 3 92 15.3 17.6 19.5 2 3.98 3.81 2.95 1.95 0.521 82
## 4 114 16.2 19.7 24.7 1 1 0 2.04 0.347 -0.174 92
## 5 94 19.0 20.5 20.4 5.29 5.27 5.06 -0.5 -2.95 -4.33 114
## 6 80 17.7 19.8 18.3 6 7.02 7 -5.64 -5 -6 94
```

## I missed a lot

Because I was catching up on PCA, I missed a fair chunk of the tutorial. In particular, I missed a whole discussion on evaluating how well the missing data was imputed. I also missed some stuff on random forests, and I love random forests!

But I learnt a tonne, and I’m grateful for the opportunity to dig into two topics I usually struggle with: PCA and missing value imputation. Thank you to Julie and Nick for the tutorial.

Now, onward to the official launch of the #useR2018!

## Share this post

Twitter

Google+

Facebook

Reddit

LinkedIn

StumbleUpon

Pinterest

Email