
These are my notes for the third and final tutorial of useR2018, and the tutorial I was looking forward to the most. I struggle with missing value imputation. It’s one of those things where I kind of get the theory, but fall over when trying to apply it. So I was keen to hear Julie Josse and Nick Tierney talk about their techniques and associated R packages.

I ran into a wall pretty early on here in that I wasn’t very comfortable with Principal Component Analysis (PCA). I took the opportunity to learn a bit more about this common technique, and try to understand the intuition behind it.

The data and slides for Julie’s and Nick’s tutorial are available on Nick’s GitHub.

Required packages:

install.packages(c("tidyverse", "ggplot2", "naniar", "visdat", "missMDA"))

Ozone data set

The data in use today is the Ozone data set from Airbreizh, a French association that monitors air quality. We’re only going to focus on the quantitative variables here, so we will drop the WindDirection variable.

ozone <- read_csv("ozoneNA.csv") %>% 
    select(-X1, -WindDirection) # drop the unnecessary row index and the qualitative variable
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   maxO3 = col_integer(),
##   T9 = col_double(),
##   T12 = col_double(),
##   T15 = col_double(),
##   Ne9 = col_integer(),
##   Ne12 = col_integer(),
##   Ne15 = col_integer(),
##   Vx9 = col_double(),
##   Vx12 = col_double(),
##   Vx15 = col_double(),
##   maxO3v = col_integer(),
##   WindDirection = col_character()
## )

Patterns of missing data

The easiest way to deal with missing data is to delete it. But this ignores any pattern in the missing data, as well as any mechanism that leads to missing values. In some data sets the majority of rows have at least one missing value, so deleting every incomplete row would delete most of the data!
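As a quick sanity check (a sketch of my own, assuming the ozone data loaded above), we can measure how much data listwise deletion would actually throw away:

```r
# Proportion of rows with at least one missing value:
# this is the fraction of the data that listwise deletion discards
mean(!complete.cases(ozone))

# Compare with the proportion of individual cells that are missing
mean(is.na(ozone))
```

The gap between the two numbers is the point: even a modest per-cell missingness rate can wipe out most rows.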

Missing values arise through three main mechanisms, defined by how the probability of missingness relates to the observed and unobserved values of the data:

  • Missing Completely at Random (MCAR): the probability of a value being missing is independent of both the observed and the unobserved values.
  • Missing at Random (MAR): the probability of a value being missing depends on the observed values (so it is not completely random), but not on the unobserved values themselves.
  • Missing Not at Random (MNAR): the probability of a value being missing depends on the unobserved values themselves, for example when the highest readings are the ones most likely to go unrecorded.
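To make the distinction concrete, here is a toy simulation (my own sketch, not from the tutorial) that generates the three mechanisms for a variable y alongside a fully observed x:

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- x + rnorm(n)

# MCAR: missingness is independent of both x and y
y_mcar <- ifelse(runif(n) < 0.3, NA, y)

# MAR: missingness depends only on the observed x
y_mar <- ifelse(runif(n) < plogis(2 * x), NA, y)

# MNAR: missingness depends on the unobserved value of y itself
y_mnar <- ifelse(runif(n) < plogis(2 * y), NA, y)

# Under MAR and MNAR the observed mean of y is biased downwards,
# because large values are the ones most likely to be removed
c(true = mean(y),
  mcar = mean(y_mcar, na.rm = TRUE),
  mar  = mean(y_mar,  na.rm = TRUE),
  mnar = mean(y_mnar, na.rm = TRUE))
```

This is also why deletion is dangerous: under anything other than MCAR, the rows that survive are not a representative sample.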

Visualising missing data

Visualising where the values are missing, and whether they cluster together, often makes relationships between missing values apparent. One implementation of this is the vis_miss function from the naniar package:

vis_miss(ozone)

We can repeat the visualisation with an option to cluster the missing values, making it easier to spot patterns, if any.

vis_miss(ozone, cluster = TRUE)

Dealing with missing values

Suppose we don’t want to delete our missing data and pretend that everything is fine. Then we might look to impute the missing values. That is to say, we might want to look at the data we do have to determine what the missing values might have been.

One of the easiest methods of imputation is that of mean imputation. In this case, we replace all of the missing values by the mean of the present values of the given variable.
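As a sketch (assuming the ozone data from above, and using current dplyr syntax), mean imputation is a one-liner per variable:

```r
library(dplyr)

# Replace every NA with the mean of the observed values in its column
ozone_mean_imputed <- ozone %>% 
    mutate(across(everything(), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
```

Mean imputation preserves each variable’s mean, but it artificially shrinks the variance and weakens correlations between variables, which is exactly what motivates the more careful methods that follow.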

We could also define a model on the present values of the data to predict what the missing value might have been. A simple linear regression might suffice.
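For example (a sketch of my own, with an arbitrary choice of variables from this data set), we could predict the missing midday temperatures T12 from the morning temperatures T9:

```r
# lm drops incomplete rows by default, so this fits on the complete pairs
fit <- lm(T12 ~ T9, data = ozone)

# Impute T12 wherever it is missing but the predictor T9 is observed
needs_impute <- is.na(ozone$T12) & !is.na(ozone$T9)
ozone$T12[needs_impute] <- predict(fit, newdata = ozone[needs_impute, ])
```

Note the limitation: a single regression can only fill in rows where its predictor happens to be observed, which is one reason to prefer a method that uses all variables at once.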

A more sophisticated method involves the use of Principal Component Analysis.

A refresher on PCA

I don’t understand Principal Component Analysis (PCA). Believe me, I’ve tried.

The goal of PCA is to find the lower-dimensional subspace that best represents the data. The outcome of PCA is a set of uncorrelated variables called principal components.

This is a wonderful (but long) article on PCA (pdf). It’s gentle without being patronising. I’ll highlight some points in the article here, but it’s well worth finding the time to read the article in full.

First of all, recall the definition of an eigenvector. You probably encountered this in your first or second year of university, and then promptly forgot it. In the example below, the column vector \((6, 4)^T\) is an eigenvector of the square matrix, because the result of the matrix multiplication is a scalar multiple of \((6, 4)^T\). In other words, the square matrix makes the eigenvector longer, but it doesn’t change its direction. The scalar multiple is \(4\), and that’s the eigenvalue associated with the eigenvector.

\[ \left(\begin{array}{cc} 2 & 3 \\ 2 & 1 \end{array}\right) \times \left(\begin{array}{c} 6 \\ 4 \end{array}\right) = \left(\begin{array}{c} 24 \\ 16 \end{array}\right) = 4 \left(\begin{array}{c} 6 \\ 4 \end{array}\right) \] But how does this relate to stats? Well a data set gives rise to a covariance matrix, in which the \(ij\)th cell of the matrix is the covariance between the \(i\)th variable and the \(j\)th variable (the variance of each variable sits along the diagonal). This is a square matrix, so we can find some eigenvectors.
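We can verify this example directly in R with the base eigen function:

```r
A <- matrix(c(2, 2, 3, 1), nrow = 2)  # column-major: rows are (2, 3) and (2, 1)

A %*% c(6, 4)    # (24, 16): four times the original vector

eigen(A)$values  # 4 and -1; 4 is the eigenvalue from the example above
```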

I don’t remember a whole lot of the linear algebra I learnt a long time ago, so I had to be reminded of two quick theorems in play here. A covariance matrix is symmetric, so we can make use of the following:

  • Eigenvalues of real symmetric matrices are real
  • Eigenvectors of a real symmetric matrix corresponding to distinct eigenvalues are orthogonal

Let’s go through an example. Take the classic mtcars data set, and draw a scatterplot between wt (weight) and mpg (miles per gallon). We’re going to work on data that’s been centered and rescaled so as to make the eigenvectors look right, as well as for other reasons that will become apparent soon.

mtcars_scaled <- mtcars %>% 
    select(wt, mpg) %>% 
    scale %>% # centre each variable and rescale it to unit variance
    as_tibble

mtcars_scaled %>% ggplot(aes(x = wt, y = mpg)) + 
    geom_point() + 
    coord_fixed() # makes the plot square

Let’s calculate the PCA eigenvectors. These are eigenvectors of the covariance matrix. Because we have two variables we have two eigenvectors (another property of symmetric matrices), which we’ll call PC1 and PC2. You don’t need to install a package to calculate these, as the prcomp function comes preinstalled with base R.

PC <- prcomp(~ wt + mpg, data = mtcars_scaled)
PC
## Standard deviations (1, .., p=2):
## [1] 1.3666233 0.3637865
## Rotation (n x k) = (2 x 2):
##            PC1       PC2
## wt   0.7071068 0.7071068
## mpg -0.7071068 0.7071068
PC1 <- PC$rotation[,1]
PC2 <- PC$rotation[,2]

mtcars_scaled %>% ggplot(aes(x = wt, y = mpg)) + 
    geom_point() +
    geom_segment(aes(x = 0, xend = PC1[["wt"]], y = 0, yend = PC1[["mpg"]]), arrow = arrow(length = unit(0.5, "cm")), col = "red") + 
    geom_segment(aes(x = 0, xend = PC2[["wt"]], y = 0, yend = PC2[["mpg"]]), arrow = arrow(length = unit(0.5, "cm")), col = "red") + 
    coord_fixed() # makes the plot square

Look at those eigenvectors! The eigenvectors used in PCA are always normalised, so that the magnitude of each vector is \(1\). That is, we have some mutually orthogonal unit vectors…it’s a change of basis! Any point in the data set can be described by the original \(x\) and \(y\) axes, but it can also be described by a linear combination of our new PC eigenvectors!

So we haven’t done anything to the data yet, apart from some basic scaling and centering. But every eigenvector has an associated eigenvalue. In this case, the higher the eigenvalue, the more the eigenvector is stretched by the (modified) covariance matrix. That is to say, the higher the eigenvalue, the more variance is explained by the eigenvector. This is why we had to rescale the data so that each variable starts with a variance of \(1\): if we didn’t, the variables with higher variance, such as mpg, would appear artificially more important.

summary(PC)
## Importance of components:
##                           PC1     PC2
## Standard deviation     1.3666 0.36379
## Proportion of Variance 0.9338 0.06617
## Cumulative Proportion  0.9338 1.00000
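We can make the link between prcomp and the eigendecomposition explicit (a check of my own, assuming the PC and mtcars_scaled objects from above): the squared standard deviations reported by prcomp are exactly the eigenvalues of the covariance matrix.

```r
# The eigenvalues of the covariance matrix...
eigen(cov(mtcars_scaled))$values

# ...match the squared component standard deviations from prcomp
PC$sdev^2

# So the proportion of variance explained is just each eigenvalue
# divided by the sum of all the eigenvalues
PC$sdev^2 / sum(PC$sdev^2)
```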

PCA’s primary use is for dimension reduction. You can use the ranking of the eigenvalues to determine which eigenvectors contribute the most to the variance of the data. Dropping a dimension loses some of your information, but it will be the least valuable information.

PCA imputation

At this point I had re-learnt enough PCA to get by. Time to revisit the matter of missing data!

Here’s how this works. First we need to choose a number of dimensions \(S\) to keep at each step (more about this later).

  1. Start with mean imputation, which assigns the variable mean to every missing value.
  2. Apply PCA to the completed data and keep \(S\) dimensions.
  3. Re-impute the missing values with their fitted values from the rank-\(S\) PCA reconstruction.
  4. Repeat steps 2 and 3 until the imputed values converge.
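The loop above can be sketched in a few lines of base R using the SVD. This is my own illustration of the algorithm, not the missMDA implementation (which also regularises the reconstruction to avoid overfitting):

```r
# Iterative PCA imputation: a minimal sketch.
# X is a numeric matrix (or data frame) with NAs; S is the number of dimensions.
iterative_pca_impute <- function(X, S = 2, max_iter = 100, tol = 1e-6) {
    miss <- is.na(X)

    # Step 1: mean imputation as the starting point
    X_imp <- apply(X, 2, function(col) {
        col[is.na(col)] <- mean(col, na.rm = TRUE)
        col
    })

    for (i in seq_len(max_iter)) {
        # Step 2: PCA via the SVD of the centred data, keeping S dimensions
        mu <- colMeans(X_imp)
        Xc <- sweep(X_imp, 2, mu)
        sv <- svd(Xc, nu = S, nv = S)
        fitted <- sv$u %*% diag(sv$d[1:S], S) %*% t(sv$v)
        fitted <- sweep(fitted, 2, mu, "+")

        # Step 3: re-impute the missing entries with their fitted values
        X_new <- X_imp
        X_new[miss] <- fitted[miss]

        # Step 4: stop once the imputed values converge
        if (sum((X_new - X_imp)^2) < tol) break
        X_imp <- X_new
    }
    X_imp
}
```

The intuition: the low-rank reconstruction borrows strength from the correlated variables, so each pass nudges the imputed cells towards values consistent with the structure of the data.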

The missMDA package handles this for us. First we need to work out how many dimensions we want to keep when doing our PCA. The missMDA package contains an estim_ncpPCA function that helps us determine the value of \(S\) that minimises the mean squared error of prediction (MSEP).

nb <- estim_ncpPCA(ozone, method.cv = "Kfold")
ggplot(data = NULL, aes(x = 0:5, y = nb$criterion)) + 
    geom_point() + 
    geom_line() + 
    xlab("nb dim") + 
    ylab("MSEP")

We can see that the MSEP is minimised when \(S\) is 2, a number we can also access with nb$ncp. We can now perform the actual PCA imputation. This doesn’t work with tibbles, so I convert the data to a data frame before piping it into the impute function.

ozone_comp <- ozone %>% as.data.frame %>% imputePCA(ncp = nb$ncp)

Now let’s compare the original data to the complete data. The imputePCA function returns a list, which we convert to a tibble. There are also some new columns containing fitted values, which we’ll set aside for an easier comparison.

We’ll also remove the completedObs. prefix of the variable names. The regex pattern used here, ".*(?<=\\.)", matches everything up to and including the last dot in the column names. I use this pattern *a lot*. Note that you’ll need perl = TRUE to use the lookbehind.

ozone %>% head
## # A tibble: 6 x 11
##   maxO3    T9   T12   T15   Ne9  Ne12  Ne15     Vx9    Vx12   Vx15 maxO3v
##   <int> <dbl> <dbl> <dbl> <int> <int> <int>   <dbl>   <dbl>  <dbl>  <int>
## 1    87  15.6  18.5  NA       4     4     8   0.695  -1.71  -0.695     84
## 2    82  NA    NA    NA       5     5     7  -4.33   -4     -3         87
## 3    92  15.3  17.6  19.5     2    NA    NA   2.95   NA      0.521     82
## 4   114  16.2  19.7  NA       1     1     0  NA       0.347 -0.174     92
## 5    94  NA    20.5  20.4    NA    NA    NA  -0.5    -2.95  -4.33     114
## 6    80  17.7  19.8  18.3     6    NA     7  -5.64   -5     -6         94
ozone_comp %>% 
    as.data.frame %>% # seems to need this intermediate step?
    as_tibble %>% 
    select(contains("complete")) %>% 
    rename_all(gsub, pattern = ".*(?<=\\.)", replacement = '', perl = TRUE) %>%  
    head
## # A tibble: 6 x 11
##   maxO3    T9   T12   T15   Ne9  Ne12  Ne15    Vx9   Vx12   Vx15 maxO3v
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1    87  15.6  18.5  20.5  4     4     8     0.695 -1.71  -0.695     84
## 2    82  18.5  20.9  21.8  5     5     7    -4.33  -4     -3         87
## 3    92  15.3  17.6  19.5  2     3.98  3.81  2.95   1.95   0.521     82
## 4   114  16.2  19.7  24.7  1     1     0     2.04   0.347 -0.174     92
## 5    94  19.0  20.5  20.4  5.29  5.27  5.06 -0.5   -2.95  -4.33     114
## 6    80  17.7  19.8  18.3  6     7.02  7    -5.64  -5     -6         94

I missed a lot

Because I was catching up on PCA, I missed a fair chunk of the tutorial. In particular, I missed a whole discussion on evaluating how well the missing data was imputed. I also missed some stuff on random forests, and I love random forests!

But I learnt a tonne, and I’m grateful for the opportunity to dig into two topics I usually struggle with: PCA and missing value imputation. Thank you to Julie and Nick for the tutorial.

Now, onward to the official launch of #useR2018!