Model training

David Neuzerling

2019-12-11

This package exists to explore the concept of creating a machine learning model as an R package, similar to the established concept of an analysis as an R package. The idea here is that, using vignettes, we can train the model by installing the package. The functions in the package then allow the user to score new data with the trained model. To demonstrate this I’ve created an extremely simple sentiment analysis model based on review data from the UCI Machine Learning Repository.

I thought this might work because of a few things:

However, I have my doubts:

No matter what, I think these sorts of projects have to be shared, even if I don’t think that this is a major success!

A quick shout out for the excellent book on R packages by Hadley Wickham. It’s well worth keeping bookmarked.

library(dplyr)
library(ggplot2)
library(text2vec)
library(tidytext)
library(randomForest)

knitr::opts_chunk$set(echo = TRUE, cache = FALSE)

package_root <- dirname(getwd()) # dirname moves up a directory
devtools::load_all(package_root)

Data load

I haven’t kept the data in this git repository, opting instead to download it if it doesn’t already exist. It’s a fairly small data set though (3000 rows).

download_data is a package function that downloads and unzips the source data into the inst/extdata directory (creating it if necessary). On package compilation, everything in the inst folder is moved up to the root directory of the package, and so we can find the extdata directory in the finished product.

extdata <- file.path(package_root, "inst", "extdata")
if (!dir.exists(file.path(extdata, "sentiment labelled sentences"))) {
  data_source_url <- paste0("https://archive.ics.uci.edu",
                            "/ml/machine-learning-databases/00331",
                            "/sentiment%20labelled%20sentences.zip")
  download_data(data_source_url, extdata)
}

Data is loaded in with another custom function, read_review_file. This is just readr::read_tsv with some special options to cover the pecularities of the raw data. All of these custom functions are documented and stored in the R directory. Once the package is installed, function manuals can be called in the usual way (eg. ?read_review_file).

This is a simple analysis, so let’s just stick to discrete categories for sentiment: “good” and “bad”. I don’t care too much about how the model performs, as long as it functions.

data_files <- c("amazon_cells_labelled.txt",
                "imdb_labelled.txt",
                "yelp_labelled.txt") %>% 
  file.path(extdata, "sentiment labelled sentences", .)
reviews <- data_files %>% 
  purrr::map(read_review_file) %>%
  purrr::reduce(rbind) %>% 
  mutate(
    sentiment = ifelse(sentiment == 1, "good", "bad")
  )
reviews %>% head
## # A tibble: 6 x 2
##   review                                                          sentiment
##   <chr>                                                           <chr>    
## 1 So there is no way for me to plug it in here in the US unless … bad      
## 2 Good case, Excellent value.                                     good     
## 3 Great for the jawbone.                                          good     
## 4 Tied to charger for conversations lasting more than 45 minutes… bad      
## 5 The mic is great.                                               good     
## 6 I have to jiggle the plug to get it to line up right to get de… bad

Exploring data

We check for missing data using the naniar package:

reviews %>% naniar::miss_var_summary()
## # A tibble: 2 x 3
##   variable  n_miss pct_miss
##   <chr>      <int>    <dbl>
## 1 review         0        0
## 2 sentiment      0        0

Let’s take a look at which words are the most frequent. First we create a data frame such that each row is an occurrence of a word. Note that we remove stop words — these are words like “the” that are common and usually provide little semantic content to the text.

words <- reviews %>% 
  tidytext::unnest_tokens(
    word, 
    review
  ) %>% 
  anti_join(
    tidytext::stop_words, 
    by = "word"
  )
words %>% head
## # A tibble: 6 x 2
##   sentiment word     
##   <chr>     <chr>    
## 1 bad       plug     
## 2 bad       converter
## 3 good      excellent
## 4 good      jawbone  
## 5 bad       tied     
## 6 bad       charger

Now we’ll plot the mosst frequently occurring words, keeping a note of which words are “good” and which words are “bad”.

words %>%
  count(word, sentiment, sort = TRUE) %>%
  head(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() + 
  # scale_fill_manual(
  #   values = wine_plot_colours
  # ) +
  xlab(NULL) +
  theme(text = element_text(size = 16)) +
  coord_flip() +
  ggtitle("Frequency of words")

There are no surprises here! “Bad” is universally bad and “love” is universally good. It’s comforting to see. We’ll note this and use these words in our unit tests.

I’m not sure what purpose word clouds serve, but they seem almost mandatory.

words %>%
  count(word) %>%
  with(
    wordcloud::wordcloud(
      word, 
      n, 
      max.words = 100
    )
  )

Preprocessing

We need to apply some preprocessing to our text before we can feed it into a model. The first round of preprocessing is simply ignoring case, punctuation and numbers:

text_preprocessor
## function(x) {
##   x %>%
##     tolower %>%
##     tm::removeNumbers() %>%
##     tm::removePunctuation()
## }
## <environment: namespace:ModelAsAPackage>

I’m actually not sure that we should be removing numbers here. We’re dealing with reviews, after all, and a review like “10/10” certainly tells us something about sentiment. But that’s beyond the scope of this package.

The next round of processing involves tokenising our words. This is a process of stripping words down to their base. Another custom function, stem_tokeniser plays this role, by calling on the Porter stemming algorithm:

stem_tokeniser("information informed informing informs")
## [[1]]
## [1] "inform" "inform" "inform" "inform"

Now we’ll define our vocabulary. The vocabulary is the domain of the problem — the words that will go into the model as features. Dimensionality is an issue here, we’’ll prune our vocabulary to include only words that occur a minimum number of times. The vocabulary is subject to domain shift — if an incoming piece of text contains a word that isn’t in the vocabulary, it will be ignored by the model.

We’re going to insist that every word in the vocabulary appears in at least 25 of the reviews.

vocabulary <- create_vocabulary(reviews$review,
                                doc_proportion_min = 25 / nrow(reviews))
vocabulary
## Number of docs: 3000 
## 1149 stopwords: a, a's, able, about, above, according ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##           term term_count doc_count
##  1:        fit         25        25
##  2:        buy         25        25
##  3:       cast         25        25
##  4:      happi         25        25
##  5:    impress         25        25
##  6:    perform         25        25
##  7:     expect         26        26
##  8:       plot         27        26
##  9:        eat         27        27
## 10:        day         27        27
## 11:       quit         27        27
## 12:      worth         27        27
## 13:       mani         28        27
## 14:      actor         28        27
## 15:    restaur         28        27
## 16:       star         28        26
## 17:       wont         28        28
## 18:     doesnt         28        28
## 19:    comfort         29        28
## 20:       piec         29        29
## 21:      enjoy         31        29
## 22:        lot         31        30
## 23:      money         31        31
## 24:     experi         31        31
## 25:       call         32        31
## 26:    perfect         32        32
## 27:      stori         32        32
## 28:       wait         32        26
## 29:       real         32        31
## 30:       poor         33        31
## 31:      peopl         33        32
## 32:    definit         33        33
## 33:       play         34        33
## 34:    terribl         34        34
## 35:      scene         34        32
## 36:      everi         35        32
## 37:      littl         35        35
## 38:    everyth         35        35
## 39:       feel         36        36
## 40:        doe         37        36
## 41:      minut         37        33
## 42:       amaz         37        37
## 43:      worst         39        38
## 44:     friend         40        40
## 45:     pretti         41        40
## 46:        ear         41        36
## 47:      didnt         42        41
## 48:        tri         42        41
## 49:        act         46        46
## 50:    batteri         47        46
## 51:       wast         47        47
## 52:      watch         53        52
## 53:        ani         53        52
## 54:       nice         54        54
## 55:      price         54        54
## 56:      sound         54        51
## 57:    headset         55        53
## 58:         im         56        56
## 59:      excel         56        54
## 60:     becaus         57        57
## 61:    charact         58        54
## 62:        ive         60        56
## 63:    product         61        61
## 64: disappoint         61        61
## 65:  recommend         64        64
## 66:    qualiti         66        65
## 67:       onli         78        77
## 68:       dont         85        82
## 69:       love         93        92
## 70:        bad        101        90
## 71:     realli        103       100
## 72:     servic        107       106
## 73:       food        125       120
## 74:       time        136       134
## 75:      phone        174       165
## 76:       film        184       172
## 77:       movi        208       191
## 78:       veri        243       226
##           term term_count doc_count

Finally, we’ll create a vectoriser using the text2vec package. This allows us to map new text onto the vocabulary we’ve just created.

Actually, much of what you see here uses the text2vec package. I’m fond of this package because it’s designed with the idea that you may need to score new data that comes in after you’ve trained your model, so always need to be able to process new text!

vectoriser <- vocabulary %>% text2vec::vocab_vectorizer()

One quick note, though: the itoken function in this package creates an iterator of tokens. It can be called like this:

tokens <- text2vec::itoken(
  unprocessed_text, 
  preprocessor = text_preprocessor, 
  tokenizer = stem_tokeniser,
  progressbar = FALSE
)

However, I had great trouble with this method, with words not being properly tokenised before making it into the vectoriser. So I’ve done this instead:

processed_text <- stem_tokeniser(text_preprocessor(unprocessed_text))

text2vec::itoken(
  processed_text, 
  progressbar = FALSE
)

I would have thought that the two pieces of code were equivalent, but my unit tests fail with the first example. I’m putting this here as an unknown!

Creating a document term matrix

The input for our model algorithm is a document term matrix. This is a matrix in which every row represents one of our 3000 reviews, and every column uses one of the 78 terms in our vocabulary. We use the map_to_dtm function which allows us to map raw text onto a new dtm.

map_to_dtm
## function(x,
##                        vectoriser = ModelAsAPackage::vectoriser,
##                        tfidf = ModelAsAPackage::tfidf) {
##   processed_text <- stem_tokeniser(text_preprocessor(x))
## 
##   # For some reason, the preprocessor and stem_tokeniser don't take effect if
##   # I put them in the itoken function as values to the relevant arguments.
##   # Please let me know if you understand why this is!
##   tokens <- text2vec::itoken(
##     processed_text,
##     progressbar = FALSE
##   )
## 
##   # If the input contains no terms corresponding the to vocabulary used to
##   # generate the vectoriser, then a warning will occur for an empty dtm.
##   # Since this is a plausible scenario, we suppress the warning.
##   suppressWarnings(
##     dtm <- text2vec::create_dtm(tokens, vectoriser)
##   )
## 
##   if (!is.null(tfidf)) {
##     dtm <- tfidf$transform(dtm)
##   }
## 
##   return(dtm)
## }
## <environment: namespace:ModelAsAPackage>

You’ll notice that tfidf argument. This stands for term frequency inverse document frequency. Informally, we want to weight every word as more important if it occurs often, and less important if it occurs in many documents. This is exactly what tfidf does. Let’s start with an unweighted matrix so we can see the effect.

dtm_unweighted <- map_to_dtm(reviews$review,
                             vectoriser = vectoriser,
                             tfidf = NULL)
paste0('> ', reviews$review[3000]) %>% cat

Then, as if I hadn’t wasted enough of my life there, they poured salt in the wound by drawing out the time it took to bring the check.

tail(as.matrix(dtm_unweighted)[3000,], 21)
##         im      excel     becaus    charact        ive    product 
##          0          0          0          0          0          0 
## disappoint  recommend    qualiti       onli       dont       love 
##          0          0          0          0          0          0 
##        bad     realli     servic       food       time      phone 
##          0          0          0          0          1          0 
##       film       movi       veri 
##          0          0          0

Now we fit our tfidf. Be careful here: fit_transform is a method which says “use this data to define the tfidf, and then transform the input by that tfidf. This is distinct from transform which says”use a tfidf that’s already been fitted to transform this data". This terminology is more familiar to Python users than it is to R users, and I occasionally see it tripping people up (especially on Kaggle). The rule of thumb is that you use fit_transform for your training data and transform for your test data, or any new data that you encounter.

tfidf <- text2vec::TfIdf$new()
dtm_tf_idf <- tfidf$fit_transform(dtm_unweighted)

Let’s take the same example before, but now with a weighted document term matrix:

tail(as.matrix(dtm_tf_idf)[3000,], 21)
##         im      excel     becaus    charact        ive    product 
##   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000 
## disappoint  recommend    qualiti       onli       dont       love 
##   0.000000   0.000000   0.000000   0.000000   0.000000   0.000000 
##        bad     realli     servic       food       time      phone 
##   0.000000   0.000000   0.000000   0.000000   1.550546   0.000000 
##       film       movi       veri 
##   0.000000   0.000000   0.000000

Training a random forest

With all of the effort we put into preprocessing, the model training step is relatively straightforward! Document term matrices are stored as a special class of sparse matrix, because there are computational techniques to efficiently store and use matrices in which the vast majority of entries are 0. However, this format isn’t accepted by the randomForest algorithm. Fortunately, with only 3000 rows and 78 columns, we don’t have to worry too much about computational efficiency.

review_rf <- randomForest::randomForest(
  x = as.matrix(dtm_tf_idf),
  y = factor(reviews$sentiment),
  ntree = 500
)

While I don’t want to invest much time in the model itself, we can at least look at how it performs, and which terms it considers the most important:

review_rf
## 
## Call:
##  randomForest(x = as.matrix(dtm_tf_idf), y = factor(reviews$sentiment),      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 33.7%
## Confusion matrix:
##       bad good class.error
## bad  1167  333       0.222
## good  678  822       0.452
randomForest::varImpPlot(review_rf)

Artefact output

Now we recall that we’re actually creating a package. We want all of the work that we’ve done so far to be included in the final result. Fortunately, our custom functions are in the R directory, so they’ll persist when the package is compiled. We need three other objects (all relating to our trained model) to be available as well: our random forest review_rf, our vectoriser, and the tfidf we used to weight our training data (and will reuse for new data). These are all (sparsely) documented with their own entries in the R directory.

usethis::use_data(review_rf,
                  vectoriser,
                  tfidf,
                  pkg = package_root,
                  overwrite = TRUE)