Exemplar: a prototype R package for data validation

2022-03-20

I’ve been playing around with an idea for a new R package. I call it exemplar and here’s how it works: I provide an example of what data should look like (an exemplar), and the package generates a function that checks whether any new data looks the same. For each column, the generated function checks for duplicate values, missing values, ranges, and more.

The validation function doesn’t have any dependencies at all. I need exemplar to generate it, but not to use it.

In this post I’ll give some examples of how it works and what sort of things are validated.

I doubt I’ll ever submit exemplar to CRAN. What I’ve done here isn’t substantial enough to justify a CRAN submission, and it’s a fairly niche tool. I’m happy to be convinced otherwise, but for now this will stay on GitHub and can be installed with:

remotes::install_github("mdneuzerling/exemplar")

I’ll also be using the tidyselect package for the examples below. I’ll load that now. Most people never load this package directly, but it’s one of the main components of dplyr.

library(tidyselect)

Some examples

The generated validation functions for data frames can get pretty long, since they include checks for each column. To keep things brief I’ll check just the wt and mpg columns of mtcars:

exemplar(mtcars, wt, mpg)
#> validate_mtcars <- function(data) {
#>   stopifnot(exprs = {
#>     is.data.frame(data)
#>     # The data is potentially being subsetted so this assertion has been disabled:
#>     # identical(colnames(data), c("wt", "mpg"))
#> 
#>     "wt" %in% colnames(data)
#>     is.double(data[["wt"]])
#>     !any(is.na(data[["wt"]]) | is.null(data[["wt"]]))
#>     # Duplicate values were detected so this assertion has been disabled:
#>     # !any(duplicated(data[["wt"]]))
#>     min(data[["wt"]], na.rm = TRUE) > 0 # all positive
#>     # (Un)comment or modify the below range assertions if needed:
#>     # max(data[["wt"]], na.rm = TRUE) <= 5.424
#>     # 1.513 <= min(data[["wt"]], na.rm = TRUE)
#>     # (Un)comment or modify the below deviance assertions if needed.
#>     # The mean is 3.22 and the standard deviation is 0.98:
#>     # max(data[["wt"]], na.rm = TRUE) <= 3.22 + 4 * 0.98
#>     # 3.22 - 4 * 0.98 <= max(data[["wt"]], na.rm = TRUE)
#> 
#>     "mpg" %in% colnames(data)
#>     is.double(data[["mpg"]])
#>     !any(is.na(data[["mpg"]]) | is.null(data[["mpg"]]))
#>     # Duplicate values were detected so this assertion has been disabled:
#>     # !any(duplicated(data[["mpg"]]))
#>     min(data[["mpg"]], na.rm = TRUE) > 0 # all positive
#>     # (Un)comment or modify the below range assertions if needed:
#>     # max(data[["mpg"]], na.rm = TRUE) <= 33.9
#>     # 10.4 <= min(data[["mpg"]], na.rm = TRUE)
#>     # (Un)comment or modify the below deviance assertions if needed.
#>     # The mean is 20.09 and the standard deviation is 6.03:
#>     # max(data[["mpg"]], na.rm = TRUE) <= 20.09 + 4 * 6.03
#>     # 20.09 - 4 * 6.03 <= max(data[["mpg"]], na.rm = TRUE)
#>   })
#>   invisible(TRUE)
#> }

It’s pretty comprehensive! And the comments explain what’s going on. I can take this function, modify it, and use it to check any new mtcars-like data.

If any assertion is violated, an error is raised that quotes the offending line of code. If everything checks out then TRUE is returned invisibly. There is a downside here: as soon as a single assertion fails the function stops, so the remaining assertions are not checked.
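To make the failure behaviour concrete, here is a minimal base R sketch. The validate_weights function is hand-written in the style of the generated output rather than produced by exemplar, and the list of separately evaluated assertions at the end is a hypothetical workaround for the "stops at the first failure" limitation, not a feature of the package.

```r
# Hand-written validator in the style of the generated output (an assumption,
# not actual exemplar output).
validate_weights <- function(data) {
  stopifnot(exprs = {
    is.data.frame(data)
    "wt" %in% colnames(data)
    is.double(data[["wt"]])
    !any(is.na(data[["wt"]]))
    min(data[["wt"]], na.rm = TRUE) > 0 # all positive
  })
  invisible(TRUE)
}

# This data has two problems: a missing value and a negative value.
bad <- data.frame(wt = c(2.5, NA, -1))

# stopifnot() stops at the first failing assertion, so only the
# missing-value check is reported.
first_failure <- tryCatch(
  validate_weights(bad),
  error = function(e) conditionMessage(e)
)
print(first_failure)

# Hypothetical workaround: evaluate each assertion on its own and
# collect the names of the ones that fail.
assertions <- list(
  no_missing = quote(!any(is.na(data[["wt"]]))),
  all_positive = quote(min(data[["wt"]], na.rm = TRUE) > 0)
)
passed <- vapply(
  assertions,
  function(assertion) isTRUE(eval(assertion, list(data = bad))),
  logical(1)
)
print(names(passed)[!passed])
```

The first call reports only the missing-value assertion, while the second approach lists both failures at the cost of some extra bookkeeping.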

In the above example I only checked the wt and mpg columns. When I’m validating data I often care about only a few columns. The exemplar function supports tidyselect, just like dplyr. All of the following will work:

exemplar(mtcars, wt, mpg)
exemplar(mtcars, -cyl)
exemplar(mtcars, vs:carb)
exemplar(mtcars, any_of(c("qsec", "notacolumn")))
exemplar(mtcars, starts_with("d"))

The exemplar package also generates validation functions for individual vectors:

exemplar(mtcars$wt)
#> validate_mtcars_wt <- function(data) {
#>   stopifnot(exprs = {
#>     is.double(data)
#>     !any(is.na(data) | is.null(data))
#>     # Duplicate values were detected so this assertion has been disabled:
#>     # !any(duplicated(data))
#>     min(data, na.rm = TRUE) > 0 # all positive
#>     # (Un)comment or modify the below range assertions if needed:
#>     # max(data, na.rm = TRUE) <= 5.424
#>     # 1.513 <= min(data, na.rm = TRUE)
#>     # (Un)comment or modify the below deviance assertions if needed.
#>     # The mean is 3.22 and the standard deviation is 0.98:
#>     # max(data, na.rm = TRUE) <= 3.22 + 4 * 0.98
#>     # 3.22 - 4 * 0.98 <= max(data, na.rm = TRUE)
#>   })
#>   invisible(TRUE)
#> }

Note how the validation function is named after the input. The part of the name after validate_ can be specified with the .function_suffix parameter:

exemplar(runif(100, -10, 10), .function_suffix = "random_numbers")
#> validate_random_numbers <- function(data) {
#>   stopifnot(exprs = {
#>     is.double(data)
#>     !any(is.na(data) | is.null(data))
#>     !any(duplicated(data))
#>     # (Un)comment or modify the below range assertions if needed:
#>     # max(data, na.rm = TRUE) <= 9.95231169741601
#>     # -9.70273485872895 <= min(data, na.rm = TRUE)
#>     # (Un)comment or modify the below deviance assertions if needed.
#>     # The mean is 0.59 and the standard deviation is 5.8:
#>     # max(data, na.rm = TRUE) <= 0.59 + 4 * 5.8
#>     # 0.59 - 4 * 5.8 <= max(data, na.rm = TRUE)
#>   })
#>   invisible(TRUE)
#> }

What’s validated?

The intention is that users will take these validations as a starting point and make adjustments as needed. Some assertions will be commented out by default, with a comment explaining why.

For a vector:

The data type is checked first.
Assertions for no missing or duplicate values are included, but if the input data violates these assertions then the statements will be commented out with an explanation.
The sign is checked. If the input is all positive, non-negative, negative, or non-positive, then an assertion for this will be included.
Range assertions and deviance assertions (based on the number of standard deviations from the mean of the input) are included, but commented out by default.

Alternatively, range assertions can be enabled with the .enable_range_assertions argument and deviance assertions with .enable_deviance_assertions. By default the .allowed_deviance is 4, that is, new data must lie within 4 standard deviations of the mean, based on the statistics of the exemplar. This too can be adjusted.
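The arithmetic behind a deviance assertion can be sketched in base R using the wt column from earlier. The variable names and the within_deviance helper are made up for illustration. The sketch also compares the lower bound against the minimum, which is an assumption about the intent rather than a copy of the generated code.

```r
# Statistics of the exemplar, rounded to two places as in the generated comments.
exemplar_values <- mtcars$wt
exemplar_mean <- round(mean(exemplar_values), 2) # 3.22
exemplar_sd <- round(sd(exemplar_values), 2)     # 0.98
allowed_deviance <- 4

# The band that new data has to stay within.
upper_bound <- exemplar_mean + allowed_deviance * exemplar_sd
lower_bound <- exemplar_mean - allowed_deviance * exemplar_sd

# Hypothetical helper: TRUE if every value lies inside the band.
within_deviance <- function(x) {
  max(x, na.rm = TRUE) <= upper_bound && lower_bound <= min(x, na.rm = TRUE)
}

within_deviance(mtcars$wt)      # TRUE: the exemplar itself passes
within_deviance(c(2.8, 3.1, 9)) # FALSE: 9 is more than 4 standard deviations out
```

A smaller allowed deviance tightens the band, so the choice depends on how much drift in new data is acceptable.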

Assertions for a data frame will include assertions for all of the selected columns, and will also check that those columns are present. There is also an assertion that those columns are the only columns present, but this will be disabled if exemplar is asked to generate a validation function for a selection of columns in the data frame.
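The difference between the two kinds of column check is easy to see in base R. This is a small illustration with made-up variable names, using the same %in% and identical() calls that appear in the generated function above.

```r
expected_columns <- c("wt", "mpg")

subset_data <- mtcars[, c("wt", "mpg")]
reordered_data <- mtcars[, c("mpg", "wt")]

# Presence check: passes even when other columns are present.
all(expected_columns %in% colnames(mtcars)) # TRUE

# Exact check: fails on the full data frame because of the extra columns.
identical(colnames(mtcars), expected_columns) # FALSE

# Exact check: passes only when the columns match in name and order.
identical(colnames(subset_data), expected_columns)    # TRUE
identical(colnames(reordered_data), expected_columns) # FALSE
```

Because identical() is sensitive to column order as well as to extra columns, the stricter assertion only makes sense when the new data is expected to have exactly the same layout as the exemplar.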

How is this different to other data validation packages?

If I have a clear idea of what to validate in a data frame, then I’ll just write the assertions using assertthat. If those assertions are complicated then I’ll use a package like assertr.

The exemplar package doesn’t provide any additional tools for validating data. In fact, it’s deliberately restricted to base R (≥ 3.5) to ensure that the generated functions don’t require any installed packages.

What exemplar does do is generate the validation functions automatically, based on an ideal output. This could be useful for, say, machine learning. Perhaps an exemplar is generated on training data and is used to validate test data, or any new data that needs to be scored.
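As a rough sketch of that workflow, the function below copies the mpg assertions from the generated mtcars validator shown earlier, with the range assertions uncommented. The is_valid wrapper and the two example batches are hypothetical, and the scoring step itself is left out.

```r
# The mpg assertions from the generated validator, with the range
# assertions uncommented.
validate_mpg <- function(data) {
  stopifnot(exprs = {
    is.data.frame(data)
    "mpg" %in% colnames(data)
    is.double(data[["mpg"]])
    !any(is.na(data[["mpg"]]) | is.null(data[["mpg"]]))
    min(data[["mpg"]], na.rm = TRUE) > 0 # all positive
    max(data[["mpg"]], na.rm = TRUE) <= 33.9
    10.4 <= min(data[["mpg"]], na.rm = TRUE)
  })
  invisible(TRUE)
}

# Hypothetical gate in front of a scoring step: report the failed
# assertion and return FALSE instead of stopping the whole script.
is_valid <- function(data) {
  tryCatch(
    validate_mpg(data),
    error = function(e) {
      message("Validation failed: ", conditionMessage(e))
      FALSE
    }
  )
}

new_batch <- data.frame(mpg = c(18.2, 25.1))    # made-up data within the range
drifted_batch <- data.frame(mpg = c(18.2, 250)) # made-up data outside the range

is_valid(new_batch)     # TRUE
is_valid(drifted_batch) # FALSE, with a message naming the failed assertion
```

Since the validator is plain base R, it can be saved alongside a model and run wherever new data is scored, without exemplar installed.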

The image at the top of this page is by Tima Miroshnichenko and is used under the terms of the Pexels License.

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.0 (2021-05-18)
#>  os       macOS Big Sur 11.3
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_AU.UTF-8
#>  ctype    en_AU.UTF-8
#>  tz       Australia/Melbourne
#>  date     2022-03-20
#>  pandoc   2.11.4 @ /Applications/RStudio.app/Contents/MacOS/pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  backports     1.4.1      2021-12-13 [1] CRAN (R 4.1.1)
#>  brio          1.1.3      2021-11-30 [1] CRAN (R 4.1.1)
#>  cachem        1.0.6      2021-08-19 [1] CRAN (R 4.1.1)
#>  callr         3.7.0      2021-04-20 [1] CRAN (R 4.1.0)
#>  cli           3.2.0      2022-02-14 [1] CRAN (R 4.1.1)
#>  crayon        1.5.0      2022-02-14 [1] CRAN (R 4.1.1)
#>  desc          1.4.0      2021-09-28 [1] CRAN (R 4.1.1)
#>  devtools      2.4.3      2021-11-30 [1] CRAN (R 4.1.1)
#>  digest        0.6.29     2021-12-01 [1] CRAN (R 4.1.1)
#>  downlit       0.4.0      2021-10-29 [1] CRAN (R 4.1.1)
#>  ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.1.0)
#>  exemplar    * 0.0.0.9000 2022-03-20 [1] Github (mdneuzerling/exemplar@19b310b)
#>  fansi         1.0.2      2022-01-14 [1] CRAN (R 4.1.1)
#>  fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.1.0)
#>  fs            1.5.2      2021-12-08 [1] CRAN (R 4.1.1)
#>  glue          1.6.2      2022-02-24 [1] CRAN (R 4.1.0)
#>  htmltools     0.5.2      2021-08-25 [1] CRAN (R 4.1.1)
#>  hugodown      0.0.0.9000 2021-09-18 [1] Github (r-lib/hugodown@168a361)
#>  knitr         1.37       2021-12-16 [1] CRAN (R 4.1.1)
#>  lifecycle     1.0.1      2021-09-24 [1] CRAN (R 4.1.1)
#>  magrittr      2.0.2      2022-01-26 [1] CRAN (R 4.1.1)
#>  memoise       2.0.1      2021-11-26 [1] CRAN (R 4.1.1)
#>  pillar        1.7.0      2022-02-01 [1] CRAN (R 4.1.1)
#>  pkgbuild      1.3.1      2021-12-20 [1] CRAN (R 4.1.1)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
#>  pkgload       1.2.4      2021-11-30 [1] CRAN (R 4.1.1)
#>  prettycode    1.1.0      2019-12-16 [1] CRAN (R 4.1.0)
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.1.0)
#>  processx      3.5.2      2021-04-30 [1] CRAN (R 4.1.0)
#>  ps            1.6.0      2021-02-28 [1] CRAN (R 4.1.0)
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.1.0)
#>  R.cache       0.15.0     2021-04-30 [1] CRAN (R 4.1.0)
#>  R.methodsS3   1.8.1      2020-08-26 [1] CRAN (R 4.1.0)
#>  R.oo          1.24.0     2020-08-26 [1] CRAN (R 4.1.0)
#>  R.utils       2.11.0     2021-09-26 [1] CRAN (R 4.1.0)
#>  R6            2.5.1      2021-08-19 [1] CRAN (R 4.1.1)
#>  rematch2      2.1.2      2020-05-01 [1] CRAN (R 4.1.0)
#>  remotes       2.4.2      2021-11-30 [1] CRAN (R 4.1.1)
#>  rlang         1.0.1      2022-02-03 [1] CRAN (R 4.1.1)
#>  rmarkdown     2.11       2021-09-14 [1] CRAN (R 4.1.1)
#>  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.1.0)
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.1.1)
#>  stringi       1.7.6      2021-11-29 [1] CRAN (R 4.1.1)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.6.2      2021-09-23 [1] CRAN (R 4.1.0)
#>  testthat      3.1.2      2022-01-20 [1] CRAN (R 4.1.0)
#>  tibble        3.1.6      2021-11-07 [1] CRAN (R 4.1.1)
#>  tidyselect  * 1.1.2      2022-02-21 [1] CRAN (R 4.1.1)
#>  usethis       2.1.5      2021-12-09 [1] CRAN (R 4.1.1)
#>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.3.8      2021-04-29 [1] CRAN (R 4.1.0)
#>  withr         2.4.3      2021-11-30 [1] CRAN (R 4.1.1)
#>  xfun          0.29       2021-12-14 [1] CRAN (R 4.1.1)
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.1.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────