I've been playing around with an idea for a new R package. I call it exemplar and here's how it works: I provide an example of what data should look like (an exemplar), and the package generates a function that checks that any new data looks the same. For each column, the generated function checks for duplicate values, missing values, ranges, and more.
The validation function doesn't have any dependencies at all. I need exemplar to generate it, but not to use it.
In this post I'll give some examples of how it works and what sort of things are validated.
I doubt I'll ever submit exemplar to CRAN. What I've done here isn't substantial enough to justify a CRAN submission, and it's a fairly niche tool. I'm happy to be convinced otherwise, but for now this will stay on GitHub and can be installed with:
remotes::install_github("mdneuzerling/exemplar")
I'll also be using the tidyselect package for the examples below. I'll load that now. Most people never load this package directly, but it's one of the main components of dplyr.
Some examples
The generated validation functions for data frames can get pretty long, since they include checks for each column. To keep things brief I'll check just the wt and mpg columns of mtcars:
exemplar(mtcars, wt, mpg)
#> validate_mtcars <- function(data) {
#> stopifnot(exprs = {
#> is.data.frame(data)
#> # The data is potentially being subsetted so this assertion has been disabled:
#> # identical(colnames(data), c("wt", "mpg"))
#>
#> "wt" %in% colnames(data)
#> is.double(data[["wt"]])
#> !any(is.na(data[["wt"]]) | is.null(data[["wt"]]))
#> # Duplicate values were detected so this assertion has been disabled:
#> # !any(duplicated(data[["wt"]]))
#> min(data[["wt"]], na.rm = TRUE) > 0 # all positive
#> # (Un)comment or modify the below range assertions if needed:
#> # max(data[["wt"]], na.rm = TRUE) <= 5.424
#> # 1.513 <= min(data[["wt"]], na.rm = TRUE)
#> # (Un)comment or modify the below deviance assertions if needed.
#> # The mean is 3.22 and the standard deviation is 0.98:
#> # max(data[["wt"]], na.rm = TRUE) <= 3.22 + 4 * 0.98
#> # 3.22 - 4 * 0.98 <= max(data[["wt"]], na.rm = TRUE)
#>
#> "mpg" %in% colnames(data)
#> is.double(data[["mpg"]])
#> !any(is.na(data[["mpg"]]) | is.null(data[["mpg"]]))
#> # Duplicate values were detected so this assertion has been disabled:
#> # !any(duplicated(data[["mpg"]]))
#> min(data[["mpg"]], na.rm = TRUE) > 0 # all positive
#> # (Un)comment or modify the below range assertions if needed:
#> # max(data[["mpg"]], na.rm = TRUE) <= 33.9
#> # 10.4 <= min(data[["mpg"]], na.rm = TRUE)
#> # (Un)comment or modify the below deviance assertions if needed.
#> # The mean is 20.09 and the standard deviation is 6.03:
#> # max(data[["mpg"]], na.rm = TRUE) <= 20.09 + 4 * 6.03
#> # 20.09 - 4 * 6.03 <= max(data[["mpg"]], na.rm = TRUE)
#> })
#> invisible(TRUE)
#> }
It's pretty comprehensive! And the comments explain what's going on. I can take this function, modify it, and use it to check any new mtcars-like data.
If any assertion is violated, an error is raised with the offending line of code. If everything checks out then TRUE is returned invisibly. There is a downside here, in that when a single assertion fails the function will not check the rest.
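To make the fail-fast behaviour concrete, here's a minimal sketch in base R. The validator is a hand-trimmed stand-in for a generated function, and `check_all` is a hypothetical helper that isn't part of exemplar; it only illustrates one way to report every failure rather than just the first.

```r
# A hand-trimmed stand-in for a generated validation function (illustrative only).
validate_wt_column <- function(data) {
  stopifnot(exprs = {
    is.data.frame(data)
    "wt" %in% colnames(data)
    !any(is.na(data[["wt"]]))
    min(data[["wt"]], na.rm = TRUE) > 0
  })
  invisible(TRUE)
}

# Break two assertions at once: one missing value and one negative value.
bad <- mtcars[, c("wt", "mpg")]
bad$wt[1] <- NA
bad$wt[2] <- -1

# stopifnot() stops at the first failing expression, so the error message
# mentions the NA check and never reaches the positivity check.
first_failure <- tryCatch(
  validate_wt_column(bad),
  error = function(e) conditionMessage(e)
)
print(first_failure)

# Hypothetical helper (not part of exemplar): run every check independently
# and collect the names of all that fail.
check_all <- function(data, checks) {
  passed <- vapply(
    checks,
    function(check) isTRUE(tryCatch(check(data), error = function(e) FALSE)),
    logical(1)
  )
  names(checks)[!passed]
}

checks <- list(
  no_missing_wt = function(d) !any(is.na(d[["wt"]])),
  positive_wt = function(d) min(d[["wt"]], na.rm = TRUE) > 0
)
all_failures <- check_all(bad, checks)
print(all_failures)
```

The trade-off is that a list of named check functions is more verbose than a single `stopifnot()` block, so this is only worth it when a full report of failures matters more than brevity.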
In the above example I only checked the wt and mpg columns. When I'm validating data I often care about only a few columns. The exemplar function supports tidyselect, just like dplyr. All of the following will work:
exemplar(mtcars, wt, mpg)
exemplar(mtcars, -cyl)
exemplar(mtcars, vs:carb)
exemplar(mtcars, any_of(c("qsec", "notacolumn")))
exemplar(mtcars, starts_with("d"))
The exemplar package also generates validation functions for individual vectors:
exemplar(mtcars$wt)
#> validate_mtcars_wt <- function(data) {
#> stopifnot(exprs = {
#> is.double(data)
#> !any(is.na(data) | is.null(data))
#> # Duplicate values were detected so this assertion has been disabled:
#> # !any(duplicated(data))
#> min(data, na.rm = TRUE) > 0 # all positive
#> # (Un)comment or modify the below range assertions if needed:
#> # max(data, na.rm = TRUE) <= 5.424
#> # 1.513 <= min(data, na.rm = TRUE)
#> # (Un)comment or modify the below deviance assertions if needed.
#> # The mean is 3.22 and the standard deviation is 0.98:
#> # max(data, na.rm = TRUE) <= 3.22 + 4 * 0.98
#> # 3.22 - 4 * 0.98 <= max(data, na.rm = TRUE)
#> })
#> invisible(TRUE)
#> }
Note how the validation function is named after the input. The function name can be specified with the .function_suffix parameter:
exemplar(runif(100, -10, 10), .function_suffix = "random_numbers")
#> validate_random_numbers <- function(data) {
#> stopifnot(exprs = {
#> is.double(data)
#> !any(is.na(data) | is.null(data))
#> !any(duplicated(data))
#> # (Un)comment or modify the below range assertions if needed:
#> # max(data, na.rm = TRUE) <= 9.95231169741601
#> # -9.70273485872895 <= min(data, na.rm = TRUE)
#> # (Un)comment or modify the below deviance assertions if needed.
#> # The mean is 0.59 and the standard deviation is 5.8:
#> # max(data, na.rm = TRUE) <= 0.59 + 4 * 5.8
#> # 0.59 - 4 * 5.8 <= max(data, na.rm = TRUE)
#> })
#> invisible(TRUE)
#> }
What's validated?
The intention is that users will take these validations as a starting point and make adjustments as needed. Some assertions will be commented out by default, with a comment explaining why.
For a vector:
- the data type is first checked
- assertions for no missing or duplicate values are included, but if the input data violates these assertions then the statements will be commented out with an explanation
- the sign is checked. If the input is all positive, non-negative, negative, or non-positive, then an assertion for this will be included.
- range assertions and deviance assertions (limits on how many standard deviations new values may lie from the mean of the input) are included, but commented out by default.
Alternatively, range assertions can be enabled with the .enable_range_assertions argument and deviance assertions with .enable_deviance_assertions. By default the .allowed_deviance is 4; that is, new data must lie within 4 standard deviations of the mean, with both statistics taken from the exemplar. This too can be adjusted.
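As a worked example of the deviance bounds, here's a small base R sketch using mtcars$wt as the exemplar. The names `allowed_deviance`, `lower`, `upper` and `within_bounds` are made up for illustration, and this sketch compares the lower bound against the minimum; it's a sketch of the arithmetic, not the package's internals.

```r
exemplar_wt <- mtcars$wt
allowed_deviance <- 4  # same value as the default described above

# Statistics of the exemplar, rounded as in the generated comments.
centre <- round(mean(exemplar_wt), 2)  # 3.22
spread <- round(sd(exemplar_wt), 2)    # 0.98

# New data is expected to fall inside centre +/- allowed_deviance * spread.
lower <- centre - allowed_deviance * spread
upper <- centre + allowed_deviance * spread

# Illustrative check: TRUE when every non-missing value is inside the bounds.
within_bounds <- function(data) {
  lower <= min(data, na.rm = TRUE) && max(data, na.rm = TRUE) <= upper
}

within_bounds(mtcars$wt)  # the exemplar itself sits comfortably inside
within_bounds(c(3, 8))    # 8 is above the upper bound
```

With a mean of 3.22 and a standard deviation of 0.98 the bounds work out to roughly -0.70 and 7.14, which is much looser than the observed range of 1.513 to 5.424. That's the point of the deviance assertions: they tolerate values a little outside what the exemplar happened to contain.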
Assertions for a data frame will include assertions for all of the selected columns, and will also check that those columns are present. There is also an assertion that those columns are the only columns present, but this will be disabled if exemplar is asked to generate a validation function for only a selection of columns in the data frame.
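The difference between the two column checks is easy to see in base R. In this sketch `has_columns` and `has_only_columns` are hypothetical helper names; the generated functions inline the equivalent expressions instead.

```r
selected <- c("wt", "mpg")

# Presence check: every selected column exists, extra columns are fine.
has_columns <- function(data, columns) {
  all(columns %in% colnames(data))
}

# Exactness check: the columns are exactly the selected ones, in that order.
has_only_columns <- function(data, columns) {
  identical(colnames(data), columns)
}

has_columns(mtcars, selected)                    # TRUE
has_only_columns(mtcars, selected)               # FALSE: mtcars has 11 columns
has_only_columns(mtcars[, selected], selected)   # TRUE
```

This is why the exactness assertion is commented out when only some columns are selected: the full data frame would fail it even though every selected column is present and valid.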
How is this different to other data validation packages?
If I have a clear idea of what to validate in a data frame, then I'll just write the assertions using assertthat. If those assertions are complicated then I'll use a package like assertr.
The exemplar package doesn't provide any additional tools for validating data. In fact, it's deliberately restricted to base R (≥ 3.5) to ensure that the generated functions don't require any installed packages.
What exemplar does do is generate the validation functions automatically, based on an ideal output. This could be useful for, say, machine learning. Perhaps an exemplar is generated on training data and is used to validate test data, or any new data that needs to be scored.
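To give a feel for what "generating a validation function" means, here's a toy generator in base R. To be clear, `toy_exemplar` is a hypothetical stand-in written for this illustration and is not how exemplar is implemented; it handles only numeric vectors and emits just type, missing-value and range assertions. The "training" data is simply the first 24 rows of mtcars.

```r
# Toy generator (NOT the exemplar package): returns the source code of a
# validation function as a character vector, one element per line.
toy_exemplar <- function(x, name) {
  stopifnot(is.numeric(x), length(x) > 0)
  no_missing <- "!any(is.na(data))"
  if (anyNA(x)) {
    # The exemplar itself has missing values, so disable this assertion.
    no_missing <- paste("#", no_missing)
  }
  c(
    sprintf("validate_%s <- function(data) {", name),
    "  stopifnot(exprs = {",
    sprintf("    is.%s(data)", typeof(x)),
    paste0("    ", no_missing),
    sprintf("    %s <= min(data, na.rm = TRUE)", deparse(min(x, na.rm = TRUE))),
    sprintf("    max(data, na.rm = TRUE) <= %s", deparse(max(x, na.rm = TRUE))),
    "  })",
    "  invisible(TRUE)",
    "}"
  )
}

# Generate from "training" data, then inspect and define the function.
train <- mtcars$wt[1:24]
code <- toy_exemplar(train, "wt")
cat(code, sep = "\n")
eval(parse(text = code))

# The generated function needs nothing but base R to run.
validate_wt(train)          # passes silently
validate_wt(c(2.5, 3.1))    # new data inside the training range also passes

# New data outside the training range raises an error naming the assertion.
outcome <- tryCatch(
  validate_wt(c(2.5, 9.9)),
  error = function(e) conditionMessage(e)
)
print(outcome)
```

Because the output is plain source code, it can be pasted into a scoring script and edited by hand, which is the same workflow the real generated functions are meant for.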
The image at the top of this page is by Tima Miroshnichenko and is used under the terms of the Pexels License.
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.1.0 (2021-05-18)
#> os macOS Big Sur 11.3
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_AU.UTF-8
#> ctype en_AU.UTF-8
#> tz Australia/Melbourne
#> date 2022-03-20
#> pandoc 2.11.4 @ /Applications/RStudio.app/Contents/MacOS/pandoc/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.1)
#> brio 1.1.3 2021-11-30 [1] CRAN (R 4.1.1)
#> cachem 1.0.6 2021-08-19 [1] CRAN (R 4.1.1)
#> callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0)
#> cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.1)
#> crayon 1.5.0 2022-02-14 [1] CRAN (R 4.1.1)
#> desc 1.4.0 2021-09-28 [1] CRAN (R 4.1.1)
#> devtools 2.4.3 2021-11-30 [1] CRAN (R 4.1.1)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.1)
#> downlit 0.4.0 2021-10-29 [1] CRAN (R 4.1.1)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
#> exemplar * 0.0.0.9000 2022-03-20 [1] Github (mdneuzerling/exemplar@19b310b)
#> fansi 1.0.2 2022-01-14 [1] CRAN (R 4.1.1)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.1)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.0)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
#> hugodown 0.0.0.9000 2021-09-18 [1] Github (r-lib/hugodown@168a361)
#> knitr 1.37 2021-12-16 [1] CRAN (R 4.1.1)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1)
#> magrittr 2.0.2 2022-01-26 [1] CRAN (R 4.1.1)
#> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.1)
#> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.1)
#> pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.1.1)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
#> pkgload 1.2.4 2021-11-30 [1] CRAN (R 4.1.1)
#> prettycode 1.1.0 2019-12-16 [1] CRAN (R 4.1.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
#> processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0)
#> ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.1.0)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.0)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.0)
#> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.1.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
#> rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.1.0)
#> remotes 2.4.2 2021-11-30 [1] CRAN (R 4.1.1)
#> rlang 1.0.1 2022-02-03 [1] CRAN (R 4.1.1)
#> rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.1)
#> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.1)
#> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.1)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
#> styler 1.6.2 2021-09-23 [1] CRAN (R 4.1.0)
#> testthat 3.1.2 2022-01-20 [1] CRAN (R 4.1.0)
#> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.1)
#> tidyselect * 1.1.2 2022-02-21 [1] CRAN (R 4.1.1)
#> usethis 2.1.5 2021-12-09 [1] CRAN (R 4.1.1)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
#> withr 2.4.3 2021-11-30 [1] CRAN (R 4.1.1)
#> xfun 0.29 2021-12-14 [1] CRAN (R 4.1.1)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────