First Impressions of Julia from an R User

2020-09-16

It’s no secret that I love R and begrudgingly use Python. But there’s a another option for data science, and it promises the speed of C with the ease of use of R/Python. That language is Julia, and it’s a delight to use. I took some time to learn the basics, and I’m sharing my impressions here.

Julia is not the most popular language in the world

Before I go on, there’s one thing I want to stress here: Julia is not as popular as Python or R for doing stuff with data. It doesn’t have the vast library of packages/modules that the veteran languages have built up over decades.

If this is a deal-breaker, then don’t worry about Julia. If you want to use the languages with the most packages, or what looks best on your resume, then Python or R are better options. I’m not disparaging you either; these are good reasons.

But if you — like me — find that writing code to do stuff with data is fun, then it’s well worth checking out Julia. I just wanted to get that common objection out of the way first.

Multiple dispatch

People are often surprised to discover that R uses a lot of object-oriented programming. R objects store their classes as vectors, and can be accessed (and modified) with the class function:

# R
a_tibble <- tibble::tibble(x = c(1, 2, 3), y = c(4, 5, 6))
class(a_tibble)

#> [1] "tbl_df"     "tbl"        "data.frame"

In R, the main implementation of object-oriented programming is through a system called S3. When I call a generic function, like print, R looks at the class of the first argument to determine which print method to use. If I print a data frame, then print.data.frame is used. If I print a linear model, then print.lm is used. And there are a lot of print methods:

# R
sloop::s3_methods_generic("print")

#> # A tibble: 322 x 4
#>    generic class       visible source             
#>    <chr>   <chr>       <lgl>   <chr>              
#>  1 print   acf         FALSE   registered S3method
#>  2 print   AES         FALSE   registered S3method
#>  3 print   all_vars    FALSE   registered S3method
#>  4 print   anova       FALSE   registered S3method
#>  5 print   ansi_string FALSE   registered S3method
#>  6 print   ansi_style  FALSE   registered S3method
#>  7 print   any_vars    FALSE   registered S3method
#>  8 print   aov         FALSE   registered S3method
#>  9 print   aovlist     FALSE   registered S3method
#> 10 print   ar          FALSE   registered S3method
#> # … with 312 more rows

This process of figuring out which method to use is called dispatch. S3 is a single dispatch system: only the classes of the first argument to the function are used to determine which method to call¹. Julia takes it a step further with multiple dispatch. And it works on types, not classes.

A simple example of multiple dispatch: animals

Here’s a simple example of multiple dispatch². I’ll create an abstract type in Julia, called Animal:

# Julia
abstract type Animal end

I’ll now create two subtypes of animal: Cat and Dog. These will be each be a struct, which is a type composed of other types. In this case, each of the two new types will have a Name of type String, and Age of type Int:

struct Cat <: Animal
    Name::String 
    Age::Int
end

struct Dog <: Animal
    Name::String 
    Age::Int
end

Now I’ll define an interaction function which will output a string describing the interaction of two animals. Actually, I’ll define 4 methods for this function. The method that’s called will depend upon the types of both arguments:

interaction(x::Cat, y::Cat) = "meow"


#> interaction (generic function with 1 method)

interaction(x::Dog, y::Dog) = "sniff"


#> interaction (generic function with 2 methods)

interaction(x::Cat, y::Dog) = "growl"


#> interaction (generic function with 3 methods)

interaction(x::Dog, y::Cat) = interaction(y, x)


#> interaction (generic function with 4 methods)

In any other language, it would look like I’m defining a function and then overwriting it three times. But with Julia, functions are unique up to name and type signature. interaction(x::Cat, y::Cat) has a type signature of (Cat, Cat) and interaction(x::Dog, y::Dog) has a type signature of (Dog, Dog). So instead of overwriting the function I simply add a new generic each time.

I’ll define some cats and dogs based on those I know around my neighbourhood, such as the friendly golden retriever Hudson who always says good morning to me, or the stylish mini-whippet Phoebe who has a new outfit every time I see her. The semicolon here tells Julia to suppress the output:

luna = Cat("Luna", 1);
pip = Cat("Pip", 5);
hudson = Dog("Hudson", 4);
phoebe = Dog("Phoebe", 1);

And now lets see how these animals would interact:

interaction(luna, pip)


#> "meow"

interaction(pip, luna)


#> "meow"

interaction(hudson, phoebe)


#> "sniff"

interaction(luna, phoebe)


#> "growl"

Just as expected! These animals have different interactions based on their types.

Of course, I could accomplish the same thing with single dispatch and some if-else statements. But if I do that and I later want to extend the interactions to cover other animals, I have to go back and change those if-else statements. With Julia, it’s just a matter of defining new functions.

Going up the type hierarchy

Earlier I defined an Animal type, which is a supertype of Cat and Dog. Julia will always use the most specific method it can find. If one doesn’t exist for the exact types, then it will consider supertypes. I can demonstrate this by creating a generic with type signature (Animal, Animal), to be used when no more specific function can be found:

struct Gazelle <: Animal
    Name::String 
    Age::Int
end;

bob = Gazelle("Bob", 2);
interaction(x::Animal, y::Animal) = "flee";
interaction(hudson, bob)


#> "flee"

I haven’t defined a method like interaction(x::Dog, y::Gazelle), so Julia goes up the type hierarchy to find interaction(x::Animal, y::Animal) instead.

Multiple dispatch is a core concept of Julia, and one of the main reasons that it’s so much faster than R or Python. Julia can compile fast functions for each type signature. And there are a lot of type signatures to consider. On my machine I count 184 methods for the + operator!

The power of macros

I’ve become so used to R’s metaprogramming features that I think I would struggle with any language that doesn’t let me treat code as data to be manipulated. Julia delivers in the form of macros.

I’ll give an example. In most languages, the or logical operator is short-circuited: If the first argument is true, then the second argument isn’t evaluated. This behaviour exists in R with its || short-circuited operator. If the first argument to || is TRUE, then R doesn’t evaluate the second argument. I can even make the second argument something that throws an error and R won’t complain, as long as the first argument is TRUE:

# R
stop("Oh no!") || TRUE

#> Error in eval(expr, envir, enclos): Oh no!

TRUE || stop("Oh no!")

#> [1] TRUE

Suppose I wanted to create a backwards-or function, bor, that does the same thing but evaluates the second argument first. That is, I want a function bor(x, y) that acts just like ||, but doesn’t evaluate x if y is TRUE. This is pretty easy in R, and I don’t even have to take advantage of metaprogramming, since R is lazily-evaluated:

# R
bor <- function(x, y) y || x

bor(TRUE, stop("Oh no!"))

#> Error in bor(TRUE, stop("Oh no!")): Oh no!

bor(stop("Oh no!"), TRUE)

#> [1] TRUE

Julia code is evaluated eagerly, so this won’t work:

# Julia
bor(x, y) = y || x
bor(error("Oh no!"), true)

# ERROR: Oh no!
# Stacktrace:
#  [1] error(::String) at ./error.jl:33
#  [2] top-level scope at REPL[38]:1
#  [3] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1088

A macro, however, lets me move code around before it is evaluated. Macros start with the @ symbol:

macro bor(a, b) 
    return :($b || $a)
end;
@bor(error("Oh no!"), true)


#> true

The : and $ symbols are the metaprogramming power here. The : prefix converts to a symbol or expression, whereas $ evaluates or interpolates the expression. This is somewhat analogous to base R’s quote and eval functions.

These two symbols can even be combined to perform what in R is sometimes called quasiquotation, where some things are quoted but others are explicitly evaluated:

a = 1;
:($a + b)


#> :(1 + b)

Reading in CSV data

Now that I’ve laid down some core concepts of Julia, I’ll share some of my experiences with using the language for the first time.

The first thing that impressed me was that Julia comes with its own package-management system: a package called Pkg. I tried to load the DataFrames package when it wasn’t installed and Julia gave me the explicit command for installing it. Excellent!

Reading CSVs into a data frame works exactly like you would expect. It’s pretty darn fast, though; I was able to load in a 2GB CSV of StackExchange data in under 5 seconds, and my machine isn’t very powerful.

using DataFrames
using CSV

questions = CSV.read("Questions.csv")


#> 1264216×7 DataFrame. Omitted printing of 4 columns
#> │ Row     │ Id       │ OwnerUserId │ CreationDate         │
#> │         │ Int64    │ String      │ String               │
#> ├─────────┼──────────┼─────────────┼──────────────────────┤
#> │ 1       │ 80       │ 26          │ 2008-08-01T13:57:07Z │
#> │ 2       │ 90       │ 58          │ 2008-08-01T14:41:24Z │
#> │ 3       │ 120      │ 83          │ 2008-08-01T15:50:08Z │
#> │ 4       │ 180      │ 2089740     │ 2008-08-01T18:42:19Z │
#> │ 5       │ 260      │ 91          │ 2008-08-01T23:22:08Z │
#> │ 6       │ 330      │ 63          │ 2008-08-02T02:51:36Z │
#> │ 7       │ 470      │ 71          │ 2008-08-02T15:11:47Z │
#> ⋮
#> │ 1264209 │ 40143150 │ 5496690     │ 2016-10-19T23:31:41Z │
#> │ 1264210 │ 40143170 │ 2010246     │ 2016-10-19T23:33:42Z │
#> │ 1264211 │ 40143190 │ 333403      │ 2016-10-19T23:36:01Z │
#> │ 1264212 │ 40143210 │ 5610777     │ 2016-10-19T23:38:01Z │
#> │ 1264213 │ 40143300 │ 3791161     │ 2016-10-19T23:48:09Z │
#> │ 1264214 │ 40143340 │ 7028647     │ 2016-10-19T23:52:50Z │
#> │ 1264215 │ 40143360 │ 871677      │ 2016-10-19T23:55:24Z │
#> │ 1264216 │ 40143380 │ 6823982     │ 2016-10-19T23:57:31Z │

I like how data frames print in Julia, with the type below the column name. Something that’s missing here is a list of columns which have been omitted from printing, similar to the behaviour of the tibble package in R. And the data frame truncation is a little aggressive; sometimes columns are not printed even though there’s room.

The CSV.read function uses multiple threads by default. I couldn’t get this to work on my machine, and I had to restrict it to one thread. I raised an issue on the GitHub page, and a maintainer came along to explain what was going on and implement a fix! What more can you ask for?

Data frames

Basic data frame functions are very similar to those in R:

nrow(questions)


#> 1264216

ncol(questions)


#> 7

head(questions)


#> 6×7 DataFrame. Omitted printing of 3 columns
#> │ Row │ Id    │ OwnerUserId │ CreationDate         │ ClosedDate           │
#> │     │ Int64 │ String      │ String               │ String               │
#> ├─────┼───────┼─────────────┼──────────────────────┼──────────────────────┤
#> │ 1   │ 80    │ 26          │ 2008-08-01T13:57:07Z │ NA                   │
#> │ 2   │ 90    │ 58          │ 2008-08-01T14:41:24Z │ 2012-12-26T03:45:49Z │
#> │ 3   │ 120   │ 83          │ 2008-08-01T15:50:08Z │ NA                   │
#> │ 4   │ 180   │ 2089740     │ 2008-08-01T18:42:19Z │ NA                   │
#> │ 5   │ 260   │ 91          │ 2008-08-01T23:22:08Z │ NA                   │
#> │ 6   │ 330   │ 63          │ 2008-08-02T02:51:36Z │ NA                   │

The describe function, analogous to R’s summary function`, is also quite nice. It presents a bit more information than the R equivalent, and it returns another data frame:

describe(questions)


#> 7×8 DataFrame. Omitted printing of 6 columns
#> │ Row │ variable     │ mean      │
#> │     │ Symbol       │ Union…    │
#> ├─────┼──────────────┼───────────┤
#> │ 1   │ Id           │ 2.13275e7 │
#> │ 2   │ OwnerUserId  │           │
#> │ 3   │ CreationDate │           │
#> │ 4   │ ClosedDate   │           │
#> │ 5   │ Score        │ 1.78154   │
#> │ 6   │ Title        │           │
#> │ 7   │ Body         │           │

And I can retrieve an individual column (as an array) with either questions.Score or questions["Score"].

Piping

Julia supports piping. Thank goodness, too, because the tidyverse has spoilt me.

There’s a native pipe |> which supposedly puts the value on the left as the first argument to the function on the right, but I found it a bit finnicky. The Pipe package improves this substantially by allowing the use of a _ placeholder, for example, @pipe 4 |> sqrt(_). This also opens up the possibility of piping into an argument other than the first. The only downside here is that a chain of piped functions must begin with the @pipe macro.

It would be neat if this @pipe macro were incorporated into base Julia, but on more than one occasion I’ve seen serious discussions in Julia for removing the base |> operator altogether. I hope that it stays, because a chain of piped functions is a beautiful sight.

Data manipulation

I had a bit of trouble with DataFrames syntax. The easiest way I could find to manipulate data was to modify-in-place with a series of reassignments. I’ll give an example. My questions data frame has a Score column. Suppose I want to scale it by 100, and then select all scores above 50 (this is some fairly arbitrary data manipulation). I then want to describe the data:

questions2 = copy(questions);
questions2.Score = questions2.Score * 100;
questions2 = questions2[questions2.Score .> 50, :];
describe(questions2)


#> 7×8 DataFrame. Omitted printing of 6 columns
#> │ Row │ variable     │ mean      │
#> │     │ Symbol       │ Union…    │
#> ├─────┼──────────────┼───────────┤
#> │ 1   │ Id           │ 1.84611e7 │
#> │ 2   │ OwnerUserId  │           │
#> │ 3   │ CreationDate │           │
#> │ 4   │ ClosedDate   │           │
#> │ 5   │ Score        │ 403.745   │
#> │ 6   │ Title        │           │
#> │ 7   │ Body         │           │

I had a much easier time with the DataFramesMeta package. This provides is an excellent, pipe-friendly package that implements some dplyr-like functionality, using macros:

using Pipe
using DataFramesMeta

@pipe questions |> 
  @transform(_, Score = 100 * :Score) |>
  @where(_, :Score .> 50) |>
  describe(_)


#> 7×8 DataFrame. Omitted printing of 6 columns
#> │ Row │ variable     │ mean      │
#> │     │ Symbol       │ Union…    │
#> ├─────┼──────────────┼───────────┤
#> │ 1   │ Id           │ 1.84611e7 │
#> │ 2   │ OwnerUserId  │           │
#> │ 3   │ CreationDate │           │
#> │ 4   │ ClosedDate   │           │
#> │ 5   │ Score        │ 403.745   │
#> │ 6   │ Title        │           │
#> │ 7   │ Body         │           │

Syntactic conventions

In the above chain, I used .> to filter for values greater than 50. Julia functions and operators implement some very handy syntactic conventions that make life easier, and this is one of them. Vectorised operations are prefixed (or sometimes suffixed) with a dot:

x = ["a", "b", "c"]; y = ["a", "e", "c"];

x == y


#> false

x .== y


#> 3-element BitArray{1}:
#>  1
#>  0
#>  1

(2020-09-16 correction: Ari Katz points out that the dot is not a convention but an actual feature of the language that vectorises operators and functions in a fast and memory-efficient way. See this post for more details.)

Another convention is that functions that modify in place use an exclamation mark. For example, df = sort(df) is equivalent to sort!(df). This often trips me up in Python, so I felt genuine relief to find that in Julia I don’t have to guess whether or not a function will modify an argument in-place.

The real killer feature of Julia

I’ve glossed over subjects like speed and compilation because these are areas which I’m not confident discussing. The rough idea is that Julia is fast, and the compiled code is often very similar to that of C. But a really powerful consequence of this is that a user doesn’t need to learn another language to become a contributor.

If I want to write a new package/module for R/Python, and my use-case requires speedy code, then I’m pretty much obliged to drop down to C/C++. I have the original learning curve for the language I actually want to use, followed by a second learning curve for the lower language. With Julia it’s just… Julia. All Julia. This means that you can have a machine learning library that’s 100% Julia code.

Whenever I read the source code of an R/Python package and I see a reference to a file that ends with “.c” or “.h” I panic a little. Julia’s killer feature is removing that panic moment. Knowing Julia is enough to use Julia.

Just give this language a go

devtools::session_info()

#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       Ubuntu 20.04.1 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language en_AU:en                    
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2020-09-16                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                            
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)                    
#>  backports     1.1.9      2020-08-24 [1] CRAN (R 4.0.0)                    
#>  blob          1.2.1      2020-01-20 [1] CRAN (R 4.0.0)                    
#>  broom         0.7.0      2020-07-09 [1] CRAN (R 4.0.0)                    
#>  callr         3.4.4      2020-09-07 [1] CRAN (R 4.0.0)                    
#>  cellranger    1.1.0      2016-07-27 [1] CRAN (R 4.0.0)                    
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 4.0.0)                    
#>  colorspace    1.4-1      2019-03-18 [1] CRAN (R 4.0.0)                    
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 4.0.0)                    
#>  DBI           1.1.0      2019-12-15 [1] CRAN (R 4.0.0)                    
#>  dbplyr        1.4.4      2020-05-27 [1] CRAN (R 4.0.0)                    
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 4.0.0)                    
#>  devtools      2.3.0      2020-04-10 [1] CRAN (R 4.0.0)                    
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 4.0.0)                    
#>  downlit       0.1.0.9000 2020-09-15 [1] Github (r-lib/downlit@e420a84)    
#>  dplyr       * 1.0.2      2020-08-18 [1] CRAN (R 4.0.0)                    
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.0)                    
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)                    
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.0)                    
#>  forcats     * 0.5.0      2020-03-01 [1] CRAN (R 4.0.0)                    
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.0)                    
#>  generics      0.0.2      2018-11-29 [1] CRAN (R 4.0.0)                    
#>  ggplot2     * 3.3.2.9000 2020-08-07 [1] Github (tidyverse/ggplot2@6d91349)
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.0)                    
#>  gtable        0.3.0      2019-03-25 [1] CRAN (R 4.0.0)                    
#>  haven         2.2.0      2019-11-08 [1] CRAN (R 4.0.0)                    
#>  hms           0.5.3      2020-01-08 [1] CRAN (R 4.0.0)                    
#>  htmltools     0.5.0      2020-06-16 [1] CRAN (R 4.0.0)                    
#>  httr          1.4.2      2020-07-20 [1] CRAN (R 4.0.0)                    
#>  hugodown      0.0.0.9000 2020-09-15 [1] Github (r-lib/hugodown@e4c6737)   
#>  jsonlite      1.7.0      2020-06-25 [1] CRAN (R 4.0.0)                    
#>  JuliaCall     0.17.1     2019-11-27 [1] CRAN (R 4.0.0)                    
#>  knitr         1.29       2020-06-23 [1] CRAN (R 4.0.0)                    
#>  lattice       0.20-41    2020-04-02 [4] CRAN (R 4.0.0)                    
#>  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.0)                    
#>  lubridate     1.7.9      2020-06-08 [1] CRAN (R 4.0.0)                    
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 4.0.0)                    
#>  Matrix        1.2-18     2019-11-27 [4] CRAN (R 4.0.0)                    
#>  memoise       1.1.0.9000 2020-05-09 [1] Github (hadley/memoise@4aefd9f)   
#>  modelr        0.1.6      2020-02-22 [1] CRAN (R 4.0.0)                    
#>  munsell       0.5.0      2018-06-12 [1] CRAN (R 4.0.0)                    
#>  pillar        1.4.6      2020-07-10 [1] CRAN (R 4.0.0)                    
#>  pkgbuild      1.1.0      2020-07-13 [1] CRAN (R 4.0.0)                    
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.0)                    
#>  pkgload       1.1.0      2020-05-29 [1] CRAN (R 4.0.0)                    
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.0.0)                    
#>  processx      3.4.4      2020-09-03 [1] CRAN (R 4.0.0)                    
#>  ps            1.3.4      2020-08-11 [1] CRAN (R 4.0.0)                    
#>  purrr       * 0.3.4      2020-04-17 [1] CRAN (R 4.0.0)                    
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 4.0.0)                    
#>  Rcpp          1.0.5      2020-07-06 [1] CRAN (R 4.0.0)                    
#>  readr       * 1.3.1      2018-12-21 [1] CRAN (R 4.0.0)                    
#>  readxl        1.3.1      2019-03-13 [1] CRAN (R 4.0.0)                    
#>  remotes       2.1.1      2020-02-15 [1] CRAN (R 4.0.0)                    
#>  reprex        0.3.0      2019-05-16 [1] CRAN (R 4.0.0)                    
#>  reticulate    1.16       2020-05-27 [1] CRAN (R 4.0.0)                    
#>  rlang         0.4.7      2020-07-09 [1] CRAN (R 4.0.0)                    
#>  rmarkdown     2.3.5      2020-09-15 [1] Github (rstudio/rmarkdown@949c7e3)
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 4.0.0)                    
#>  rstudioapi    0.11       2020-02-07 [1] CRAN (R 4.0.0)                    
#>  rvest         0.3.5      2019-11-08 [1] CRAN (R 4.0.0)                    
#>  scales        1.1.1      2020-05-11 [1] CRAN (R 4.0.0)                    
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)                    
#>  sloop         1.0.1      2019-02-17 [1] CRAN (R 4.0.0)                    
#>  stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.0)                    
#>  stringr     * 1.4.0      2019-02-10 [1] CRAN (R 4.0.0)                    
#>  testthat      2.3.2      2020-03-02 [1] CRAN (R 4.0.0)                    
#>  tibble      * 3.0.3      2020-07-10 [1] CRAN (R 4.0.0)                    
#>  tidyr       * 1.1.1      2020-07-31 [1] CRAN (R 4.0.0)                    
#>  tidyselect    1.1.0      2020-05-11 [1] CRAN (R 4.0.0)                    
#>  tidyverse   * 1.3.0      2019-11-21 [1] CRAN (R 4.0.0)                    
#>  usethis       1.6.1      2020-04-29 [1] CRAN (R 4.0.0)                    
#>  utf8          1.1.4      2018-05-24 [1] CRAN (R 4.0.0)                    
#>  vctrs         0.3.4      2020-08-29 [1] CRAN (R 4.0.0)                    
#>  withr         2.2.0      2020-04-20 [1] CRAN (R 4.0.0)                    
#>  xfun          0.17       2020-09-09 [1] CRAN (R 4.0.0)                    
#>  xml2          1.3.2      2020-04-23 [1] CRAN (R 4.0.0)                    
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                    
#> 
#> [1] /home/mdneuzerling/R/x86_64-pc-linux-gnu-library/4.0
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library

The Julia logo at the top of this image is the intellectual property of Stefan Karpinski, who allows its use for non-commercial purposes.

The S4 system is capable of multiple dispatch. ↩︎
This example was discussed in a StackOverflow post that I can no longer find. ↩︎