Everybody Loves Raymond: Running Animal Crossing Villagers through the Google Vision API

Animal Crossing: New Horizons kept me sane throughout the first Melbourne COVID lockdown. Now, in lockdown 4, it seems right that I should look back at this cheerful, relaxing game and do some data stuff. I’m going to take the villagers in the Tidy Tuesday Animal Crossing dataset and combine them with survey data from the Animal Crossing Portal, giving each villager a measure of popularity. I’ll use the Google Cloud Vision API to annotate each of the villager thumbnails, and use these annotations to train a (pretty poor) model of villager popularity.
Some Dockerfiles for Building R Package Binaries

R
I went down a strange path recently, trying to compile binaries of R packages for Linux. I’m not sure why, since this area is pretty much covered by the RStudio Package Manager. I’ll leave my Dockerfiles here in case they’re of any use to a future wayward R programmer. The intention here is to build a Docker image that can build an R binary with the command below. I’m trying to build x86 binaries on my ARM MacBook, so I’m specifying the platform during both build and run.
Machine Learning Workflows with Julia

I have a simple machine learning workflow that I recreate whenever I’m testing something new. I take some interesting data and a target, throw in some pre-processing, tune hyperparameters with cross-validation, and train a random forest. It’s all the basic ingredients for a machine learning model. Since I like Julia so much, I’ll recreate my simple machine learning workflow with Julia’s MLJ package. MLJ is like R’s parsnip, in that it unifies many machine learning packages with disparate APIs under a single syntactic umbrella.
Using Metaflow to Make Model Tuning Less Painful

I have a machine learning model that takes some time to train. Data pre-processing and model fitting can take 15–20 minutes. That’s not so bad, but I also want to tune my model to make sure I’m using the best hyperparameters. With 16 different combinations of hyperparameters and 5-fold cross-validation, my 20 minutes can become a day or more. Metaflow is an open-source tool from the folks at Netflix that can be used to make this process less painful.
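To see where "a day or more" comes from, here is a rough back-of-the-envelope sketch in Python. The grid itself is an assumption: the parameter names and values are hypothetical and chosen only so that there are 16 combinations, and every fit is assumed to take the full 20 minutes and to run one after another.

```python
from itertools import product

# Hypothetical grid: 4 x 4 = 16 combinations, matching the count in the text.
# The parameter names and values are illustrative, not the ones from the post.
grid = {
    "max_depth": [2, 4, 6, 8],
    "n_trees": [100, 200, 400, 800],
}
folds = 5
minutes_per_fit = 20  # upper end of the 15-20 minute range

combinations = list(product(*grid.values()))
total_fits = len(combinations) * folds            # one fit per combination per fold
total_hours = total_fits * minutes_per_fit / 60   # assumes strictly sequential fits

print(f"{len(combinations)} combinations x {folds} folds = {total_fits} fits")
print(f"about {total_hours:.1f} hours if run one after another")
```

That works out to 80 fits and a bit under 27 hours of sequential training, which is why running the fits in parallel is so appealing.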
R on AWS Lambda with Containers

AWS has announced support for container images for their serverless computing platform Lambda. AWS doesn’t provide an R runtime for Lambda, and this was the excuse I needed to finally try to make one. An R runtime means that I can take advantage of AWS Lambda to put my R functions in the cloud. I don’t have to worry about provisioning servers or spinning up containers — the function itself is the star.
Determining system dependencies for R projects

R
Locking down R package dependencies and versions is a solved problem, thanks to the easy-to-use renv package. System dependencies (those Linux packages that need to be installed to make certain R packages work) are a bit harder to manage.

Option 1: Hard-coding

The easiest option is to hard-code the system dependencies. I did this recently when I was creating a Dockerfile for a very simple Plumber API:

    RUN apt-get update -qq && apt-get -y --no-install-recommends install \
        make \
        libsodium-dev \
        libicu-dev \
        libcurl4-openssl-dev \
        libssl-dev

My Dockerfile used only three R packages and so its system dependencies were not complicated.
Hosting a Plumber API with Kubernetes

I’ve set myself an ambitious goal of building a Kubernetes cluster out of a couple of Raspberry Pis. This is pretty far out of my sphere of knowledge, so I have a lot to learn. I’ll be writing some posts to publish my notes and journal my experience publicly. In this post I’ll go through the basics of Kubernetes, and how I hosted a Plumber API in a Kubernetes cluster on Google Cloud Platform.
First Impressions of Julia from an R User

It’s no secret that I love R and begrudgingly use Python. But there’s another option for data science, and it promises the speed of C with the ease of use of R/Python. That language is Julia, and it’s a delight to use. I took some time to learn the basics, and I’m sharing my impressions here.

Julia is not the most popular language in the world

Before I go on, there’s one thing I want to stress here: Julia is not as popular as Python or R for doing stuff with data.
Sourcing Data from S3 with Drake

R
drake is a package for orchestrating R workflows. Suppose I have some data in S3 that I want to pull into R through a drake plan. In this post I’ll use the S3 object’s ETag to make drake only re-download the data if it’s changed. This covers the scenario in which the object name in S3 stays the same. If I had, say, data being uploaded each day with an object name suffixed with the date, then I wouldn’t bother checking for any changes.
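The idea of re-downloading only when the ETag changes can be sketched independently of drake. The snippet below is a language-agnostic illustration written in Python rather than R. It is not drake's API, and every name in it (`fetch_if_changed`, `fake_download`, the ETag strings) is hypothetical. A real version would read the ETag from the S3 object's metadata instead of passing it in by hand.

```python
from typing import Callable, Optional, Tuple


def fetch_if_changed(
    remote_etag: str,
    cached_etag: Optional[str],
    cached_data: Optional[bytes],
    download: Callable[[], bytes],
) -> Tuple[str, bytes, bool]:
    """Return (etag, data, downloaded), downloading only when the ETag changed."""
    if cached_data is not None and remote_etag == cached_etag:
        # Same ETag as last time: reuse the cached copy.
        return remote_etag, cached_data, False
    # No cache yet, or the object changed: fetch it again.
    return remote_etag, download(), True


# Stand-in for a real S3 download; the ETag values below are made up.
def fake_download() -> bytes:
    return b"col_a,col_b\n1,2\n"


# First run: nothing cached, so the data is downloaded.
etag, data, first_run = fetch_if_changed("abc123", None, None, fake_download)
# Second run: the ETag is unchanged, so the cached copy is reused.
etag, data, second_run = fetch_if_changed("abc123", etag, data, fake_download)
# Third run: the ETag changed, so the data is downloaded again.
etag, data, third_run = fetch_if_changed("def456", etag, data, fake_download)
```

The object name never changes in this scenario, so the ETag is the only signal that the contents are different.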
Tracking Tidymodels with MLflow

R
After I posted my efforts to use MLflow to serve a model with R, I was worried that people may think I don’t like MLflow. I want to declare this: MLflow is awesome. I’ll showcase its model tracking features, and how to integrate them into a tidymodels model. The Tracking component of MLflow can be used to record parameters, metrics and artifacts every time a model is trained. All of this information is presented in a very nice user interface.