Using Vagrant to scale data science projects with Google Cloud Compute

Author

Vinh Nguyen

Published

March 7, 2016

When I was in graduate school, I made heavy use of the school's computing infrastructure for my research by scheduling many simulation jobs, utilizing multiple (if not all) compute nodes in parallel using Grid Engine. In my professional life, I've always pushed to have a computing environment dedicated for research and analysis. This was typically in the form of a Linux or UNIX server with many CPU cores (64 one) and as much ram as I could get. The difficulty in getting the approval to have such an environment depended on the company's culture, so YMMV. The beauty of the work setup over the graduate school setup is that a job scheduler was never needed as the number of concurrent users vying for compute cycles are drastically less at work. When building a computing environment, I always try to build the beefiest server possible (translation: that I could get approval for) because I never want to run into a project that the server couldn't handle (eg, loading a very large data set into memory with R). However, it's hard to future-proof all projects completely so the line had to be drawn somewhere and thus the number of CPU cores and memory had a limit.

Now, with the ubiquity of computing environments offered by different cloud providers (eg, Amazon EC2, Google Compute Engine, Microsoft Azure, HP Public Cloud, Sense, and Domino Data Labs), spinning up an on-demand compute server or a cluster of nodes to scale data analysis projects is pretty simple, straightforward, and relatively cheap (no need to invest thousands of dollars to build a beefy server that reaches full capacity in <1% of the time). One could leverage these cloud services both in the work environment (if one could get it approved) and for personal use (eg, a Kaggle competition).

Sense and Domino Data Labs have servers pre-configured with many standard open source tools for data analysis. They are good options for quickly jumping into analysis. With the more generic cloud providers, one typically spins up a server with a standard Linux OS and then proceed to install and configure the necessary tools. To streamline the scaling process, Vagrant allows one to spin up, provision, manage, and destroy servers from the various providers in a simple and consistent manner. I'll illustrate how one might use Vagrant to spin up a compute server on the Google Compute Engine (GCE) to analyze data. I'm only choosing GCE at the time of this writing because it appears to be the cheapest and because it charges by the minute (10 minutes miminum), unlike Amazon EC2 which charges by the hour.

Install Vagrant
Install and configure the Google Cloud SDK:

gcloud init # authenticate; choose a default zone, I chose "us-central1-b", which will show up in subsequent sections

Install the Vagrant-Google plugin. On Windows, Vagrant-Google version 1.8.1+ is required, which hasn't been released at the time of this writing. A work-around is to spin up an Ubuntu server using Vagrant and Virtualbox on Windows, install Vagrant in that Ubuntu server, and go from there. Run the following to install the Vagrant-Google plugin:

vagrant plugin install vagrant-google

Follow the instructions here to create a Service Account within the GCE console (website), download the json private key (used by the Google Cloud SDK, call it google_json_key.json), copy the api service account email address and project ID, and add your ssh public key (eg, ~/.ssh/id_rsa.pub) to the Compute Engine's metadata.
Add the GCE box (template) to Vagrant:

vagrant box add gce https://github.com/mitchellh/vagrant-google/raw/master/google.box

Create a project folder (eg, vagrant_google) and cd into it. Everything in this folder will get synced to /vagrant/ for any provisioned server.
Create the file Vagrantfile:

# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure("2") do |config|
  config.vm.box = "gce"

  config.vm.provider :google do |google, override|
    google.google_project_id = "MY_PROJECT_ID"
    google.google_client_email = "MY_ACCOUNT@MY_PROJECT_ID.iam.gserviceaccount.com"
    google.google_json_key_location = "/absolute/path/to/google_json_key.json"
    google.zone = "us-central1-b"
    google.machine_type= "n1-standard-1"
    # google.machine_type = "n1-highmem-2"
    # google.machine_type = "n1-standard-16"
    # google.machine_type = "n1-highmem-8"
    google.name = "gce-instance"
    google.image = "ubuntu-1404-trusty-v20160222" # may change, use "gcloud compute images list" to check
    # google.image = "image-ds"
    google.disk_name = "disk-ds" # ds for data science
    google.disk_size = "10"
    override.ssh.username = "MY_USERNAME"
    override.ssh.private_key_path = "/path/to/.ssh/id_rsa"
  end
  config.vm.provision :shell, path: "provision.sh"

Create the file provision.sh (this gets run whenever a server is provisioned via vagrant up):

#! /usr/bin/env bash

touch /tmp/foo # whoami?

# configure mail so I could communicate status updates
# http://askubuntu.com/questions/12917/how-to-send-mail-from-the-command-line,
# http://www.binarytides.com/linux-mail-command-examples/
# http://superuser.com/questions/795883/how-can-i-set-a-default-account-in-heirloom-mailx
cat > /tmp/nail.rc <<EOF
set smtp-use-starttls
set smtp-auth=login
set smtp=smtp://smtp.gmail.com:587
set smtp-auth-user=MY_GMAIL@gmail.com
set smtp-auth-password="MY_GMAIL_PW"
EOF
sudo su
cat /tmp/nail.rc >> /etc/nail.rc
exit
rm /tmp/nail.rc

# configure screen
cat > ~/.screenrc <<EOF
escape ^lL
bind c screen 1
bind 0 select 10
screen 1
select 1
autodetach on
startup_message off
EOF

Run vagrant up and Vagrant will provision the server.
Run vagrant ssh to log into the server via ssh.
Install the necessary tools:

# Software installation
sudo su
echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" > /etc/apt/sources.list.d/r.list
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
apt-get update 
apt-get -y install r-base-dev libcurl4-openssl-dev libssl-dev git
apt-get -y install heirloom-mailx # setup mail
exit # sudo

# R
wget https://mran.revolutionanalytics.com/install/mro/3.2.3/MRO-3.2.3-Ubuntu-14.4.x86_64.deb
wget https://mran.revolutionanalytics.com/install/mro/3.2.3/RevoMath-3.2.3.tar.gz
sudo dpkg -i MRO-3.2.3-Ubuntu-14.4.x86_64.deb
tar -xzf RevoMath-3.2.3.tar.gz
cd RevoMath
sudo ./RevoMath.sh # choose option 1, then agree
cd ..
rm -rf RevoMath
sudo R --vanilla <<EOF
install.packages(c("data.table","readr","randomForest","gbm","glmnet","ROCR","devtools"), repos="http://cran.rstudio.com")
# options(unzip = 'internal') # vinh https://github.com/RevolutionAnalytics/RRO/issues/37
# devtools::install_github("dmlc/xgboost", subdir = "R-package")
install.packages("drat", repos="https://cran.rstudio.com") # https://github.com/dmlc/xgboost/issues/776
drat:::addRepo("dmlc")
install.packages("xgboost", repos="http://dmlc.ml/drat/", type="source")
EOF

# Python
wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.2.0-Linux-x86_64.sh
bash Anaconda-2.2.0-Linux-x86_64.sh
# scroll and yes

# VW
sudo apt-get -y install libtool libboost1.55-*

# Java
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java7-installer

# H2O
# http://www.h2o.ai/download/h2o/r
sudo R --vanilla <<EOF
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
for (pkg in pkgs) {
    if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg, repos='http://cran.rstudio.com/') }
}

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-tukey/6/R")))
library(h2o)
EOF

Exit out of the ssh session. The server is still running. Let's set the root disk to not delete when the server is destroyed. Then save the root disk as an image. With this saved private image, one can provision multiple servers using the same image without having to re-install all the previously installed software on gce-instance. Once we have a saved image, restart the instance with the previous root disk, set the disk to automatically delete, and then delete the instance again (this time, the disk will also be destroyed). What we have left is an image that could be re-used. Image storage is not that costly if one wants to keep it around for quickly provisioning their computing environment.

gcloud compute instances set-disk-auto-delete gce-instance --no-auto-delete --disk disk-ds # don't delete root disk https://cloud.google.com/compute/docs/disks/persistent-disks#updateautodelete
vagrant destroy --force
gcloud compute images create image-ds --source-disk disk-ds --source-disk-zone us-central1-b
gcloud compute images list
vagrant up
gcloud compute instances set-disk-auto-delete gce-instance --auto-delete --disk disk-ds
vagrant destroy # now disk is also destroyed, only image is left.

With the saved private image, we could modify Vagrantfile to use this image whenever we spin up a server:

# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure("2") do |config|
  config.vm.box = "gce"

  config.vm.provider :google do |google, override|
    google.google_project_id = "MY_PROJECT_ID"
    google.google_client_email = "MY_ACCOUNT@MY_PROJECT_ID.iam.gserviceaccount.com"
    google.google_json_key_location = "/absolute/path/to/google_json_key.json"
    google.zone = "us-central1-b"
    google.machine_type= "n1-standard-1"
    # google.machine_type = "n1-highmem-2"
    # google.machine_type = "n1-standard-16"
    # google.machine_type = "n1-highmem-8"
    google.name = "gce-instance"
    # google.image = "ubuntu-1404-trusty-v20160222"
    google.image = "image-ds"
    google.disk_name = "disk-ds" # ds for data science
    google.disk_size = "10"
    override.ssh.username = "MY_USERNAME"
    override.ssh.private_key_path = "/path/to/.ssh/id_rsa"
  end
  config.vm.provision :shell, path: "provision.sh"

When one needs more CPU's, memory, or disk size, just modify google.machine_type and google.disk_size per the pricing sheet.

With this setup, I can put relevant files in vagrant_google/my_project. My current work flow is something like the following:

vagrant up # provision server
vagrant ssh # log in
screen # start a persistent terminal session
# paste the following
mailaddr=MY_CONTACT_EMAIL
projdir=/vagrant/my_project
logdir=/vagrant/my_project/log
mkdir -p $logdir

# paste in code that I want to run, eg
R --no-save < $projdir/myfile.R > $logdir/myfile.Rlog
python $projdir/myfile.py > $logdir/myfile.pylog
echo '' | mailx -s 'Job Done' $mailaddr
# detach screen session

When the job is done, I will receive a notification email. When the job is currently running, I can use vagrant ssh to log into the server and re-attach my running screen session to inspect what is going on or make any changes. When I'm done, I could just destroy the server (the content of /vagrant/ will get re-synced to vagrant_google). I could also create batch jobs to start the server, run code, notify me, and destroy the server in one shot if I didn't want to run my job interactively.

Conclusion: With Vagrant, open source tools, and the cloud providers, I can spin up as much compute resources as I need in order to scale out my data analysis project. If I had the need to build stand-alone applications, I could also incorporate Docker.