Using Vagrant to scale data science projects with Google Cloud Compute

Categories: Linux, Statistics

Author: Vinh Nguyen

Published: March 7, 2016

When I was in graduate school, I made heavy use of the school's computing infrastructure for my research, scheduling many simulation jobs that ran in parallel across multiple (if not all) compute nodes using Grid Engine. In my professional life, I've always pushed to have a computing environment dedicated to research and analysis. This was typically a Linux or UNIX server with many CPU cores (64 in one case) and as much RAM as I could get. The difficulty in getting approval for such an environment depended on the company's culture, so YMMV. The beauty of the work setup over the graduate school setup is that a job scheduler was never needed, since far fewer concurrent users are vying for compute cycles at work.

When building a computing environment, I always tried to build the beefiest server possible (translation: the beefiest I could get approval for) because I never wanted to run into a project that the server couldn't handle (e.g., loading a very large data set into memory with R). However, it's hard to future-proof against all projects, so the line had to be drawn somewhere, and thus the number of CPU cores and the amount of memory had a limit.

Now, with the ubiquity of computing environments offered by different cloud providers (e.g., Amazon EC2, Google Compute Engine, Microsoft Azure, HP Public Cloud, Sense, and Domino Data Labs), spinning up an on-demand compute server or a cluster of nodes to scale data analysis projects is simple and relatively cheap (no need to invest thousands of dollars to build a beefy server that reaches full capacity less than 1% of the time). One could leverage these cloud services both in the work environment (if one could get it approved) and for personal use (e.g., a Kaggle competition).

Sense and Domino Data Labs offer servers pre-configured with many standard open source tools for data analysis. They are good options for quickly jumping into analysis. With the more generic cloud providers, one typically spins up a server with a standard Linux OS and then proceeds to install and configure the necessary tools. To streamline the scaling process, Vagrant allows one to spin up, provision, manage, and destroy servers from the various providers in a simple and consistent manner. I'll illustrate how one might use Vagrant to spin up a compute server on the Google Compute Engine (GCE) to analyze data. I'm choosing GCE because, at the time of this writing, it appears to be the cheapest and because it charges by the minute (10-minute minimum), unlike Amazon EC2, which charges by the hour.
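
To make the billing difference concrete, here is a minimal sketch of how many minutes get billed under each scheme as described above: per-minute billing with a 10-minute minimum versus rounding up to whole hours. The function names are my own, and actual provider billing rules may differ from what was true at the time of writing.

```shell
#!/usr/bin/env bash
# Billed minutes under per-minute billing with a 10-minute minimum
# (the GCE scheme described in this post).
per_minute_billed() {
  local minutes=$1
  if [ "$minutes" -lt 10 ]; then
    echo 10
  else
    echo "$minutes"
  fi
}

# Billed minutes when usage is rounded up to whole hours
# (the EC2 scheme described in this post).
hourly_billed() {
  local minutes=$1
  echo $(( (minutes + 59) / 60 * 60 ))
}

# Example: a 65-minute job
per_minute_billed 65   # prints 65
hourly_billed 65       # prints 120
```

For short, bursty jobs the gap is largest: a 65-minute job is billed as 65 minutes under the first scheme and as 120 minutes under the second.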

# authenticate and choose a default zone; I chose "us-central1-b", which shows up in subsequent sections
gcloud init
vagrant plugin install vagrant-google
vagrant box add gce https://github.com/mitchellh/vagrant-google/raw/master/google.box

# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure("2") do |config|
  config.vm.box = "gce"

  config.vm.provider :google do |google, override|
    google.google_project_id = "MY_PROJECT_ID"
    google.google_client_email = "MY_ACCOUNT@MY_PROJECT_ID.iam.gserviceaccount.com"
    google.google_json_key_location = "/absolute/path/to/google_json_key.json"
    google.zone = "us-central1-b"
    google.machine_type = "n1-standard-1"
    # google.machine_type = "n1-highmem-2"
    # google.machine_type = "n1-standard-16"
    # google.machine_type = "n1-highmem-8"
    google.name = "gce-instance"
    google.image = "ubuntu-1404-trusty-v20160222" # may change, use "gcloud compute images list" to check
    # google.image = "image-ds"
    google.disk_name = "disk-ds" # ds for data science
    google.disk_size = "10"
    override.ssh.username = "MY_USERNAME"
    override.ssh.private_key_path = "/path/to/.ssh/id_rsa"
  end
  config.vm.provision :shell, path: "provision.sh"
end

#!/usr/bin/env bash

touch /tmp/foo # whoami?

# configure mail so I could communicate status updates
# http://askubuntu.com/questions/12917/how-to-send-mail-from-the-command-line,
# http://www.binarytides.com/linux-mail-command-examples/
# http://superuser.com/questions/795883/how-can-i-set-a-default-account-in-heirloom-mailx
cat > /tmp/nail.rc <<EOF
set smtp-use-starttls
set smtp-auth=login
set smtp=smtp://smtp.gmail.com:587
set smtp-auth-user=MY_GMAIL@gmail.com
set smtp-auth-password="MY_GMAIL_PW"
EOF
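# append as root; inside a script, "sudo su" opens a separate root shell and the following lines would not run in it
sudo tee -a /etc/nail.rc < /tmp/nail.rc > /dev/null
rm /tmp/nail.rc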

# configure screen
cat > ~/.screenrc <<EOF
escape ^lL
bind c screen 1
bind 0 select 10
screen 1
select 1
autodetach on
startup_message off
EOF

# Software installation
echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" | sudo tee /etc/apt/sources.list.d/r.list > /dev/null
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt-get update
sudo apt-get -y install r-base-dev libcurl4-openssl-dev libssl-dev git
sudo apt-get -y install heirloom-mailx # setup mail

# R
wget https://mran.revolutionanalytics.com/install/mro/3.2.3/MRO-3.2.3-Ubuntu-14.4.x86_64.deb
wget https://mran.revolutionanalytics.com/install/mro/3.2.3/RevoMath-3.2.3.tar.gz
sudo dpkg -i MRO-3.2.3-Ubuntu-14.4.x86_64.deb
tar -xzf RevoMath-3.2.3.tar.gz
cd RevoMath
sudo ./RevoMath.sh # choose option 1, then agree
cd ..
rm -rf RevoMath
sudo R --vanilla <<EOF
install.packages(c("data.table","readr","randomForest","gbm","glmnet","ROCR","devtools"), repos="http://cran.rstudio.com")
# options(unzip = 'internal') # vinh https://github.com/RevolutionAnalytics/RRO/issues/37
# devtools::install_github("dmlc/xgboost", subdir = "R-package")
install.packages("drat", repos="https://cran.rstudio.com") # https://github.com/dmlc/xgboost/issues/776
drat::addRepo("dmlc")
install.packages("xgboost", repos="http://dmlc.ml/drat/", type="source")
EOF

# Python
wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.2.0-Linux-x86_64.sh
bash Anaconda-2.2.0-Linux-x86_64.sh
# the installer is interactive: scroll through the license and answer yes

# VW
sudo apt-get -y install libtool libboost1.55-*

# Java
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java7-installer

# H2O
# http://www.h2o.ai/download/h2o/r
sudo R --vanilla <<EOF
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
for (pkg in pkgs) {
    if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg, repos='http://cran.rstudio.com/') }
}

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-tukey/6/R")))
library(h2o)
EOF
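
Since several of the steps above prompt for input or depend on external downloads, it can help to confirm afterwards that the tools actually landed on the PATH. A minimal sketch follows; check_tools is a hypothetical helper name and is not part of the provisioning script itself.

```shell
#!/usr/bin/env bash
# check_tools: report which of the given commands are missing from PATH.
# Returns 0 if all are present, 1 otherwise.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" > /dev/null 2>&1; then
      echo "ok:      $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the server, after provisioning:
#   check_tools R python git java mailx screen
```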

# keep the root disk when the instance is destroyed
# https://cloud.google.com/compute/docs/disks/persistent-disks#updateautodelete
gcloud compute instances set-disk-auto-delete gce-instance --no-auto-delete --disk disk-ds
vagrant destroy --force
# create a reusable image from the preserved disk, then confirm it is listed
gcloud compute images create image-ds --source-disk disk-ds --source-disk-zone us-central1-b
gcloud compute images list

# bring the instance back up, re-enable auto-delete, and destroy it
vagrant up
gcloud compute instances set-disk-auto-delete gce-instance --auto-delete --disk disk-ds
vagrant destroy # now the disk is also destroyed; only the image is left

# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure("2") do |config|
  config.vm.box = "gce"

  config.vm.provider :google do |google, override|
    google.google_project_id = "MY_PROJECT_ID"
    google.google_client_email = "MY_ACCOUNT@MY_PROJECT_ID.iam.gserviceaccount.com"
    google.google_json_key_location = "/absolute/path/to/google_json_key.json"
    google.zone = "us-central1-b"
    google.machine_type = "n1-standard-1"
    # google.machine_type = "n1-highmem-2"
    # google.machine_type = "n1-standard-16"
    # google.machine_type = "n1-highmem-8"
    google.name = "gce-instance"
    # google.image = "ubuntu-1404-trusty-v20160222"
    google.image = "image-ds"
    google.disk_name = "disk-ds" # ds for data science
    google.disk_size = "10"
    override.ssh.username = "MY_USERNAME"
    override.ssh.private_key_path = "/path/to/.ssh/id_rsa"
  end
  config.vm.provision :shell, path: "provision.sh"
end

When one needs more CPUs, memory, or disk space, just modify google.machine_type and google.disk_size per the pricing sheet.
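
If the machine type changes often, editing the Vagrantfile by hand gets tedious. The following is a minimal sketch of a helper that rewrites the active (uncommented) setting in place; set_vagrant_option is a hypothetical name of my own, and it assumes the setting sits on its own line with no trailing comment, as google.machine_type and google.disk_size do in the Vagrantfile above.

```shell
#!/usr/bin/env bash
# set_vagrant_option FILE KEY VALUE
# Rewrites the active (uncommented) `google.KEY = "..."` line in FILE.
# Commented-out alternatives (lines starting with #) are left untouched.
# Anything after the "=" on the matched line is replaced.
set_vagrant_option() {
  local file=$1 key=$2 value=$3
  sed -E "s|^([[:space:]]*)google\.${key}[[:space:]]*=.*|\1google.${key} = \"${value}\"|" \
    "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Usage, from the directory holding the Vagrantfile:
#   set_vagrant_option Vagrantfile machine_type n1-highmem-8
#   set_vagrant_option Vagrantfile disk_size 50
#   vagrant up
```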

With this setup, I can put relevant files in vagrant_google/my_project. My current workflow is something like the following:

vagrant up # provision server
vagrant ssh # log in
screen # start a persistent terminal session
# paste the following
mailaddr=MY_CONTACT_EMAIL
projdir=/vagrant/my_project
logdir=/vagrant/my_project/log
mkdir -p "$logdir"

# paste in code that I want to run, eg
R --no-save < "$projdir/myfile.R" > "$logdir/myfile.Rlog"
python "$projdir/myfile.py" > "$logdir/myfile.pylog"
echo '' | mailx -s 'Job Done' "$mailaddr"
# detach screen session
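
One limitation of the final mailx line is that it says "Job Done" whether or not the job succeeded. A minimal sketch of a wrapper that puts the exit status in the subject line follows; run_and_notify is a hypothetical name, and it assumes mailx is configured as in the provisioning script.

```shell
#!/usr/bin/env bash
# run_and_notify ADDRESS COMMAND [ARGS...]
# Runs the command, then mails ADDRESS a subject line saying whether it
# succeeded or failed. Returns the command's own exit status.
run_and_notify() {
  local addr=$1 status=0
  shift
  "$@" || status=$?
  if [ "$status" -eq 0 ]; then
    echo "" | mailx -s "Job Done: $*" "$addr"
  else
    echo "" | mailx -s "Job FAILED (exit $status): $*" "$addr"
  fi
  return "$status"
}

# Usage inside the screen session:
#   run_and_notify "$mailaddr" python "$projdir/myfile.py" > "$logdir/myfile.pylog"
```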

When the job is done, I receive a notification email. While the job is running, I can use vagrant ssh to log into the server and re-attach my screen session to inspect what is going on or make changes. When I'm done, I can just destroy the server (the content of /vagrant/ will get re-synced to vagrant_google). I could also create batch jobs that start the server, run code, notify me, and destroy the server in one shot if I didn't want to run my job interactively.
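
The one-shot batch idea could be sketched roughly as follows. This is a sketch under my own assumptions rather than a tested setup: run_batch and the DRY_RUN switch are hypothetical names, the job is passed as a single command string to vagrant ssh -c, and the job itself is expected to send the notification and leave its outputs somewhere that survives the destroy.

```shell
#!/usr/bin/env bash
# run_batch REMOTE_COMMAND
# Starts the server, runs REMOTE_COMMAND on it over ssh, then destroys it.
# With DRY_RUN=1 the vagrant commands are printed instead of executed.
run_batch() {
  local remote_cmd=$1 status=0
  local run=""
  [ "${DRY_RUN:-0}" = "1" ] && run="echo"

  $run vagrant up --provider=google || return $?
  $run vagrant ssh -c "$remote_cmd" || status=$?
  # Destroy the server even if the job failed, to stop the billing.
  $run vagrant destroy --force
  return "$status"
}

# Usage, from the directory holding the Vagrantfile:
#   run_batch 'python /vagrant/my_project/myfile.py; echo "" | mailx -s "Job Done" MY_CONTACT_EMAIL'
```

Running it first with DRY_RUN=1 prints the three vagrant commands without touching any cloud resources, which is a cheap way to check the job string before paying for a server.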

Conclusion: With Vagrant, open source tools, and the cloud providers, I can spin up as many compute resources as I need to scale out my data analysis project. If I needed to build stand-alone applications, I could also incorporate Docker.