Set up a Data Science Ubuntu VM for Development

The steps below leverage information from a couple of online guides, along with some modifications and re-sequencing that I needed before I could get my VM configured 100%.

STEP 1: Install VMware tools after creating Ubuntu VM

  • Extract VMwareTools.x.x.x-xxxx.tar.gz:
    • Power on the virtual machine.
    • Log in to the virtual machine using an account with administrator or root privileges. Select:
      • For Fusion: Virtual Machine > Install VMware Tools.
      • For Workstation: VM > Install VMware Tools.
      • For Player: Player > Manage > Install VMware Tools.
    • Open the VMware Tools CD mounted on the Ubuntu desktop.
    • Right-click the file name that is similar to VMwareTools.x.x.x-xxxx.tar.gz, click Extract to, and select Ubuntu Desktop to save the extracted contents.
    • The vmware-tools-distrib folder is extracted to the Ubuntu Desktop.
  • Install VMware Tools in Ubuntu:
    • Open a Terminal window.
    • In the Terminal, run this command to navigate to the vmware-tools-distrib folder:
      • cd Desktop/vmware-tools-distrib
    • Run this command to install VMware Tools:
      • sudo ./vmware-install.pl -d
  • Restart the Ubuntu virtual machine after the VMware Tools installation completes.
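To verify that the installation succeeded, you can optionally run the following in a Terminal once the VM has restarted (vmware-toolbox-cmd ships with VMware Tools and should print the installed version):

$ vmware-toolbox-cmd -v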

While Anaconda is a quick and easy way of getting Python libraries installed, I chose not to go with this option following some lessons learnt about package management and versioning during a recent Deep Learning project. The following steps therefore install all required packages manually, in a sequence that I found works best for me.
 

STEP 2: Install PRE-REQUISITE PYTHON LIBRARIES ON THE VM

$ sudo apt-get -y install git curl vim tmux htop ranger
$ sudo apt-get -y install python-dev python-pip
$ sudo apt-get -y install python-serial python-setuptools python-smbus

 

STEP 3: SET UP A CONTAINER FOR DATA SCIENCE ON THE VM

$ sudo pip install virtualenv
$ cd ~/
$ mkdir venv
$ pushd venv
$ virtualenv data-science
$ popd
$ source ~/venv/data-science/bin/activate
$ pip install --upgrade setuptools
$ pip install virtualenvwrapper
$ pip install cython
$ pip install nose
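Note that the data-science environment only applies to the current shell session. In any new Terminal, re-activate it before running the pip installs in Step 4 below; deactivate drops you back to the system Python:

$ source ~/venv/data-science/bin/activate
$ deactivate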

 

STEP 4: INSTALL PYTHON DATA SCIENCE LIBRARIES ON THE VM

$ sudo apt-get -y install python-numpy python-scipy python-matplotlib
$ sudo apt-get -y install ipython ipython-notebook
$ sudo apt-get -y install python-pandas python-sympy python-nose
$ pip install jupyter
$ sudo apt-get -y install libfreetype6-dev libpng12-dev libjs-mathjax
$ sudo apt-get -y install fonts-mathjax libgcrypt11-dev libxft-dev
$ pip install matplotlib
$ sudo apt-get install libatlas-base-dev gfortran
$ pip install Seaborn && pip install statsmodels
$ pip install scikit-learn && pip install numexpr
$ pip install bottleneck && pip install pandas
$ pip install SQLAlchemy && pip install pyzmq
$ pip install jinja2 && pip install tornado
$ pip install nltk && pip install gensim
$ pip install tensorflow && pip install keras
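With the data-science environment still active, a quick sanity check (a minimal one-liner; trim the module list to whatever you actually need) confirms that the core libraries import cleanly:

$ python -c "import numpy, pandas, sklearn, matplotlib, tensorflow, keras; print('imports OK')"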

 

STEP 5: CONFIGURE NETWORKING, SSH, FIREWALL & JUPYTER ON THE VM

  • On the host O/S, browse to the following setting via the VMware Workstation menu: VM > Settings > Network Adapter, and set it to Bridged (Automatic).
  • Log in to the VM and run the following to enable SSH, allow SSH through the firewall from the host (replace host_ip with your host machine's IP address), and start the Jupyter notebook server:
$ sudo apt-get install openssh-server
$ sudo ufw allow from host_ip to any port 22
$ nice jupyter notebook

Once the Jupyter notebook server is running in the Terminal after the last command, make a note of the token query-string parameter at the end of the notebook URL, e.g. http://localhost:8888/?token=62gwj3k28djdkelsnab7293kkl0172hdu3ks9al7sj3kb1j
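The PuTTY session in the next step connects to the VM's bridged IP address, which you can look up on the VM with:

$ ip addr show

(or ifconfig on older Ubuntu releases).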

 

STEP 6: SSH TUNNEL FROM HOST WINDOWS O/S TO GUEST UBUNTU VM

Set up the following in PuTTY on Windows to enable SSH tunnelling. The below screenshot shows the session settings for the Windows host.

[Screenshot: PuTTY session settings on the Windows host]

The below screenshot shows the port-forwarding settings, i.e. the Jupyter server URL http://localhost:8000 on the Windows host O/S forwards to http://localhost:8888 on the Ubuntu VM.

[Screenshot: PuTTY tunnel settings forwarding local port 8000 to localhost:8888 on the VM]
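If you prefer the command line to PuTTY (for example from any OpenSSH client on the host), an equivalent tunnel can be opened with a single command; ubuntu_user and vm_ip below are placeholders for your own VM account and bridged IP address:

$ ssh -L 8000:localhost:8888 ubuntu_user@vm_ip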

 

STEP 7: BROWSE TO JUPYTER NOTEBOOK ON HOST O/S

Browse to http://localhost:8000 on your host O/S and verify that you can access the Jupyter notebook. When run for the first time, the web page will request a token. Enter the token you saved from Step 5.

If you have any problems with connectivity, it is likely due to the Ubuntu guest O/S firewall (ufw); see Step 5 above for the ufw configuration.

Configure Theano & CUDA for Deep Learning on a Mac

STEP 1 – Install Theano
After installing Anaconda, run the following command in a Terminal:

$ conda install theano pygpu
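A quick way to confirm the install (run from the same Terminal) is to import Theano and print its version:

$ python -c "import theano; print(theano.__version__)"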


STEP 2 – Install the correct CUDA driver based on your model of Mac
Browse to the NVIDIA CUDA Mac driver download page to get the correct version of the CUDA driver for your Mac. Make sure to click on the driver link and check the Supported Products tab to determine which Mac hardware that particular driver supports.

For some older Macs, the version 6.5.45 driver is the best choice.


STEP 3 – Install the CUDA toolkit, but don’t upgrade the driver
Download the right version of the CUDA Toolkit for your Mac from the NVIDIA CUDA Toolkit Archive.

For some older Macs, the version 6.5 toolkit is the best choice.


STEP 4 – Install Xcode
Download Xcode from the Apple site and install it on your Mac.


STEP 5 – Install the Xcode Command Line Tools
Open a Terminal and run the following command

$ xcode-select --install

Choose to install the command line tools


STEP 6 – Check the cc compiler
Open a Terminal and run the following command

$ /usr/bin/cc --version

 

STEP 7 – Update your .bash_profile file
Add the following to .bash_profile


export LD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-6.5/lib/
export CUDA_ROOT=/Developer/NVIDIA/CUDA-6.5/
export THEANO_FLAGS='mode=FAST_RUN,device=gpu,floatX=float32'

export PATH=/Developer/NVIDIA/CUDA-6.5/bin${PATH:+:${PATH}}
export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-6.5/lib\
${DYLD_LIBRARY_PATH:+:${DYLD_LIBRARY_PATH}}
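Reload the profile in the current Terminal and confirm the variables are set, for example:

$ source ~/.bash_profile
$ echo $CUDA_ROOT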

 

STEP 8 – Test the NVCC compiler
Run the following in Terminal


$ /Developer/NVIDIA/CUDA-6.5/bin/nvcc -V

 

STEP 9 – Switch to the Samples Directory
Switch to the samples directory that was installed as part of the toolkit


cd /Developer/NVIDIA/CUDA-6.5/samples/

 

STEP 10 – Make the Samples
Run the below one line at a time and make sure you don't get any errors.


make -C 0_Simple/vectorAdd
make -C 0_Simple/vectorAddDrv
make -C 1_Utilities/deviceQuery
make -C 1_Utilities/bandwidthTest

 

STEP 11 – Run the Samples
Switch to the relevant directory to run the compiled files


cd /Developer/NVIDIA/CUDA-6.5/samples/bin/x86_64/darwin/release

Make sure you get the expected output when running the below, line by line.


./deviceQuery
./bandwidthTest

 

STEP 12 – Configure Theano to use the GPU
Create a file called .theanorc in your HOME directory and add the following to it


[blas]
ldflags =

[global]
floatX = float32
device = gpu

[nvcc]
fastmath = True

# The [gcc] cxxflags setting (e.g. -ID:\MinGW\include) is only needed on Windows
# and can be omitted on a Mac.

[cuda]
# Set to where the cuda drivers are installed.
root=/usr/local/cuda/

 


STEP 13 – Run the following to confirm that Theano now uses the GPU


from theano import function, config, shared, tensor
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

Installing Key Data Science Libraries in R

I use the below script to set up my R environment with required Data Science libraries.

Through trial and error, I've found that Mac OS X is my preferred O/S environment for Data Science. Some of the below libraries (like 'bigrf', used for Random Forests on larger datasets) are not available for Windows. Also, for packages that I needed to compile, Mac OS X just worked every time.

It is important to ensure that Java is installed on your system prior to running the below.

STEP 1: INSTALL JAVA & POINT R TO IT

On a Mac, add the following to your .bash_profile (the path below assumes JDK 1.8.0_102; adjust it to match your installed JDK version):

$ export LD_LIBRARY_PATH=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home/jre/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home/jre/lib/server

Run the below to create the relevant symlink:

$ sudo ln -s $(/usr/libexec/java_home)/jre/lib/server/libjvm.dylib /usr/local/lib

Point R to Java

$ sudo R CMD javareconf

Install the pre-requisite rJava package in R

install.packages('rJava', type='source')
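To confirm that R can actually start the JVM, a minimal check is to load rJava, initialise the JVM and ask it for the Java version (this uses the standard rJava .jcall interface):

# Verify that rJava can start the JVM and report the Java version
library(rJava)
.jinit()
.jcall("java/lang/System", "S", "getProperty", "java.version")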

STEP 2: PRE-MODELING STAGE PACKAGES

# Data Visualisation
install.packages("ggvis")
install.packages("ggplot2")
install.packages("googleVis")

# Data Transformation
install.packages("plyr")
install.packages("data.table")

# Missing Value Imputations
install.packages("missForest")
install.packages("missMDA")

# Outlier Detection
install.packages("outliers")
install.packages("evir")

# Feature Selection
install.packages("features")
install.packages("RRF")

# Dimension Reduction
install.packages("FactoMineR")
install.packages("CCP")

STEP 3: MODELING STAGE PACKAGES

# Continuous regression
install.packages("car")
install.packages("randomForest")


# Ordinal regression
install.packages("rminer")
install.packages("CORElearn")


# Classification
install.packages("caret")
install.packages("devtools")
library(devtools)
install_github("aloysius-lim/bigrf")
#install.packages("~/Downloads/bigrf_0.1-11.tar.gz", repos = NULL, type = "source") #LINUX

# Clustering
install.packages("cba")
install.packages("Rankcluster")

# Time Series
install.packages("forecast")
install.packages("ltsa")

# Survival
install.packages("survival")
install.packages("BaSTA")

STEP 4: POST MODELING STAGE PACKAGES

# General Model Validation
install.packages("lsmeans")
install.packages("comparison")

# Regression Validation
install.packages("regtest")
install.packages("ACD")

# Classification Validation
install.packages("binomTools")
install.packages("Daim")

# Clustering Validation
install.packages("clusteval")
install.packages("sigclust")

# ROC Analysis
install.packages("pROC")
install.packages("timeROC")

STEP 5: OTHER USEFUL PACKAGES

# Improve Performance
install.packages("Rcpp")
install.packages("parallel")

# Work with Web
install.packages("XML")
install.packages("jsonlite")
install.packages("httr")

# Report Results
install.packages("shiny")
install.packages("rmarkdown")

# Text Mining
install.packages("tm")
install.packages("twitteR")

# Database 
install.packages("sqldf")

# Install unixodbc first before RODBC
#On a Mac run: brew install unixodbc
install.packages("RODBC")
install.packages("RMongo")

# Miscellaneous 
install.packages("swirl")
install.packages("reshape2")
install.packages("qcc")
install.packages("qdap")

STEP 6: RUN INVENTORY OF INSTALLED PACKAGES

# Get a full list of installed packages
write.csv(installed.packages(), file = "InstalledPackages.csv")

Reading in Large Datasets in R

R can appear to run slowly while processing larger datasets. Some tips for dealing with the preprocessing of Big Data in R are listed below:

TIP 1 – Set the colClasses Argument
While reading in large datasets using read.table(), R can seem to lag as it attempts to infer the class of each column in the background by scanning the dataset. This can slow down the process of reading in the dataset, especially with Big Data.

Set the colClasses argument in read.table() for large datasets. Use the below code to figure out the classes of your columns; it uses only the first 100 rows to infer each column's class:

# Read a small sample and let R infer each column's class
initial <- read.table("datatable.txt", nrows=100)
classes <- sapply(initial, class)
# Re-read the full file with the column classes supplied up front
tabAll <- read.table("datatable.txt", colClasses=classes)


TIP 2 – Set the nrows Argument
Set the nrows argument in read.table(). If you're on a Mac or Linux, use the command-line utility wc to get an estimate of the number of lines/rows in your input file. This doesn't have to be exact, just close enough, as shown below.
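For example, with a hypothetical line count from wc of about 1.5 million (classes comes from Tip 1):

# In a shell: wc -l datatable.txt  (suppose it reports ~1,496,000 lines)
# Supply a slight over-estimate to read.table()
tabAll <- read.table("datatable.txt", colClasses=classes, nrows=1500000)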

TIP 3 – Set the comment.char Argument
If there are no comment lines in the file, set the comment.char argument to the empty string so that R does not scan for comments.
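For example:

# No comment lines in the file, so don't scan for them
tabAll <- read.table("datatable.txt", colClasses=classes, comment.char="")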

TIP 4 – Avoid Resource Contention
When processing large datasets, ensure that there is no resource contention on the machine running your R client, or across the environment if you've chosen to use RStudio Server or Microsoft R Server in a distributed setup.

TIP 5 – Making Decisions about RAM Restrictions
If the size of the dataset is larger than the memory available on your computer, then you will not be able to process the data effectively in R. A 64-bit system can address far more memory than a 32-bit one and will therefore cope with larger datasets.

There is a rough formula that can be used for calculating the amount of RAM required for R to hold the dataset in memory. The MAJOR assumption in the examples below is that all columns are of the numeric class and therefore use 8 bytes/numeric. I use this as a “gut feel” when planning the pre-processing of large datasets.

amount_of_RAM_in_GB <- (no_of_rows * no_of_columns * 8)/2^20/1024

Please note that the total memory the computer needs to run effectively is roughly double the amount calculated above for R alone.

Rough Calculations:

  • 4GB RAM for 2.69M Rows, 200 Columns
  • 16GB RAM for 4.3M Rows, 500 Columns
  • 75GB RAM for 10M Rows, 1000 Columns
  • 15TB RAM for 1B Rows, 2000 Columns
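A small helper that implements the formula above makes it easy to reproduce these figures (a sketch; it assumes every column is numeric at 8 bytes per value, as stated above):

# Estimate the RAM (in GB) R needs to hold an all-numeric dataset in memory
estimate_ram_gb <- function(no_of_rows, no_of_columns) {
  (no_of_rows * no_of_columns * 8) / 2^30
}

estimate_ram_gb(2.69e6, 200)   # ~4 GB
estimate_ram_gb(1e9, 2000)     # ~14900 GB, i.e. roughly 15 TB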

TIP 6 – Remove NA values
Based on your use case, you may decide to exclude missing values prior to developing machine learning algorithms. I've followed an approach of extracting the bad data (rows containing NA) into a separate dataset that I may choose to analyse at a later stage.

See the below example for extracting bad data and retaining clean data. The same approach should also work on a larger dataset.

# Create dummy data for data frame
a <- c(1, 2, 3, 4, NA)
b <- c(6, 7, 8, NA, 10)
c <- c(11, 12, NA, 14, 15)
d <- c(16, NA, 18, 19, 20)
e <- c(21, 22, 23, 24, 25)

# Combine vectors to form a larger data frame
df <- data.frame(a, b, c, d, e)

# Append dataframes with row bind
rdf <- rbind(df, df, df) 

# Create a new data frame with only the clean data, i.e. rows with no NA values
rdf_only_good_data <- na.omit(rdf)

# Create a new data frame with only the bad data, i.e. rows containing NA
rdf_only_bad_data <- rdf[!complete.cases(rdf),] 

Output:

> rdf
    a  b  c  d  e
1   1  6 11 16 21
2   2  7 12 NA 22
3   3  8 NA 18 23
4   4 NA 14 19 24
5  NA 10 15 20 25
6   1  6 11 16 21
7   2  7 12 NA 22
8   3  8 NA 18 23
9   4 NA 14 19 24
10 NA 10 15 20 25
11  1  6 11 16 21
12  2  7 12 NA 22
13  3  8 NA 18 23
14  4 NA 14 19 24
15 NA 10 15 20 25

> rdf_only_good_data
   a b  c  d  e
1  1 6 11 16 21
6  1 6 11 16 21
11 1 6 11 16 21

> rdf_only_bad_data
    a  b  c  d  e
2   2  7 12 NA 22
3   3  8 NA 18 23
4   4 NA 14 19 24
5  NA 10 15 20 25
7   2  7 12 NA 22
8   3  8 NA 18 23
9   4 NA 14 19 24
10 NA 10 15 20 25
12  2  7 12 NA 22
13  3  8 NA 18 23
14  4 NA 14 19 24
15 NA 10 15 20 25