Installing Key Data Science Libraries in R

I use the script below to set up my R environment with the Data Science libraries I need.

Through trial and error, I’ve found Mac OS X to be my preferred operating system for Data Science. Some of the libraries below (such as ‘bigrf’, which runs Random Forests on larger datasets) are not available for Windows. The packages that I needed to compile from source also built without trouble on Mac OS X every time.

Make sure Java is installed on your system before running the script below.

STEP 1: INSTALL JAVA & POINT R TO IT

On a Mac, add the following line to your ~/.bash_profile (adjust the JDK version in the path to match your installation):

export LD_LIBRARY_PATH=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home/jre/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home/jre/lib/server

Run the following to symlink libjvm.dylib into /usr/local/lib

$ sudo ln -s $(/usr/libexec/java_home)/jre/lib/server/libjvm.dylib /usr/local/lib

Point R to Java

$ sudo R CMD javareconf

Install the prerequisite rJava package from within R

install.packages('rJava', type='source')
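
Before moving on, it can be worth confirming that R is really able to start a JVM. The snippet below is a minimal sketch of such a check, assuming rJava installed as above; the helper name check_java is my own invention and not part of rJava. It returns NA rather than failing if rJava cannot be loaded or the JVM does not start.

```r
# Hypothetical helper (not part of rJava): returns the Java version that R
# sees, or NA if rJava cannot be loaded or the JVM fails to start.
check_java <- function() {
  if (!requireNamespace("rJava", quietly = TRUE)) {
    message("rJava is not installed or fails to load; revisit STEP 1.")
    return(NA_character_)
  }
  tryCatch({
    rJava::.jinit()  # start the JVM inside the current R session
    # Ask the JVM for its version via java.lang.System.getProperty()
    rJava::.jcall("java/lang/System", "S", "getProperty", "java.version")
  }, error = function(e) {
    message("JVM failed to start: ", conditionMessage(e))
    NA_character_
  })
}

print(check_java())
```

If this prints NA, the usual suspects are the library path and the `R CMD javareconf` step.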

STEP 2: PRE-MODELING STAGE PACKAGES

# Data Visualisation
install.packages("ggvis")
install.packages("ggplot2")
install.packages("googleVis")

# Data Transformation
install.packages("plyr")
install.packages("data.table")

# Missing Value Imputations
install.packages("missForest")
install.packages("missMDA")

# Outlier Detection
install.packages("outliers")
install.packages("evir")

# Feature Selection
install.packages("features")
install.packages("RRF")

# Dimension Reduction
install.packages("FactoMineR")
install.packages("CCP")
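
Re-running a long list of install.packages() calls reinstalls everything each time. One way around that is to install only what is missing. The following is a rough sketch, not the script I actually run: install_if_missing and its installer argument are hypothetical names introduced here for illustration, and the package vector simply repeats the STEP 2 list.

```r
# Hypothetical helper: install only the packages that are not yet present.
# 'installer' is swappable so the logic can be tried without touching CRAN.
install_if_missing <- function(pkgs, installer = utils::install.packages) {
  to_install <- setdiff(pkgs, rownames(utils::installed.packages()))
  if (length(to_install) > 0) installer(to_install)
  invisible(to_install)  # names of the packages that were requested
}

pre_modeling <- c("ggvis", "ggplot2", "googleVis", "plyr", "data.table",
                  "missForest", "missMDA", "outliers", "evir",
                  "features", "RRF", "FactoMineR", "CCP")

# install_if_missing(pre_modeling)  # uncomment to run the installation
```

The same helper could be pointed at the package lists in the later steps.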

STEP 3: MODELING STAGE PACKAGES

# Continuous regression
install.packages("car")
install.packages("randomForest")


# Ordinal regression
install.packages("rminer")
install.packages("CORElearn")


# Classification
install.packages("caret")
install.packages("devtools")
library(devtools)
# install_github() takes the repository as "user/repo"
install_github("aloysius-lim/bigrf")
# install.packages("~/Downloads/bigrf_0.1-11.tar.gz", repos = NULL, type = "source")  # Linux

# Clustering
install.packages("cba")
install.packages("Rankcluster")

# Time Series
install.packages("forecast")
install.packages("ltsa")

# Survival
install.packages("survival")
install.packages("BaSTA")
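
Since bigrf in the Classification group is not available for Windows, a script shared across machines may want to skip it there. This is a small sketch under that assumption; can_install_bigrf is a made-up helper name, and the actual install call is left commented out.

```r
# Hypothetical guard: bigrf is only attempted on non-Windows platforms.
# .Platform$OS.type is either "unix" (macOS, Linux) or "windows".
can_install_bigrf <- function(os_type = .Platform$OS.type) {
  os_type != "windows"
}

if (can_install_bigrf()) {
  message("bigrf can be attempted on this platform.")
  # devtools::install_github("aloysius-lim/bigrf")  # uncomment to install
} else {
  message("Skipping bigrf: not available for Windows.")
}
```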

STEP 4: POST-MODELING STAGE PACKAGES

# General Model Validation
install.packages("lsmeans")
install.packages("comparison")

# Regression Validation
install.packages("regtest")
install.packages("ACD")

# Classification Validation
install.packages("binomTools")
install.packages("Daim")

# Clustering Validation
install.packages("clusteval")
install.packages("sigclust")

# ROC Analysis
install.packages("pROC")
install.packages("timeROC")

STEP 5: OTHER USEFUL PACKAGES

# Improve Performance
install.packages("Rcpp")
# 'parallel' ships with base R (since 2.14.0), so it needs no installation

# Work with Web
install.packages("XML")
install.packages("jsonlite")
install.packages("httr")

# Report Results
install.packages("shiny")
install.packages("rmarkdown")

# Text Mining
install.packages("tm")
install.packages("twitteR")

# Database
install.packages("sqldf")

# Install unixodbc before RODBC
# On a Mac run: brew install unixodbc
install.packages("RODBC")
install.packages("RMongo")

# Miscellaneous
install.packages("swirl")
install.packages("reshape2")
install.packages("qcc")
install.packages("qdap")

STEP 6: RUN INVENTORY OF INSTALLED PACKAGES

# Get a full list of installed packages
write.csv(installed.packages(), file = "InstalledPackages.csv")
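
The full installed.packages() matrix has many columns, and it does not tell you which installs silently failed. As a rough sketch (the helper name missing_packages and the three example package names are only for illustration), a compact inventory and a missing-package check could look like this:

```r
# Hypothetical helper: which of the wanted packages are not installed?
missing_packages <- function(wanted,
                             installed = rownames(utils::installed.packages())) {
  setdiff(wanted, installed)
}

# Compact inventory: keep only the package name and version columns.
inventory <- as.data.frame(
  utils::installed.packages()[, c("Package", "Version"), drop = FALSE],
  stringsAsFactors = FALSE
)
print(head(inventory))

# Any names printed here did not install and need another look.
print(missing_packages(c("ggplot2", "caret", "forecast")))
```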

Nice visual categorisation of Data Science skills.

Source: http://nirvacana.com/thoughts/becoming-a-data-scientist/