
Configure Theano & CUDA for Deep Learning on a Mac

STEP 1 – Install Theano
After installing Anaconda, run the following command in Terminal

$ conda install theano pygpu
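
To confirm the install worked, you can run a quick optional check that Theano imports cleanly and prints its version:

$ python -c "import theano; print(theano.__version__)"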


STEP 2 – Install the correct CUDA driver based on your model of Mac
Browse to this link to download the correct version of the CUDA driver for your Mac. Make sure to click on the driver link and choose the Supported Products tab to determine the Mac hardware that the particular driver supports.

For some older Macs, the version 6.5.45 driver is the best choice.


STEP 3 – Install the CUDA toolkit, but don’t upgrade the driver
Download the right version of the CUDA toolkit for your Mac from the archive here.

For some older Macs, the version 6.5 toolkit is the best choice


STEP 4 – Install Xcode
Download the Xcode app from the Apple site and install it on your Mac


STEP 5 – Install the Xcode Command Line Tools
Open a Terminal and run the following command

$ xcode-select --install

Choose to install the command line tools


STEP 6 – Check the cc compiler
Open a Terminal and run the following command

$ /usr/bin/cc --version

 

STEP 7 – Update your .bash_profile file
Add the following to .bash_profile


export LD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-6.5/lib/
export CUDA_ROOT=/Developer/NVIDIA/CUDA-6.5/
export THEANO_FLAGS='mode=FAST_RUN,device=gpu,floatX=float32'

export PATH=/Developer/NVIDIA/CUDA-6.5/bin${PATH:+:${PATH}}
export DYLD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-6.5/lib\
${DYLD_LIBRARY_PATH:+:${DYLD_LIBRARY_PATH}}
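
After saving the file, you can reload it and confirm the variables are set before moving on (a quick optional check, assuming your Terminal shell reads .bash_profile):

$ source ~/.bash_profile
$ echo $CUDA_ROOT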

 

STEP 8 – Test the NVCC compiler
Run the following in Terminal


$ /Developer/NVIDIA/CUDA-6.5/bin/nvcc -V

 

STEP 9 – Switch to the Samples Directory
Switch to the samples directory that was installed as part of the toolkit


cd /Developer/NVIDIA/CUDA-6.5/samples/

 

STEP 10 – Make the Samples
Run the commands below one at a time and make sure you don’t get any errors


make -C 0_Simple/vectorAdd
make -C 0_Simple/vectorAddDrv
make -C 1_Utilities/deviceQuery
make -C 1_Utilities/bandwidthTest

 

STEP 11 – Run the Samples
Switch to the relevant directory to run the compiled files


cd /Developer/NVIDIA/CUDA-6.5/samples/bin/x86_64/darwin/release

Run the commands below one at a time and make sure you get the expected output


./deviceQuery
./bandwidthTest

 

STEP 12 – Configure Theano to use the GPU
Create a file called .theanorc in your HOME directory and add the following to it


[blas]
ldflags =

[global]
floatX = float32
device = gpu

[nvcc]
fastmath = True

[gcc]
# No extra include flags are needed on a Mac (the MinGW include path only applies on Windows)
cxxflags =

[cuda]
# Set to where the cuda drivers are installed.
root=/usr/local/cuda/

 


STEP 13 – Run the following to confirm that Theano now uses the GPU


from theano import function, config, shared, tensor
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
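
Save this as a script, for example gpu_test.py (the file name is just an example), and run it from Terminal:

$ python gpu_test.py

If Theano is using the GPU, the toposort output will list GPU ops and the script should print "Used the gpu" at the end.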

Reading in Large Datasets in R

R can appear to run slowly while processing larger datasets. Some tips for dealing with the issue of preprocessing Big Data in R are listed below:

TIP 1 – Set the colClasses Argument
While reading in large datasets using read.table(), R can seem to lag as it attempts to infer the class of each column in the background by scanning the dataset. This can slow down the process of reading in the dataset, especially with Big Data.

The colClasses argument in read.table() must be set for large datasets. Use the code below to figure out the classes for each of your columns; it uses only the first 100 rows to infer the column classes:

initial <- read.table("datatable.txt", nrows=100)
classes <- sapply(initial, class)
tabAll <- read.table("datatable.txt", colClasses=classes)


TIP 2 – Set the nrows Argument
Set the nrows argument in read.table(). If you’re on a Mac or Linux, use the command line utility wc to get an estimate of the number of lines/rows in your input file. This doesn’t have to be exact, just close enough.
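
For example, assuming the same hypothetical datatable.txt file from Tip 1 (the nrows value below is just an illustrative estimate based on the wc output):

# Estimate the row count from the shell first, e.g. on a Mac/Linux:
#   $ wc -l datatable.txt
# then pass a number at least that large to read.table()
tabAll <- read.table("datatable.txt", colClasses = classes, nrows = 110000)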

TIP 3 – Set the comment.char Argument
If there are no comment lines in the file, set the comment.char argument to "" so read.table() doesn’t scan for comments.
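
For example, again using the hypothetical datatable.txt:

tabAll <- read.table("datatable.txt", colClasses = classes, comment.char = "")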

TIP 4 – Avoid Resource Contention
When processing large datasets, make sure there is no resource contention on the machine, whether you are using a local R client or a distributed environment such as RStudio Server or Microsoft R Server.

TIP 5 – Making Decisions about RAM Restrictions
If the size of the dataset is larger than the memory available on your computer, then you will not be able to process the data effectively in R. A 64-bit system can address far more memory than a 32-bit one, so it will cope with larger datasets.

There is a rough formula that can be used to calculate the amount of RAM required for R to hold the dataset in memory. The MAJOR assumption in the examples below is that all columns are of the numeric class and therefore use 8 bytes per value. I use this as a “gut feel” check when planning the pre-processing of large datasets.

amount_of_RAM_in_GB <- (no_of_rows * no_of_columns * 8)/2^20/1024

Please note that the actual memory needed for the computer to run effectively is roughly double the amount required by R to hold the dataset.

Rough Calculations:

  • 4GB RAM for 2.69M Rows, 200 Columns
  • 16GB RAM for 4.3M Rows, 500 Columns
  • 75GB RAM for 10M Rows, 1000 Columns
  • 15TB RAM for 1B Rows, 2000 Columns
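
As a quick sanity check, the formula can be wrapped in a small R helper to reproduce the figures above (the row and column counts are just the examples listed):

# Rough RAM estimate in GB, assuming every column is numeric (8 bytes per value)
ram_gb <- function(no_of_rows, no_of_columns) {
  (no_of_rows * no_of_columns * 8) / 2^20 / 1024
}

ram_gb(2.69e6, 200)   # ~4 GB
ram_gb(4.3e6, 500)    # ~16 GB
ram_gb(10e6, 1000)    # ~75 GB
ram_gb(1e9, 2000)     # ~14,900 GB, i.e. roughly 15 TB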

TIP 6 – Remove NA values
Based on your use case, you may decide to exclude missing values prior to developing machine learning algorithms. I’ve followed an approach of extracting the bad data (i.e. rows containing NA values) into a separate dataset that I may choose to use for analysis at a later stage.

See the example below for extracting the bad data and retaining the clean data. This approach should also work on a larger dataset.

# Create dummy data for data frame
a <- c(1, 2, 3, 4, NA)
b <- c(6, 7, 8, NA, 10)
c <- c(11, 12, NA, 14, 15)
d <- c(16, NA, 18, 19, 20)
e <- c(21, 22, 23, 24, 25)

# Combine vectors to form a larger data frame
df <- data.frame(a, b, c, d, e)

# Append dataframes with row bind
rdf <- rbind(df, df, df) 

# Create a new data frame with only the clean data, i.e. rows with no NA values
rdf_only_good_data <- na.omit(rdf)

# Create a new data frame with only the bad data, i.e. rows containing NA values
rdf_only_bad_data <- rdf[!complete.cases(rdf),] 

Output:

> rdf
    a  b  c  d  e
1   1  6 11 16 21
2   2  7 12 NA 22
3   3  8 NA 18 23
4   4 NA 14 19 24
5  NA 10 15 20 25
6   1  6 11 16 21
7   2  7 12 NA 22
8   3  8 NA 18 23
9   4 NA 14 19 24
10 NA 10 15 20 25
11  1  6 11 16 21
12  2  7 12 NA 22
13  3  8 NA 18 23
14  4 NA 14 19 24
15 NA 10 15 20 25

> rdf_only_good_data
   a b  c  d  e
1  1 6 11 16 21
6  1 6 11 16 21
11 1 6 11 16 21

> rdf_only_bad_data
    a  b  c  d  e
2   2  7 12 NA 22
3   3  8 NA 18 23
4   4 NA 14 19 24
5  NA 10 15 20 25
7   2  7 12 NA 22
8   3  8 NA 18 23
9   4 NA 14 19 24
10 NA 10 15 20 25
12  2  7 12 NA 22
13  3  8 NA 18 23
14  4 NA 14 19 24
15 NA 10 15 20 25