Tag Archives: Hadoop

The Hadoop ecosystem visualised by Rich Taylor from Datameer.

Useful Hadoop and MapReduce Algorithms

I’ve reproduced the list below from Amund Tveit’s article so that I have a backed-up personal reference.

I also intend to update this collection of Hadoop MapReduce algorithms as my experience with the platform grows.

Artificial Intelligence / Machine Learning / Data Mining

  1. NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on map reduce 

  2. Distributed Evolutionary Algorithm Using the MapReduce Paradigm–A Case Study for Data Compaction Problem 

  3. On Using Pattern Matching Algorithms in MapReduce Applications

  4. Using Variational Inference and MapReduce to Scale Topic Modeling 

  5. A MapReduce-based distributed SVM algorithm for automatic image annotation 

  6. Scalable and Parallel Boosting with MapReduce 

  7. Master-Slave Parallel Genetic Algorithm Based on MapReduce Using Cloud Computing

  8. Fast clustering using MapReduce

  9. K-Means Clustering with Bagging and MapReduce

  10. In-situ MapReduce for Log Processing 

  11. Clustering Very Large Multi-dimensional Datasets with MapReduce 

  12. Large Scale Fuzzy pD* Reasoning Using MapReduce

  13. MapReduce network enabled algorithms for classification based on association rules 

  14. PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework

  15. A MapReduce based parallel SVM for large scale spam filtering 

  16. Clustering Systems with Kolmogorov Complexity and MapReduce  

Bioinformatics / Medical Informatics

Installing VMware Tools on a CentOS 5 VM

The first thing to do in the installation process is to add the CD/DVD ROM device within the VMware Fusion Settings for the CentOS VM. This should be done only after the VM is shut down.


Then start the virtual machine and enter the following commands in a terminal. They install the C and C++ compilers and the kernel development files that the VMware Tools installer needs to build its kernel modules, and create a symbolic link to the CentOS kernel source directory.

$ sudo yum install gcc gcc-c++ kernel-devel
$ sudo ln -s /usr/src/kernels/[kernel version] /usr/src/linux
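
The [kernel version] placeholder has to name a directory that kernel-devel actually installed. As a minimal sketch, assuming the directories under /usr/src/kernels are named after the kernel release (optionally followed by an architecture suffix), a small helper can look it up. The function name find_kernel_src is hypothetical and is not part of CentOS or VMware Tools.

```shell
# Hypothetical helper: print the kernel source directory under a base path
# (default /usr/src/kernels) that matches a given kernel release
# (default: the running kernel, as reported by uname -r).
find_kernel_src() {
    base="${1:-/usr/src/kernels}"
    release="${2:-$(uname -r)}"
    # Assumption: directories are named after the release, possibly with
    # an architecture suffix such as -x86_64 or -i686.
    for dir in "$base/$release" "$base/$release"-*; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    echo "no kernel source directory for $release under $base" >&2
    return 1
}
```

With the function defined in the current shell, the link step could then be written as follows, provided the installed kernel-devel version matches the running kernel:

$ sudo ln -s "$(find_kernel_src)" /usr/src/linux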

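Mount the VMware Tools CD image and unpack the installation files. The mount point is created first in case /mnt/cdrom does not already exist.

$ sudo mkdir -p /mnt/cdrom
$ sudo mount /dev/cdrom /mnt/cdrom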
$ cd /mnt/cdrom
$ cp VMwareTools-[version].tar.gz /tmp
$ cd /tmp
$ sudo umount /mnt/cdrom
$ tar zxf VMwareTools-[version].tar.gz
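
The copy and unpack steps above can also be sketched as one function that uses a shell glob in place of the [version] placeholder. This is only an illustration: the name unpack_vmware_tools and its two optional arguments are hypothetical, and it assumes the tarball on the CD follows the VMwareTools-[version].tar.gz naming shown above.

```shell
# Hypothetical helper: copy the VMwareTools tarball from a mounted CD
# directory into a work directory and unpack it there.
unpack_vmware_tools() {
    src_dir="${1:-/mnt/cdrom}"
    work_dir="${2:-/tmp}"
    # The glob stands in for the [version] placeholder; use the first match.
    for tarball in "$src_dir"/VMwareTools-*.tar.gz; do
        [ -f "$tarball" ] || continue
        cp "$tarball" "$work_dir"/ || return 1
        tar zxf "$work_dir/$(basename "$tarball")" -C "$work_dir" || return 1
        return 0
    done
    echo "no VMwareTools-*.tar.gz found in $src_dir" >&2
    return 1
}
```

Called with no arguments after the mount step, it reads from /mnt/cdrom and unpacks into /tmp, leaving the same /tmp/vmware-tools-distrib directory as the manual commands.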

Finally, run the installation script.

$ cd /tmp/vmware-tools-distrib
$ sudo ./vmware-install.pl

Finding files on CentOS 5

I’m using the CentOS Linux distro for the first time for some Hadoop Big Data work and am having fun rediscovering the powerful *NIX shell.

An initial challenge I faced was being unable to search for files in the operating system as the “mlocate” package is not installed on CentOS 5 by default.

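The commands below install the mlocate package (which adds a daily cron job to index the system), run that job once by hand so the index exists straight away, confirm that the cron script and updatedb can be found, and then search for any file or folder with the string “Hadoop” in its name. Note that locate is case-sensitive by default.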

$ sudo yum install mlocate
$ sudo /etc/cron.daily/mlocate.cron
$ locate mlocate.cron
$ locate updatedb
$ locate Hadoop | more
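
Until the mlocate index has been built, or for files created since the last index run, the same kind of search can be done with find, which walks the file system directly and is therefore slower. The sketch below is an assumption-laden illustration rather than part of the mlocate workflow: find_by_name is a hypothetical wrapper, and -iname is the case-insensitive name test provided by GNU find.

```shell
# Hypothetical helper: case-insensitive name search under a directory
# using find, which needs no prebuilt index.
find_by_name() {
    root="$1"
    needle="$2"
    # -iname matches the last path component, ignoring case; errors from
    # unreadable directories are discarded.
    find "$root" -iname "*${needle}*" 2>/dev/null
}
```

For example, the following would list matches such as hadoop-env.sh as well as Hadoop directories under /opt:

$ find_by_name /opt hadoop | more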