Processing 1 Terabyte of Text in 7 Seconds without Hadoop

An interesting post from Silvius Rus of the Cluster Team at Quantcast.

He implemented a simple Sawzall program that processes 1 TB of text in 7 seconds, starting from disk, without using Hadoop (though he mentions that Quantcast’s proprietary MapReduce cluster is loosely based on Hadoop).

He also made a very interesting design decision: to drop the sort phase of MapReduce and run the Reducer concurrently with the Mapper.
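
To make that design decision concrete, here is a minimal Python sketch of the idea. It is not Quantcast's code and not Sawzall; the names `mapper`, `reducer`, `word_count` and the word-count task itself are my own assumptions for illustration. Mappers push unsorted key/value pairs into a queue, and a reducer folds them into a hash table as they arrive, so there is no sort or grouping step in between. As far as I can tell, this only works when the reduction is commutative and associative (such as a sum), because pairs arrive in arbitrary order.

```python
import queue
import threading
from collections import Counter

_DONE = object()  # hypothetical sentinel: a mapper sends it when its input is exhausted


def mapper(lines, out):
    """Emit (word, 1) pairs straight to the reducer's queue, unsorted."""
    for line in lines:
        for word in line.split():
            out.put((word, 1))
    out.put(_DONE)


def reducer(inbox, n_mappers, totals):
    """Fold pairs into a hash table as they arrive; no sort or grouping step."""
    finished = 0
    while finished < n_mappers:
        item = inbox.get()
        if item is _DONE:
            finished += 1
            continue
        key, value = item
        totals[key] += value


def word_count(chunks):
    """Run one mapper thread per chunk alongside a single reducer thread."""
    inbox = queue.Queue(maxsize=1024)  # bounded, so mappers cannot run far ahead
    totals = Counter()
    red = threading.Thread(target=reducer, args=(inbox, len(chunks), totals))
    maps = [threading.Thread(target=mapper, args=(c, inbox)) for c in chunks]
    red.start()  # the reducer is already running while the mappers produce
    for t in maps:
        t.start()
    for t in maps:
        t.join()
    red.join()
    return dict(totals)


if __name__ == "__main__":
    # Two input chunks, each handled by its own mapper thread.
    print(word_count([["a b a"], ["b c"]]))
```

The trade-off in this sketch is that the reducer never sees all values for a key together, so anything that needs the grouped or ordered values (a median, for example) would still need the sort phase.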

Useful Hadoop and MapReduce Algorithms

I’ve chosen to reproduce the list below from Amund Tveit’s article so I can maintain a backed-up personal reference.

I also intend to update this collection of Hadoop MapReduce algorithms based on my growing experience with the platform. 

Artificial Intelligence / Machine Learning / Data Mining

  1. NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on map reduce 

  2. Distributed Evolutionary Algorithm Using the MapReduce Paradigm–A Case Study for Data Compaction Problem 

  3. On Using Pattern Matching Algorithms in MapReduce Applications

  4. Using Variational Inference and MapReduce to Scale Topic Modeling 

  5. A MapReduce-based distributed SVM algorithm for automatic image annotation 

  6. Scalable and Parallel Boosting with MapReduce 

  7. Master-Slave Parallel Genetic Algorithm Based on MapReduce Using Cloud Computing

  8. Fast clustering using MapReduce

  9. K-Means Clustering with Bagging and MapReduce

  10. In-situ MapReduce for Log Processing 

  11. Clustering Very Large Multi-dimensional Datasets with MapReduce 

  12. Large Scale Fuzzy pD* Reasoning Using MapReduce

  13. MapReduce network enabled algorithms for classification based on association rules 

  14. PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework

  15. A MapReduce based parallel SVM for large scale spam filtering 

  16. Clustering Systems with Kolmogorov Complexity and MapReduce  

Bioinformatics / Medical Informatics