Orange 3 Natural Language Processing Reusable Template

The Natural Language Processing workflow below can be used to generate Topic Models from a monolingual corpus, along with their associated Word Clouds.

Latent Semantic Indexing, Latent Dirichlet Allocation and Hierarchical Dirichlet Process are the three techniques available for Topic Modelling in the Orange 3 toolkit.

In my experience, Latent Semantic Indexing (LSI) produced the most useful and relevant Topics, because it can correlate semantically related terms that are latent in a collection of text. LSI applies a mathematical technique called singular value decomposition (SVD) to a term-document matrix to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.
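
To make the SVD step more concrete, here is a minimal sketch of the idea in plain Python. It is not how Orange 3 implements its Topic Modelling widget. The five-document toy corpus, the function names and the choice of power iteration with deflation to approximate the largest singular values are all my own assumptions for illustration. Each "topic" is a left singular vector of the term-document matrix, and terms that tend to appear in the same documents end up close together in the reduced space.

```python
import math
import random


def build_term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    index = {term: i for i, term in enumerate(vocabulary)}
    matrix = [[0.0] * len(documents) for _ in vocabulary]
    for j, doc in enumerate(documents):
        for word in doc.split():
            matrix[index[word]][j] += 1.0
    return vocabulary, matrix


def _times(matrix, v):
    """Matrix-vector product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


def _transpose_times(matrix, u):
    """Matrix-vector product A^T u."""
    cols = len(matrix[0])
    return [sum(matrix[i][j] * u[i] for i in range(len(matrix))) for j in range(cols)]


def _norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def truncated_svd(matrix, k, iterations=200, seed=0):
    """Approximate the k largest singular triplets (sigma, u, v) of `matrix`
    by power iteration on A^T A, deflating after each triplet is found."""
    residual = [row[:] for row in matrix]
    rng = random.Random(seed)  # fixed seed keeps the sketch deterministic
    triplets = []
    for _topic in range(k):
        v = [rng.random() for _ in range(len(matrix[0]))]
        for _step in range(iterations):
            w = _transpose_times(residual, _times(residual, v))
            length = _norm(w)
            if length == 0.0:
                break
            v = [x / length for x in w]
        u = _times(residual, v)
        sigma = _norm(u)
        if sigma == 0.0:
            break  # nothing left to decompose
        u = [x / sigma for x in u]
        triplets.append((sigma, u, v))
        # Deflate: subtract the rank-1 component that was just found.
        for i, row in enumerate(residual):
            for j in range(len(row)):
                row[j] -= sigma * u[i] * v[j]
    return triplets


def top_terms(vocabulary, term_vector, n=3):
    """Terms with the largest absolute weight in one topic vector."""
    ranked = sorted(zip(vocabulary, term_vector), key=lambda p: abs(p[1]), reverse=True)
    return [term for term, _weight in ranked[:n]]


def term_coordinates(triplets, term_count):
    """Coordinates of each term in the reduced space (sigma-scaled rows of U)."""
    return [[sigma * u[i] for sigma, u, _v in triplets] for i in range(term_count)]


def cosine(a, b):
    denominator = _norm(a) * _norm(b)
    return sum(x * y for x, y in zip(a, b)) / denominator if denominator else 0.0


# Hypothetical toy corpus: three documents about pets, two about finance.
documents = [
    "cat dog pet",
    "dog pet animal",
    "cat pet animal",
    "stock market trade",
    "market trade price",
]
vocabulary, matrix = build_term_document_matrix(documents)
topics = truncated_svd(matrix, k=2)
for number, (sigma, u, _v) in enumerate(topics, start=1):
    print(f"Topic {number} (singular value {sigma:.3f}):", top_terms(vocabulary, u))
```

On this toy corpus the two topics separate the pet vocabulary from the finance vocabulary. A real corpus would normally be tokenised, filtered and weighted (for example with TF-IDF) before the decomposition, and a library SVD routine would replace the hand-written power iteration.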

Useful Hadoop and MapReduce Algorithms

I’ve chosen to reproduce the list below from Amund Tveit’s article so that I can maintain a backed-up personal reference.

I also intend to update this collection of Hadoop MapReduce algorithms as my experience with the platform grows.

