Text analytics on MPP

A collection of tutorials on text analytics and NLP, covering vector space models, neural language models and topic models on the Pivotal MPP platform (Greenplum/HAWQ).

Vector space models

1. Tokenization, stemming, and generation of unigrams, bigrams, trigrams and skip-grams.
2. Bag-of-words model for classification on the 20-news-groups dataset.
3. tf-idf weighting for classification on the 20-news-groups dataset.
4. Feature hashing for classification on the 20-news-groups dataset.
5. Grid search on model parameters for Elastic Net on the tf-idf representation.
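
To make the feature-generation steps listed above concrete, here is a minimal pure-Python sketch of tokenization, n-gram and skip-gram generation, tf-idf weighting and feature hashing. It is not the code from the notebooks, which run against the database. The helper names (`tokenize`, `ngrams`, `skipgrams`, `tf_idf`, `hash_features`), the raw-count tf with `log(N / df)` idf, the CRC32 hash and the two toy sentences are all assumptions made for illustration. Stemming is left out.

```python
import math
import re
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return contiguous n-grams as tuples (n=1 unigrams, n=2 bigrams, ...)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, k):
    """Return ordered token pairs with at most k tokens skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


def tf_idf(docs):
    """Weight each term by its raw count times log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


def hash_features(tokens, n_buckets=16):
    """Map tokens to a fixed-length count vector using a stable hash (CRC32)."""
    vec = [0] * n_buckets
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % n_buckets] += 1
    return vec


# Toy corpus (hypothetical, not taken from the 20-news-groups dataset).
docs = [tokenize(d) for d in ("The cat sat on the mat.", "The dog sat.")]
bigrams = ngrams(docs[0], 2)
skips = skipgrams(["a", "b", "c"], 1)
weights = tf_idf(docs)
hashed = hash_features(docs[0])
```

A term that occurs in every document, such as "the" here, gets a tf-idf weight of zero under this unsmoothed formula, while the hashed vector keeps a fixed length regardless of vocabulary size at the cost of possible collisions.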

Topic models

1. LDA topic models on the 20-news-groups dataset.
2. Grid search for LDA hyperparameters on the 20-news-groups dataset.
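
The grid-search idea above can be sketched in plain Python as an exhaustive loop over every combination of candidate values. This is only an illustration, not the notebooks' implementation. The parameter names (`num_topics`, `alpha`, `beta`), the candidate values and the `toy_score` function are hypothetical. In a real run the scoring function would train an LDA model and return a held-out metric.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination; return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(num_topics, alpha, beta):
    """Placeholder objective (assumption): peaks at 20 topics, alpha=0.1, beta=0.01.

    A real scoring function would fit a model and evaluate it on held-out data.
    """
    return -(abs(num_topics - 20) + abs(alpha - 0.1) + abs(beta - 0.01))


# Hypothetical candidate values for each hyperparameter.
grid = {
    "num_topics": [10, 20, 50],
    "alpha": [0.1, 1.0],
    "beta": [0.01, 0.1],
}
best_params, best_score = grid_search(grid, toy_score)
```

The number of evaluations is the product of the list lengths (12 here), so the grid grows quickly as hyperparameters or candidate values are added.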

Neural Language Models (Paragraph Vectors, Word2Vec etc.)

1. Classification models on the 20-news-groups dataset using a Paragraph Vector representation built with the `doc2vec` module in `gensim`.

Dependencies

These exercises have the following client-side and server-side dependencies:

Client side: We encourage you to install Anaconda Python to run the Jupyter notebooks. The notebooks in these exercises use matplotlib and seaborn for data visualization, and pandas and psycopg2 to query the backend database.
Server side: You'll need to install scikit-learn (`sklearn`) and its dependencies.

Note

These notebooks have been uploaded only to show code snippets. They are not meant to be a complete tutorial on their own, as there is a narration that accompanies these exercises.
