
Bayesian Nonparametric

Bayesian Nonparametric models with Python.

Models follow scikit-learn's API and can be used as an extension of it.
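
For example, since the estimator exposes the usual fit method, it can sit at the end of a scikit-learn Pipeline. The snippet below is only a sketch: the toy corpus is a placeholder, CountVectorizer uses its defaults, and the HDP parameters are the ones used in the Getting started section below.

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from bnp.online_hdp import HierarchicalDirichletProcess

>>> docs = ["apples oranges fruit", "cats dogs pets", "fruit salad apples"]  # toy corpus
>>> pipe = Pipeline([
...     ("tf", CountVectorizer()),
...     ("hdp", HierarchicalDirichletProcess(n_topic_truncate=10,
...                                          n_doc_truncate=3,
...                                          max_iter=5)),
... ])
>>> pipe.fit(docs)  # vectorizes the texts, then fits the HDP on the word counts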

Current model:

  • Hierarchical Dirichlet Process

    HDP is similar to LDA (Latent Dirichlet Allocation) but assumes an "infinite" number of topics. This implementation is based on Chong Wang's online-hdp code and is optimized with Cython; see the sketch below for how this differs from a fixed-topic LDA in practice.
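
The practical difference from a fixed-topic LDA is that you give HDP truncation bounds rather than a fixed number of topics; topics the data does not need end up with near-zero proportion (see the Getting started output below). A rough sketch of the contrast, with parameter values borrowed from the example later in this README:

>>> from sklearn.decomposition import LatentDirichletAllocation
>>> from bnp.online_hdp import HierarchicalDirichletProcess

>>> lda = LatentDirichletAllocation(n_components=5)           # topic count fixed up front
>>> hdp = HierarchicalDirichletProcess(n_topic_truncate=10,   # upper bound on topics, not a fixed count
...                                    n_doc_truncate=3)      # per-document topic truncation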

Reference:

  • "Stochastic Variational Inference", Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley, 2013
  • "Online Variational Inference for the Hierarchical Dirichlet Process", Chong Wang, John Paisley, David M. Blei, 2011
  • Chong Wang's online-hdp code.

Install:

# clone repository
git clone [email protected]:chyikwei/bnp.git
cd bnp

# install dependencies (cython, numpy, scipy, scikit-learn)
pip install -r requirements.txt
pip install .
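
Optionally, you can check that the package imports correctly after installation:

# quick sanity check (optional)
python -c "from bnp.online_hdp import HierarchicalDirichletProcess"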

Getting started:

In bnp.utils we provide a function to generate a fake document-word matrix with hidden topics. We will run our HDP model on it.

First, we generate a document-word matrix with 5 hidden topics (each topic has 10 unique words and 100 documents).

>>> from __future__ import print_function
>>> from bnp.online_hdp import HierarchicalDirichletProcess
>>> from bnp.utils import make_doc_word_matrix

>>> tf = make_doc_word_matrix(n_topics=5,
...                           words_per_topic=10,
...                           docs_per_topic=100,
...                           words_per_doc=20,
...                           shuffle=True,
...                           random_state=0)
>>> tf.shape
(500, 50)

Each row (document) in the matrix contains words from only one topic (words 0 to 9: topic 1, words 10 to 19: topic 2, ...).

>>> tf[0].toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 1, 4, 1, 2, 3, 3, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0]])
>>> tf[1].toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 3, 1, 3, 2, 1, 2, 0, 3, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0]])
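
As a quick sanity check (a minimal NumPy sketch), we can confirm from the first row above that its nonzero word indices all fall within a single 10-word topic block:

>>> import numpy as np
>>> cols = tf[0].nonzero()[1]   # word indices that appear in document 0
>>> np.unique(cols // 10)       # each 10-word block corresponds to one hidden topic
array([1])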

Next, we fit an HDP model to this matrix.

>>> hdp = HierarchicalDirichletProcess(n_topic_truncate=10,
...                                    n_doc_truncate=3,
...                                    max_iter=5,
...                                    random_state=0)
>>> hdp.fit(tf)

Then we can print out the topic proportions and the top words of each topic in the HDP model.

# helper: print each topic's proportion and its top word indices
>>> def print_top_words(model, n_words):
...     topic_distr = model.topic_distribution()
...     for topic_idx in range(model.lambda_.shape[0]):
...         topic = model.lambda_[topic_idx, :]
...         message = "Topic %d (proportion: %.2f): " % (topic_idx, topic_distr[topic_idx])
...         message += " ".join([str(i) for i in topic.argsort()[:-n_words - 1:-1]])
...         print(message)

>>> print_top_words(hdp, 10)
Topic 0 (proportion: 0.20): 3 1 7 5 8 4 0 2 9 6
Topic 1 (proportion: 0.00): 49 12 22 21 20 19 18 17 16 15
Topic 2 (proportion: 0.04): 43 49 44 45 47 40 46 48 41 42
Topic 3 (proportion: 0.13): 14 18 10 15 16 12 17 19 11 13
Topic 4 (proportion: 0.07): 19 16 10 15 11 17 12 13 18 14
Topic 5 (proportion: 0.01): 23 29 28 20 21 25 26 24 27 22
Topic 6 (proportion: 0.01): 31 38 35 39 30 33 34 37 32 36
Topic 7 (proportion: 0.19): 35 31 39 30 33 38 32 34 36 37
Topic 8 (proportion: 0.16): 48 42 46 49 45 47 41 44 40 43
Topic 9 (proportion: 0.19): 21 29 28 23 20 24 26 27 25 22

Here HDP finds 7 large topics (proportion > 1%), and they map onto the hidden topics we generated before: for example, Topic 0's top words all lie in the 0-9 block, while Topics 3 and 4 both draw from the 10-19 block.
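
The proportions above come from topic_distribution(); assuming it returns a NumPy array (as its use in print_top_words suggests), the five dominant topics can be read off directly, and they cover the five generated topics:

>>> topic_distr = hdp.topic_distribution()            # per-topic proportions, as printed above
>>> sorted(topic_distr.argsort()[::-1][:5].tolist())  # indices of the 5 largest topics
[0, 3, 7, 8, 9]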

Examples:

Examples are in the bnp/examples folder. (An IPython notebook will be added soon.)

Running Tests:

python setup.py test

Uninstall:

pip uninstall bnp