Using pyldavis on top of guidedlda model

Open Jeeva-G opened this issue 5 years ago • 3 comments

Hi,

Is there a way to use the pyLDAvis library on top of a model fit by guidedlda to visualise the topic clusters?

Thanks.

Jeeva-G avatar Jan 26 '19 23:01 Jeeva-G

Hello Jeeva - I just implemented this on a project I was working on. Tested in Anaconda notebooks for inline visualisation.

[Assuming you have the document-term matrix stored as a pandas dataframe (in my case "tef_dtm"), a guidedLDA model built (in my case named "model"), and the vocab saved as a variable (in my case named "vocab")]

import pyLDAvis

# calculate doc lengths as the sum of each row of the dtm
doc_lengths = tef_dtm.sum(axis=1, skipna=True)

# transpose the dtm and get a sum of the overall term frequency
dtm_trans = tef_dtm.T
dtm_trans['total'] = dtm_trans.sum(axis=1, skipna=True)

# create a data dictionary as per this tutorial https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/Movie%20Reviews%2C%20AP%20News%2C%20and%20Jeopardy.ipynb
data = {
    'topic_term_dists': model.topic_word_,
    'doc_topic_dists': model.doc_topic_,
    'doc_lengths': doc_lengths,
    'vocab': vocab,
    'term_frequency': list(dtm_trans['total']),
}

# prepare the data
tef_vis_data = pyLDAvis.prepare(**data)

# display inline in the notebook (this must run after the prepare step above)
pyLDAvis.display(tef_vis_data)

# save to HTML
pyLDAvis.save_html(tef_vis_data, "LDAvis.html")
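
Before calling pyLDAvis.prepare it can help to confirm that the five inputs agree in shape: topics x terms, documents x topics, one length per document, and one vocab entry and one frequency per term. The following is only a minimal sketch in plain Python. The helper name check_ldavis_inputs and the toy values are hypothetical and are not part of pyLDAvis or guidedlda.

```python
import math


def check_ldavis_inputs(topic_term_dists, doc_topic_dists, doc_lengths,
                        vocab, term_frequency):
    """Return a list of problems found; an empty list means the inputs agree."""
    problems = []
    n_topics = len(topic_term_dists)
    n_docs = len(doc_topic_dists)
    n_terms = len(vocab)

    # topic_term_dists: one row per topic, one column per vocabulary term
    if any(len(row) != n_terms for row in topic_term_dists):
        problems.append("topic_term_dists needs one column per vocab entry")
    # doc_topic_dists: one row per document, one column per topic
    if any(len(row) != n_topics for row in doc_topic_dists):
        problems.append("doc_topic_dists needs one column per topic")
    if len(doc_lengths) != n_docs:
        problems.append("doc_lengths needs one entry per document")
    if len(term_frequency) != n_terms:
        problems.append("term_frequency needs one entry per vocab entry")

    # each row of both distributions should sum to (approximately) 1
    for name, rows in (("topic_term_dists", topic_term_dists),
                       ("doc_topic_dists", doc_topic_dists)):
        if any(not math.isclose(sum(row), 1.0, abs_tol=1e-6) for row in rows):
            problems.append("each row of " + name + " should sum to 1")
    return problems


# toy values: 2 topics, 3 terms, 2 documents (made up for illustration)
toy_topic_term = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
toy_doc_topic = [[0.9, 0.1], [0.4, 0.6]]
toy_problems = check_ldavis_inputs(
    toy_topic_term, toy_doc_topic, [4, 6], ["a", "b", "c"], [3, 2, 5]
)
print(toy_problems)
```

Because the helper only relies on len() and iteration, it should also accept the objects used above, for example check_ldavis_inputs(**data).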

MJMortenson avatar May 09 '19 15:05 MJMortenson

Hi

Hey, thanks. It would be great if you could add a few lines to make your code reproducible. I'm not sure what vocab is here, or whether by document-term matrix you mean the matrix that contains the frequencies or simply the counts.

famargar avatar Oct 18 '20 18:10 famargar

Hello @famargar

Vocab would just be the vocabulary stored as a list; it is the same variable name used in the main tutorial. I'm not sure what the difference is between word frequencies and word counts in this context. A DTM is normally an N x M matrix where each of the N rows is a document, each of the M columns is a word in the vocabulary, and each entry is the per-document, per-word frequency/count.
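
To make that concrete, here is a minimal sketch in plain Python that builds a vocab list and a count-based DTM from three made-up documents. The documents and the whitespace tokenisation are assumptions for illustration only; a real project would normally use its own tokeniser or vectoriser.

```python
from collections import Counter

# three made-up documents (illustrative only)
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cat and dog",
]

# naive whitespace tokenisation
tokenised = [doc.split() for doc in docs]

# vocab: every distinct word, stored as a sorted list
vocab = sorted({token for doc in tokenised for token in doc})

# DTM: N rows (documents) x M columns (vocab words), each entry a raw count
dtm = []
for doc in tokenised:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocab])

# row sums give document lengths; column sums give overall term frequency
doc_lengths = [sum(row) for row in dtm]
term_frequency = [sum(column) for column in zip(*dtm)]

print(vocab)
print(dtm)
print(doc_lengths)
print(term_frequency)
```

Wrapping the result as pd.DataFrame(dtm, columns=vocab) would give a dataframe shaped like the tef_dtm used in the earlier snippet.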

Hope that helps

MJMortenson avatar Oct 18 '20 19:10 MJMortenson