GuidedLDA
Is there a way to transform a gensim corpus to work with guidedlda?
I have a gensim corpus, can I transform it somehow to use with guidedlda?
I don't know, I haven't tried that yet. Does a gensim corpus give you a document-term count vector?
Is there a method for that?
On Thu 11 Jan, 2018, 21:50 laurenblau, [email protected] wrote:
I have a gensim corpus, can I transform it somehow to use with guidedlda?
I used gensim.matutils.corpus2csc to create a sparse matrix from gensim's bow document representation, which I then used to train the guidedLda.
import numpy as np
import guidedlda
from gensim import matutils

def bow_iterator(docs, dictionary):
    # Lazily convert each tokenized document to gensim's bag-of-words format.
    for doc in docs:
        yield dictionary.doc2bow(doc)

def get_term_matrix(msgs, dictionary):
    bow = bow_iterator(msgs, dictionary)
    # corpus2csc returns terms x documents; transpose to documents x terms.
    X = np.transpose(matutils.corpus2csc(bow).astype(np.int64))
    return X

X = get_term_matrix(train_cleaned, dictionary)
model = guidedlda.GuidedLDA(alpha=.1, n_topics=NUM_TOPICS, n_iter=300, random_state=7, refresh=20)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.6)
Hope that helps!
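The snippet above does not show how seed_topics is built. A minimal sketch, assuming it follows the pattern from the GuidedLDA README (a dict mapping word id to topic id) and that the word ids come from the gensim dictionary's token2id mapping, since those ids are also the column indices of X. The token2id dict, the seed words, and the build_seed_topics helper below are hypothetical stand-ins for illustration:

```python
# Hypothetical stand-in for dictionary.token2id (token -> integer id).
# With gensim you would pass dictionary.token2id instead.
token2id = {"game": 0, "team": 1, "election": 2, "vote": 3, "weather": 4}

# Hypothetical seed words: one list per topic you want to guide.
seed_topic_list = [
    ["game", "team"],
    ["election", "vote", "senate"],  # "senate" is not in the vocabulary
]

def build_seed_topics(seed_topic_list, token2id):
    """Map word id -> topic id, skipping seed words missing from the vocabulary."""
    seed_topics = {}
    for topic_id, words in enumerate(seed_topic_list):
        for word in words:
            if word in token2id:
                seed_topics[token2id[word]] = topic_id
    return seed_topics

seed_topics = build_seed_topics(seed_topic_list, token2id)
print(seed_topics)
```

Skipping out-of-vocabulary seed words is a design choice in this sketch; you may prefer to raise an error so that typos in the seed lists do not go unnoticed.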
Hi artichox,
First of all, thank you for your proposed solution, but there are several variables I do not understand.
I used gensim.matutils.corpus2csc to create a sparse matrix from gensim's bow document representation, which I then used to train the guidedLda.
This is fine.
def bow_iterator(docs, dictionary):
    for doc in docs:
        yield dictionary.doc2bow(doc)
I assume yield dictionary.doc2bow(doc) outputs an "XY.dict" file?
def get_term_matrix(msgs, dictionary):
    bow = bow_iterator(msgs, dictionary)
    X = np.transpose(matutils.corpus2csc(bow).astype(np.int64))
    return X
What does "msgs" stand for? And what does the function expect as the dictionary input, a .dict file? Does bow_iterator just transform a document into the bag-of-words format with the help of a dictionary?
X = get_term_matrix(train_cleaned, dictionary)
What do you mean by "train_cleaned"? And how is get_term_matrix defined? Which library does it belong to?
Thank you very much in advance for answering my questions!
@shassanin Hi mate! I may not be the original author of the code but I use her code in my research.
def bow_iterator(docs, dictionary):
    for doc in docs:
        yield dictionary.doc2bow(doc)
bow_iterator takes two parameters: docs, the tokenized documents you want a BoW representation of, and dictionary, a gensim Dictionary instance built from your corpus. According to gensim's documentation, dictionary.doc2bow takes a single tokenized document (a list of tokens) and returns a list of (token_id, token_count) tuples, so it does not write a .dict file. This sparse form differs from the dense count vectors you usually see in tutorials. Because bow_iterator is a generator (yield), the BoW documents are produced one at a time rather than all being held in memory at once.
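To make the (token_id, token_count) format concrete, here is a plain-Python sketch that mimics the shape of doc2bow's output and compares it with a dense count vector. It is not gensim's implementation: doc_to_bow, to_dense, iter_bows, and the token2id mapping are hypothetical stand-ins, and with gensim you would call dictionary.doc2bow instead.

```python
from collections import Counter

def doc_to_bow(tokens, token2id):
    """Stand-in for the shape of doc2bow's output: sorted (token_id, count) pairs.
    Tokens that are not in the vocabulary are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

def to_dense(bow, num_terms):
    """Expand sparse (token_id, count) pairs into a dense count vector."""
    vec = [0] * num_terms
    for token_id, count in bow:
        vec[token_id] = count
    return vec

def iter_bows(docs, token2id):
    """Generator: yields one BoW document at a time, like bow_iterator."""
    for tokens in docs:
        yield doc_to_bow(tokens, token2id)

# Hypothetical vocabulary and documents.
token2id = {"eating": 0, "likes": 1, "she": 2}
docs = [["she", "likes", "eating"], ["she", "likes", "she"]]

bows = iter_bows(docs, token2id)   # nothing is computed yet
first = next(bows)                 # sparse pairs for the first document
second = next(bows)                # sparse pairs for the second document
dense_second = to_dense(second, len(token2id))
print(first, second, dense_second)
```

The sparse pairs only list tokens that actually occur in the document, which is why they stay small even when the vocabulary is large.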
def get_term_matrix(msgs, dictionary):
    bow = bow_iterator(msgs, dictionary)
    X = np.transpose(matutils.corpus2csc(bow).astype(np.int64))
    return X
The variable msgs here corresponds to docs in the bow_iterator function: the documents you are going to transform into BoW.
As for train_cleaned, I assume it holds the preprocessed documents that the OP wants to transform into BoW. get_term_matrix itself is not from a library; it is the helper function defined above.
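One step in get_term_matrix that is easy to miss is the transpose. gensim's corpus2csc returns a matrix with documents as columns (terms x documents), while the fit call expects one row per document (documents x terms). A plain-Python sketch of that orientation change, using hypothetical helpers (bows_to_term_doc, transpose) and nested lists instead of a sparse matrix:

```python
def bows_to_term_doc(bows, num_terms):
    """Build a dense terms x documents count matrix from BoW documents.
    Rows are token ids, columns are documents (the orientation corpus2csc uses)."""
    matrix = [[0] * len(bows) for _ in range(num_terms)]
    for doc_idx, bow in enumerate(bows):
        for token_id, count in bow:
            matrix[token_id][doc_idx] = count
    return matrix

def transpose(matrix):
    """Swap rows and columns: terms x documents -> documents x terms."""
    return [list(row) for row in zip(*matrix)]

# Two hypothetical BoW documents over a 3-term vocabulary.
bows = [[(0, 1), (1, 2)], [(1, 1), (2, 3)]]

term_doc = bows_to_term_doc(bows, num_terms=3)  # 3 rows (terms), 2 columns (docs)
doc_term = transpose(term_doc)                  # 2 rows (docs), 3 columns (terms)
print(term_doc)
print(doc_term)
```

In the real code the matrix stays sparse and the transpose is done by np.transpose, but the row and column meaning is the same.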
Just for an example:
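from gensim.corpora import Dictionary

text = ["she likes eating a lot"]
tokens = [doc.split() for doc in text]  # one list of tokens per document
dictionary = Dictionary(tokens)
bow = get_term_matrix(tokens, dictionary)  # documents x terms count matrix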
I hope it helps!