Is there a way to transform a gensim corpus to work with guidedlda?

Open laurenblau opened this issue 6 years ago • 4 comments

I have a gensim corpus, can I transform it somehow to use with guidedlda?

laurenblau • Jan 11 '18 16:01

I don't know; I haven't tried that yet. Does a gensim corpus give a document-term count vector?

Is there a method for that?


vi3k6i5 • Jan 11 '18 16:01

I used gensim.matutils.corpus2csc to create a sparse matrix from gensim's bow document representation, which I then used to train the guidedLda.

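import numpy as np
import guidedlda
from gensim import matutils

def bow_iterator(docs, dictionary):
    # Lazily convert each tokenized document to gensim's bag-of-words format.
    for doc in docs:
        yield dictionary.doc2bow(doc)

def get_term_matrix(msgs, dictionary):
    bow = bow_iterator(msgs, dictionary)
    # corpus2csc returns a terms x documents float matrix, so transpose it to
    # documents x terms and cast to integer counts. num_terms keeps one column
    # per dictionary entry even if the highest ids never occur in msgs.
    X = np.transpose(matutils.corpus2csc(bow, num_terms=len(dictionary)).astype(np.int64))
    return X

# train_cleaned, dictionary, NUM_TOPICS and seed_topics are defined elsewhere.
X = get_term_matrix(train_cleaned, dictionary)

model = guidedlda.GuidedLDA(alpha=.1, n_topics=NUM_TOPICS, n_iter=300, random_state=7, refresh=20)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.6)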
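
The fit call above uses seed_topics without showing how it is built. A minimal sketch, assuming the pattern from the GuidedLDA README (a dict mapping word id to topic id). The plain dict token2id is a stand-in for dictionary.token2id of a gensim Dictionary, and the vocabulary, seed words and helper name build_seed_topics are hypothetical:

```python
# Stand-in for dictionary.token2id from a gensim Dictionary (hypothetical vocabulary).
token2id = {"game": 0, "team": 1, "win": 2, "market": 3, "stock": 4, "price": 5}

# Hypothetical seed words, one list per topic to guide.
seed_topic_list = [
    ["game", "team", "win"],       # topic 0
    ["market", "stock", "price"],  # topic 1
]

def build_seed_topics(seed_topic_list, token2id):
    """Map each seed word's id to the index of the topic it should guide."""
    seed_topics = {}
    for topic_id, words in enumerate(seed_topic_list):
        for word in words:
            # Skip seed words that are not in the vocabulary.
            if word in token2id:
                seed_topics[token2id[word]] = topic_id
    return seed_topics

seed_topics = build_seed_topics(seed_topic_list, token2id)
print(seed_topics)
```

Because the columns of X follow gensim's token ids, ids taken from the same dictionary should line up with the matrix columns.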

Hope that helps!

artichox • Feb 13 '18 22:02

Hi artichox,

First of all, thank you for your proposed solution, but there are several names in it that I do not understand.

I used gensim.matutils.corpus2csc to create a sparse matrix from gensim's bow document representation, which I then used to train the guidedLda.

This is fine.

def bow_iterator(docs, dictionary):
    for doc in docs:
        yield dictionary.doc2bow(doc)

I assume yield dictionary.doc2bow(doc) outputs an "XY.dict" file?

def get_term_matrix(msgs, dictionary):
    bow = bow_iterator(msgs, dictionary)
    X = np.transpose(matutils.corpus2csc(bow).astype(np.int64))
    return X

What does "msgs" stand for? And what does the function expect as the dictionary input, a .dict file? Does bow_iterator just transform a document into the bag-of-words format with the help of a dictionary?

X = get_term_matrix(train_cleaned, dictionary)

What do you mean by "train_cleaned"? And how is get_term_matrix defined? Which library does it belong to?

Thank you very much in advance for answering my questions!

shassanin • May 23 '19 11:05

@shassanin Hi mate! I'm not the original author of the code, but I use her code in my research.

def bow_iterator(docs, dictionary):
    for doc in docs:
        yield dictionary.doc2bow(doc)

bow_iterator takes two parameters: docs, an iterable of tokenized documents you want a BoW representation for, and dictionary, a gensim Dictionary instance built from your corpus. According to gensim's documentation, dictionary.doc2bow takes a single tokenized document (a list of tokens) and returns a list of (token_id, token_count) tuples, unlike the dense count vectors you usually see in tutorials. It returns this list in memory and does not write a .dict file. Because bow_iterator is a generator (it uses yield), the BoW documents are produced one at a time instead of all being held in memory at once.
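
To make the (token_id, token_count) format concrete, here is a stdlib-only sketch that mimics the shape of doc2bow's output. It is not gensim's implementation: build_token2id and doc2bow_sketch are hypothetical helpers, and the ids they assign may differ from the ids a real gensim Dictionary would assign.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign an integer id to each distinct token (stand-in for a gensim Dictionary)."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow_sketch(doc, token2id):
    """Return sorted (token_id, token_count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["she", "likes", "eating", "a", "lot"], ["she", "likes", "she"]]
token2id = build_token2id(docs)
bows = [doc2bow_sketch(doc, token2id) for doc in docs]
print(bows[1])  # "she" occurs twice, "likes" once
```

Only tokens that actually occur in a document appear in its list, which is why this representation is sparse.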

def get_term_matrix(msgs, dictionary):
    bow = bow_iterator(msgs, dictionary)
    X = np.transpose(matutils.corpus2csc(bow).astype(np.int64))
    return X

The variable msgs here corresponds to docs in the bow_iterator function: the tokenized documents you are going to transform into BoW. get_term_matrix is not part of any library; it is the helper defined in the code above.

As for train_cleaned, I assume it holds the documents after preprocessing, which the author of the code wants to transform into BoW.

Just for an example:

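from gensim.corpora import Dictionary

text = ["she likes eating a lot"]
tokens = [doc.split() for doc in text]  # one token list per document
dictionary = Dictionary(tokens)

# get_term_matrix is the helper defined earlier in this thread
bow = get_term_matrix(tokens, dictionary)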
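
For the one-sentence example above, the result should be a 1 x 5 matrix of counts (one row per document, one column per distinct token). A stdlib-only sketch of that layout, using a hypothetical helper bow_to_dense_rows in place of corpus2csc and assuming the token ids 0 to 4 for illustration:

```python
def bow_to_dense_rows(bows, num_terms):
    """Build a documents-by-terms count matrix as a list of integer rows."""
    rows = []
    for bow in bows:
        row = [0] * num_terms
        for token_id, count in bow:
            row[token_id] = int(count)
        rows.append(row)
    return rows

# BoW for the single example document: five distinct tokens, each seen once.
# The ids 0..4 are assumed here; gensim may assign them in a different order.
bows = [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]]
X_dense = bow_to_dense_rows(bows, num_terms=5)
print(X_dense)
```

The real helper returns a SciPy sparse matrix rather than nested lists, but the row and column meaning is the same.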

I hope it helps!

perambulate • May 28 '19 14:05