GuidedLDA

Predict topics

Open berndartmueller opened this issue 6 years ago • 7 comments

Hello,

First of all, thanks for this great library! I managed to get the training working, but right now I'm struggling to predict the best-matching topics for a given (single) document.

I already tried doc_topics = model.transform(Z), but how do I now get the probabilities for the (e.g. 7) topics?

Thanks!

berndartmueller, Mar 31 '18 13:03

Hi @berndartmueller, the model.transform(X) method itself returns the probability distribution.

I added the following lines in the example code and it worked as expected:

print("\nPredicting topic for the first document")
doc_topic = model.transform(X[0,:])  # topic distribution for the first document
print(doc_topic) 

out: [[3.97730781e-05 1.86927840e-01 2.05632359e-02 4.18205495e-03
  7.88287096e-01]]

As documented in the docstring of the transform method:

        Returns
        -------
        doc_topic : array-like, shape (n_samples, n_topics)
            Point estimate of the document-topic distributions
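
Since transform returns one row of topic probabilities per document, finding the best-matching topics comes down to sorting that row. A minimal sketch in plain Python, using the probabilities printed above as sample data; `rank_topics` is a hypothetical helper for illustration, not part of GuidedLDA:

```python
def rank_topics(doc_topic_row, top_n=None):
    """Return (topic_index, probability) pairs, most probable first."""
    ranked = sorted(enumerate(doc_topic_row), key=lambda pair: pair[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Sample row copied from the output above; in practice this would be
# the first row of model.transform(X[0, :]).
row = [3.97730781e-05, 1.86927840e-01, 2.05632359e-02, 4.18205495e-03, 7.88287096e-01]

for topic, prob in rank_topics(row, top_n=3):
    print(f"topic {topic}: {prob:.4f}")
```

If the result is kept as a NumPy array, `doc_topic.argmax(axis=1)` gives the single most probable topic for each document.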

YipingNUS, May 04 '18 15:05

Hi @berndartmueller, to predict topics for new documents we could use the model.fit_transform(dtm) method. It worked when I used it to predict incoming documents based on the trained model.

Praveenrajan27, Feb 26 '19 10:02

Hi @vi3k6i5 @berndartmueller @Praveenrajan27 @YipingNUS, can someone help me understand whether the words in the new input text used for prediction must already exist in the dictionary?

I'm using the code below to convert gensim data to a document-term matrix:

import numpy as np
from gensim import matutils
from gensim.matutils import corpus2csc

def bow_iterator(docs, dictionary):
    for doc in docs:
        yield dictionary.doc2bow(doc)

def get_term_matrix(msgs, dictionary):
    bow = bow_iterator(msgs, dictionary)
    X = np.transpose(matutils.corpus2csc(bow).astype(np.int64))
    return X

X = get_term_matrix(bigram_train, train_id2word)

For predicting:

X_test = get_term_matrix([['new','travles','comfort']], train_id2word)
y_pred = model.fit_transform(X_test)

While predicting for the test input, I get an error saying that x is not a positive value.

ImSajeed, May 20 '19 14:05

@ImSajeed, yes you need to make sure you use the same vocab for training and prediction. In sklearn, that would correspond to fit_transform for training and transform for test/prediction.

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
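
To make the "same vocab" point concrete, here is a minimal sketch in plain Python of a fixed-vocabulary count vector. The vocabulary is built once from the training documents and reused unchanged at prediction time, so words never seen in training are simply dropped. `build_vocab`, `to_count_row`, and the toy corpus are hypothetical names and data for illustration; this is not how gensim or GuidedLDA implement it.

```python
from collections import Counter


def build_vocab(train_docs):
    """Map each word seen in training to a fixed column index."""
    vocab = {}
    for doc in train_docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_count_row(doc, vocab):
    """Count the words of one tokenized document using the training vocabulary.

    Words missing from the vocabulary are dropped, so the row always has
    len(vocab) columns, matching the matrix the model was trained on.
    """
    counts = Counter(word for word in doc if word in vocab)
    row = [0] * len(vocab)
    for word, n in counts.items():
        row[vocab[word]] = n
    return row


# Hypothetical toy training corpus (already tokenized).
train_docs = [["travel", "comfort", "seat"], ["blanket", "clean", "comfort"]]
vocab = build_vocab(train_docs)

# "new" and "travles" were never seen in training, so only "comfort" is counted.
print(to_count_row(["new", "travles", "comfort"], vocab))  # [0, 1, 0, 0, 0]
```

For the gensim helper posted earlier in this thread, my understanding is that `corpus2csc` infers the number of terms from the corpus it is given unless `num_terms` is passed, so passing `num_terms=len(dictionary)` may be needed to keep the test matrix the same width as the training matrix. A document made up only of unseen words would yield an all-zero row.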

btw, I found GuidedLDA is good for inferring topics, but it does a poor job in classification. The following repo works much better. The downside is that it's much harder to set up. It took me two weeks to refactor it so that it can predict for new documents (the original repo requires all documents to be indexed in Lucene up-front).

https://github.com/WHUIR/STM

YipingNUS, May 21 '19 01:05

Hi @vi3k6i5 @YipingNUS @Praveenrajan27, could you please help with the issue below?

I'm facing an issue while predicting the topics for new documents using y_pred = model.fit(X_test) or y_pred = model.fit_transform(X_test):

y_pred = model.fit(X_test) - gives an irrelevant topic distribution

y_pred = model.fit_transform(X_test) - does not match the correct existing topics

The same model predicts the right topics for the training documents using y_pred = model.fit_transform(X_test), but it does not work for new documents.

Please let me know the right way to predict topics for a new document.

Code below:

X_test = get_term_matrix([['blankets not','not clean']], train_id2word)
y_pred = model.fit_transform(X_test)

ImSajeed, May 22 '19 17:05

@ImSajeed, below is my code that worked. You should use transform instead.

def predict_prob(text):
    """Return the probability vector for the input text belonging to each topic."""
    text_vec = tf_vectorizer.transform([text])
    doc_topic = seeded_model.transform(text_vec)
    return doc_topic
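
Why transform rather than fit_transform: in the scikit-learn-style convention, fit re-estimates the model from whatever data it is given, while transform only applies what was already learned. The sketch below shows the effect with a deliberately tiny toy model. `ToySeededModel` is hypothetical and is not GuidedLDA's algorithm; it only assumes that GuidedLDA follows the same fit/transform convention, as the advice above suggests.

```python
class ToySeededModel:
    """Toy stand-in for a seeded topic model; NOT GuidedLDA's algorithm."""

    def __init__(self, seed_words):
        self.seed_words = seed_words  # one set of seed words per topic
        self.topic_word = None        # learned in fit()

    def fit(self, docs):
        # Rebuild the topic-word counts from scratch: earlier state is lost.
        self.topic_word = [{} for _ in self.seed_words]
        for doc in docs:
            for k, seeds in enumerate(self.seed_words):
                if seeds & set(doc):  # the document mentions a seed of topic k
                    for word in doc:
                        self.topic_word[k][word] = self.topic_word[k].get(word, 0) + 1
        return self

    def transform(self, docs):
        # Score documents against the learned counts without modifying them.
        result = []
        for doc in docs:
            scores = [1 + sum(tw.get(w, 0) for w in doc) for tw in self.topic_word]
            total = sum(scores)
            result.append([s / total for s in scores])
        return result

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


# Hypothetical toy data.
train_docs = [["clean", "blanket", "soft"], ["flight", "delay", "late"]]
seeds = [{"clean"}, {"flight"}]
new_docs = [["soft", "blanket"]]

trained = ToySeededModel(seeds).fit(train_docs)
print(trained.transform(new_docs))    # [[0.75, 0.25]] uses what was learned from train_docs

refit = ToySeededModel(seeds).fit(train_docs)
print(refit.fit_transform(new_docs))  # [[0.5, 0.5]] refits on new_docs alone
```

In the toy, refitting on the single new document throws away everything learned from the training set, which is consistent with the unmatched topics reported above.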

YipingNUS, May 23 '19 01:05

Hi @vi3k6i5 @YipingNUS

Could you please let me know the importance of the refresh param used in GuidedLDA?

ImSajeed, May 25 '19 08:05