GuidedLDA
Predict topics
Hello,
First off, thanks for this great library! I managed to get the training working, but right now I'm struggling to predict the best-matching topics for a given (single) document.
I already tried doc_topics = model.transform(Z), but how do I now get the probabilities for the (e.g. 7) topics?
Thanks!
Hi @berndartmueller, the model.transform(X) method itself returns the probability distribution.
I added the following lines in the example code and it worked as expected:
print("\nPredicting topic for the first document")
doc_topic = model.transform(X[0,:]) # predict the topic distribution of the first document
print(doc_topic)
out: [[3.97730781e-05 1.86927840e-01 2.05632359e-02 4.18205495e-03
7.88287096e-01]]
As documented in the docstring of the transform method:
Returns
-------
doc_topic : array-like, shape (n_samples, n_topics)
    Point estimate of the document-topic distributions
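Since each row of the returned array is a distribution over topics, ranking the topics for a document comes down to sorting that row. A minimal sketch, using the values printed above as a stand-in for the output of model.transform; the helper name top_topics is hypothetical and not part of GuidedLDA:

```python
def top_topics(doc_topic, n=3):
    """Return, per document, the n most probable (topic_index, probability) pairs."""
    ranked = []
    for row in doc_topic:
        # Pair every probability with its topic index, highest probability first.
        pairs = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
        ranked.append([(int(idx), float(prob)) for idx, prob in pairs[:n]])
    return ranked

# Stand-in for model.transform(X[0, :]): the values from the output above.
doc_topic = [[3.97730781e-05, 1.86927840e-01, 2.05632359e-02,
              4.18205495e-03, 7.88287096e-01]]

best = top_topics(doc_topic, n=2)
print(best)
```

The same loop should also work on the NumPy array that transform returns, since it only iterates over rows and values.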
Hi @berndartmueller, to predict topics for new documents we could use the model.fit_transform(dtm) method. It worked when I used it to predict incoming documents based on the trained model.
Hi @vi3k6i5 @berndartmueller, @Praveenrajan27, @YipingNUS, can someone help me understand whether the words in new input text for prediction must already exist in the dictionary?
I'm using the code below to convert gensim data to a document-term matrix:
import numpy as np
from gensim import matutils
from gensim.matutils import corpus2csc

def bow_iterator(docs, dictionary):
    # Yield the bag-of-words representation of each tokenized document
    for doc in docs:
        yield dictionary.doc2bow(doc)

def get_term_matrix(msgs, dictionary):
    bow = bow_iterator(msgs, dictionary)
    # corpus2csc returns terms x documents, so transpose to documents x terms
    X = np.transpose(matutils.corpus2csc(bow).astype(np.int64))
    return X

X = get_term_matrix(bigram_train, train_id2word)
For predicting:
X_test = get_term_matrix([['new','travles','comfort']], train_id2word)
y_pred = model.fit_transform(X_test)
When predicting for the test input, I get an error saying that x is not a positive value.
@ImSajeed, yes you need to make sure you use the same vocab for training and prediction. In sklearn, that would correspond to fit_transform for training and transform for test/prediction.
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
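To illustrate what "same vocab" means in practice, here is a minimal pure-Python sketch. The vocabulary is fixed at training time, new documents are counted against it, and words outside it are dropped. The names build_vocab and to_count_vector are hypothetical helpers for illustration, not part of GuidedLDA, gensim or scikit-learn, and the toy documents are made up:

```python
def build_vocab(train_docs):
    """Assign a column index to every distinct token seen during training."""
    vocab = {}
    for doc in train_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_count_vector(doc, vocab):
    """Count tokens against the fixed vocabulary; unseen tokens are ignored."""
    vec = [0] * len(vocab)
    for token in doc:
        idx = vocab.get(token)
        if idx is not None:
            vec[idx] += 1
    return vec

# Toy training corpus (assumed data, already tokenized).
train_docs = [["clean", "room", "comfort"], ["travel", "comfort"]]
vocab = build_vocab(train_docs)

# New documents are mapped onto the training columns, never onto new ones.
known = to_count_vector(["new", "travel", "comfort"], vocab)
unknown = to_count_vector(["blankets", "dirty"], vocab)
print(known, unknown)
```

Every vector has the same width as the training matrix, which is what transform expects. A document whose tokens are all out of vocabulary becomes an all-zero row, so it may be worth checking for that case before calling transform.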
btw, I found GuidedLDA is good for inferring topics, but it does a poor job in classification. The following repo works much better. The downside is that it's much harder to set up. It took me two weeks to refactor it so that it can predict for new documents (the original repo requires all documents to be indexed in Lucene up-front).
https://github.com/WHUIR/STM
Hi @vi3k6i5 @YipingNUS @Praveenrajan27, could you please help with the issue below?
I'm having trouble predicting topics for new documents using y_pred = model.fit(X_test) or y_pred = model.fit_transform(X_test):
y_pred = model.fit(X_test) - gives an irrelevant topic distribution
y_pred = model.fit_transform(X_test) - does not match the correct existing topics
The same model predicts the right topics for the training documents using y_pred = model.fit_transform(X_test), but it does not work for new documents.
Please let me know the right way to predict topics for a new document.
Code below:
X_test = get_term_matrix([['blankets not','not clean']], train_id2word)
y_pred = model.fit_transform(X_test)
@ImSajeed, below is my code that worked. You should use transform instead.
def predict_prob(text):
    """ return the probability vector for the input text to belong to each of the topics
    """
    text_vec = tf_vectorizer.transform([text])
    doc_topic = seeded_model.transform(text_vec)
    return doc_topic
Hi @vi3k6i5 @YipingNUS,
Could you please explain the significance of the refresh parameter used in GuidedLDA?