Coherence RuntimeWarnings : divide by zero encountered in double_scalars AND invalid value encountered in double_scalars
Problem description
For my internship, I'm trying to evaluate the quality of different LDA models using both perplexity and coherence. I generated each model in a loop, but I run into a problem when the loop reaches parameter_value=15: the coherence score is stored as nan for the models with 15, 20, 25 and 30 topics. I tried fixing this by changing the parameters in LdaModel(), but that only pushes the warning to later models: instead of getting it at parameter_value=15, I get it at parameter_value=30.
Can someone please help me?
Problem encountered: warning
starting pass for parameter_value = 30.000
Elapsed time: 1.6870347789972584
Perplexity score: -13.63168019880968
C:\Users\straw\Anaconda3\lib\site-packages\gensim\topic_coherence\direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
m_lr_i = np.log(numerator / denominator)
C:\Users\straw\Anaconda3\lib\site-packages\gensim\topic_coherence\indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
Coherence Score: nan
Steps/code
from collections import defaultdict
import timeit
import gensim

grid_flt = defaultdict(list)
# num topics
parameter_list = [2, 5, 10, 15, 20, 25, 30]
for parameter_value in parameter_list:
    print("starting pass for parameter_value = %.3f" % parameter_value)
    start_time = timeit.default_timer()
    # run model
    ldamodel_train_flt = gensim.models.ldamodel.LdaModel(
        corpus=doc_term_matrix_train_flt, id2word=dictionary_train_flt,
        num_topics=parameter_value, passes=25, per_word_topics=True)
    # show elapsed time for model
    elapsed = timeit.default_timer() - start_time
    print("Elapsed time: %s" % elapsed)
    # Compute perplexity
    perplex = ldamodel_train_flt.log_perplexity(doc_term_matrix_test_flt)
    print("Perplexity score: %s" % perplex)
    grid_flt[parameter_value].append(perplex)
    # Compute coherence score
    coherence_model_lda = gensim.models.coherencemodel.CoherenceModel(
        model=ldamodel_train_flt, texts=list_of_docs_flt_test,
        dictionary=dictionary_train_flt, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print("Coherence Score: %s" % coherence_lda)
    grid_flt[parameter_value].append(coherence_lda)
Versions
Windows-10-10.0.17134-SP0
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION -1
@Kaotic-Kiwi Can you please explain what happened here?
I am facing the exact same issue while using the LdaMallet wrapper. Would you please provide the solution?
My function creates multiple models and stores their coherence scores in a list:
def compute_coherence_score(dictionary, corpus, texts, limit, start, step):
    """Compute coherence scores for different numbers of topics."""
    coherence_scores, model_list = [], []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path=mallet_path, corpus=corpus,
                                                 id2word=id2word, num_topics=num_topics)
        model_list.append(model)
        coherencescore = CoherenceModel(model=model, texts=texts,
                                        dictionary=dictionary, coherence='c_v')
        coherence_scores.append(coherencescore.get_coherence())
    return model_list, coherence_scores
**Function call**
model_list, coherence_scores = compute_coherence_score(dictionary=id2word, texts=data_words, corpus=corpus, limit=100, start=50, step=10)
print(model_list)
print(coherence_scores)
Error message: the resulting coherence_scores is [nan, nan, nan, nan] (all NaN values):
/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
m_lr_i = np.log(numerator / denominator)
/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
@Kaotic-Kiwi @mpenkov Can you please let us know if there is any solution for this?
I'm getting a similar error using the default gensim LDA implementation.
I did notice that certain combinations of the number of topics (in LDA) and topn (in CoherenceModel) let the coherence calculation go through; for example, with 30 topics and topn=2 I make it through the calculation.
Any thoughts? Perhaps this is a numerical stability issue?
P.S.: interestingly, with the other window-based methods ‘c_uci’ and ‘c_npmi’ I get inf instead of nan.
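For reference, roughly what that probing looks like (a sketch; lda_model, texts and dictionary are placeholder names):
from gensim.models.coherencemodel import CoherenceModel

# Try several topn values (number of top words per topic used by CoherenceModel)
# to see which settings produce a NaN coherence.
for topn in (2, 5, 10, 20):
    cm = CoherenceModel(model=lda_model, texts=texts,
                        dictionary=dictionary, coherence='c_v', topn=topn)
    print(topn, cm.get_coherence())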
I am getting exactly the same error as @Kaotic-Kiwi when I try to calculate the coherence with c_v. Could somebody please help us or reopen the issue?
I'm wondering whether this comes from adding EPSILON to the numerator rather than the denominator, at lines 202-203 of topic_coherence/direct_confirmation_measure.py:
numerator = (co_occur_count / num_docs) + EPSILON
denominator = (w_prime_count / num_docs) * (w_star_count / num_docs)
m_lr_i = np.log(numerator / denominator)
Adding +EPSILON to the denominator removes the warning+NaN coherence result for me.
[EDIT]: it does remove the first warning, but not the second (RuntimeWarning: invalid value encountered in double_scalars). I'll look into this.
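For clarity, a standalone sketch of that experiment, with EPSILON added to the denominator as well (the function name and the EPSILON value here are placeholders, not gensim's code):
import numpy as np

EPSILON = 1e-12  # plays the same role as gensim's small smoothing constant

def log_ratio_with_eps(co_occur_count, w_prime_count, w_star_count, num_docs):
    # With EPSILON in the denominator too, a zero word-probability product
    # no longer triggers the divide-by-zero inside np.log.
    numerator = (co_occur_count / num_docs) + EPSILON
    denominator = (w_prime_count / num_docs) * (w_star_count / num_docs) + EPSILON
    return np.log(numerator / denominator)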
@HaukeT @aschoenauer-sebag the topic_coherence is a contributed module and its quality may be iffy.
If you're able to fix the issue and open a clean clear PR that'd be great.
Hello, I wrote some different code, and it seems to do the job for me. Apparently, using the corpus parameter instead of the dictionary parameter doesn't create any errors. I think coherence='c_v' doesn't like to be called with the dictionary parameter; I don't quite understand why.
def LdaPipeline(train_set, test_set, k):
    dictionary = gensim.corpora.Dictionary(train_set)
    corpus_train = [dictionary.doc2bow(doc) for doc in train_set]
    corpus_test = [dictionary.doc2bow(doc) for doc in test_set]
    # LDA
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus_train, id2word=dictionary,
                                                num_topics=k, passes=30, alpha='auto')
    # Perplexity
    perplexity = lda_model.log_perplexity(corpus_test)
    # Coherence
    coherence_model = gensim.models.coherencemodel.CoherenceModel(
        model=lda_model, corpus=corpus_train, texts=train_set, coherence='c_v')
    coherence = coherence_model.get_coherence()
    return [perplexity, coherence]
Thank you for reopening the issue and for the replies. The workaround from @Kaotic-Kiwi to only use the corpus parameter and avoid the dictionary parameter did not work for my data. I will try to find an error in my data.
In my case, this error happens when I pass my own prior eta to the model. My eta is a numpy.ndarray with shape (num_topics, num_terms). I initialize eta with the value 1/num_topics and transfer some prior values into the top n rows.
For example, with 3 topics, where the first row is my prior:
[[18, 63, 52, 5, 0, 145], [1/3, 1/3, 1/3, 1/3, 1/3, 1/3], [1/3, 1/3, 1/3, 1/3, 1/3, 1/3]]
The more prior rows I transfer, the more likely the topic coherence calculation is to produce nan (e.g. 30 topics, 10 transferred rows).
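For illustration, a minimal sketch of how such an eta might be built and passed to LdaModel (num_topics, num_terms, corpus and dictionary are placeholder values/names, mirroring the small example above):
import numpy as np
import gensim

num_topics, num_terms = 3, 6
# Uniform 1/num_topics everywhere, then copy the prior into the first row.
eta = np.full((num_topics, num_terms), 1.0 / num_topics)
eta[0] = [18, 63, 52, 5, 0, 145]

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=num_topics, eta=eta)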
Does not work for me either. I am using LdaMallet, and using the corpus parameter instead of the dictionary parameter as per @Kaotic-Kiwi's advice did not help to solve the issue, unfortunately.
I get this error when switching to the corpus parameter:
text_analysis.py in _ids_to_words(ids, dictionary)
55
56 """
---> 57 if not dictionary.id2token: # may not be initialized in the standard gensim.corpora.Dictionary
58 setattr(dictionary, 'id2token', {v: k for k, v in dictionary.token2id.items()})
59
AttributeError: 'dict' object has no attribute 'id2token'
Using u_mass solves the issue, although this is a different metric.
coherencemodel = CoherenceModel(model=model, texts=docs, corpus=corpus, coherence='u_mass')
@kdubovikov What is the full traceback?
I wonder if the dictionary in the code you show is allowed to be a plain dict, or whether it must be a gensim.corpora.Dictionary.
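If it really is a plain dict, one possible way to rebuild a proper gensim.corpora.Dictionary (a sketch, assuming a bag-of-words corpus and an {id: token} mapping named id2word are available) is:
from gensim.corpora import Dictionary

# Rebuild a full Dictionary object from the corpus and the plain id-to-token mapping;
# the result has the id2token attribute that _ids_to_words expects.
dictionary = Dictionary.from_corpus(corpus, id2word=id2word)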
Has anyone found a solution to this problem? I'm still in the dark here, using gensim version 3.8.3. Calculating the coherence value over the training data works fine, but over the test data it gives nan for about 50% of the topics, while the other topics are calculated properly.
In my case, the error is caused by certain topic words not appearing in the test dataset. There is no error after removing these words from the topic words.
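A small sketch (lda_model and test_texts are placeholder names) of how to spot topic words that never occur in the test documents:
# List topic words that are absent from the test texts, since those can
# produce zero counts and hence NaN coherence.
test_vocab = {word for doc in test_texts for word in doc}
for topic_id in range(lda_model.num_topics):
    top_words = [word for word, _ in lda_model.show_topic(topic_id, topn=20)]
    missing = [word for word in top_words if word not in test_vocab]
    if missing:
        print(topic_id, missing)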
Wouldn't that create unrepresentative coherence scores? @RayLei
In my case, the error was caused by several empty documents in the texts dataset (the texts parameter). So I cleaned up the texts, rebuilt the coherence model, and get_coherence() finally returned a coherence score.
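Roughly what that cleanup looks like (a sketch with placeholder names):
from gensim.models.coherencemodel import CoherenceModel

# Drop empty documents from texts before rebuilding the coherence model.
texts = [doc for doc in texts if doc]
coherence_model = CoherenceModel(model=lda_model, texts=texts,
                                 dictionary=dictionary, coherence='c_v')
print(coherence_model.get_coherence())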