Coherence RuntimeWarnings : divide by zero encountered in double_scalars AND invalid value encountered in double_scalars
Problem description
For my internship, I'm trying to evaluate the quality of different LDA models using both perplexity and coherence. I generated each model in a loop, but I run into a problem when the loop reaches parameter_value=15: the coherence score is stored as nan for the models with 15, 20, 25 and 30 topics. I tried fixing this by changing the parameters in LdaModel(), but that only pushes the warning to later models: instead of getting it at parameter_value=15, I get it at parameter_value=30.
Can someone please help me?
Problem encountered: warning
starting pass for parameter_value = 30.000
Elapsed time: 1.6870347789972584
Perplexity score: -13.63168019880968
C:\Users\straw\Anaconda3\lib\site-packages\gensim\topic_coherence\direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
m_lr_i = np.log(numerator / denominator)
C:\Users\straw\Anaconda3\lib\site-packages\gensim\topic_coherence\indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
Coherence Score: nan
Steps/code
from collections import defaultdict
import timeit
import gensim

grid_flt = defaultdict(list)
# num topics
parameter_list = [2, 5, 10, 15, 20, 25, 30]
for parameter_value in parameter_list:
    print("starting pass for parameter_value = %.3f" % parameter_value)
    start_time = timeit.default_timer()
    # run model
    ldamodel_train_flt = gensim.models.ldamodel.LdaModel(
        corpus=doc_term_matrix_train_flt, id2word=dictionary_train_flt,
        num_topics=parameter_value, passes=25, per_word_topics=True)
    # show elapsed time for model
    elapsed = timeit.default_timer() - start_time
    print("Elapsed time: %s" % elapsed)
    # Compute perplexity
    perplex = ldamodel_train_flt.log_perplexity(doc_term_matrix_test_flt)
    print("Perplexity score: %s" % perplex)
    grid_flt[parameter_value].append(perplex)
    # Compute coherence score
    coherence_model_lda = gensim.models.coherencemodel.CoherenceModel(
        model=ldamodel_train_flt, texts=list_of_docs_flt_test,
        dictionary=dictionary_train_flt, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print("Coherence Score: %s" % coherence_lda)
    grid_flt[parameter_value].append(coherence_lda)
Versions
Windows-10-10.0.17134-SP0
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION -1
@Kaotic-Kiwi Can you please explain what happened here?
I am facing the exact same issue while using the LdaMallet wrapper. Would you please provide the solution?
My function creates multiple models and stores their coherence scores in a list:
def compute_coherence_score(dictionary, corpus, texts, limit, start, step):
    """Compute coherence scores for different numbers of topics."""
    coherence_scores, model_list = [], []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path=mallet_path, corpus=corpus,
                                                 id2word=id2word, num_topics=num_topics)
        model_list.append(model)
        coherencescore = CoherenceModel(model=model, texts=texts,
                                        dictionary=dictionary, coherence='c_v')
        coherence_scores.append(coherencescore.get_coherence())
    return model_list, coherence_scores
**Function call**
model_list, coherence_scores = compute_coherence_score(dictionary=id2word, texts=data_words, corpus=corpus, limit=100, start=50, step=10)
print(model_list)
print(coherence_scores)
Error message: the resulting coherence_scores is [nan, nan, nan, nan] (all NaN values):
/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
m_lr_i = np.log(numerator / denominator)
/usr/local/lib/python3.6/dist-packages/gensim/topic_coherence/indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
@Kaotic-Kiwi @mpenkov Can you please let us know if there is any solution for this?
I'm getting a similar error using the default gensim LDA implementation.
I did notice that certain combinations of the number of topics (in LDA) and topn (in CoherenceModel) let the coherence calculation go through; for example, with 30 topics and topn=2 I make it through the calculation.
Any thoughts? Perhaps this is a numerical stability issue?
P.S.: interestingly, with the other window-based methods ‘c_uci’ and ‘c_npmi’ I get inf instead of nan.
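For reference, roughly what that probing looks like (a sketch; lda_model, texts and dictionary are placeholder names):
from gensim.models.coherencemodel import CoherenceModel

# Try several topn values (number of top words per topic used by CoherenceModel)
# to see which settings produce a NaN coherence.
for topn in (2, 5, 10, 20):
    cm = CoherenceModel(model=lda_model, texts=texts,
                        dictionary=dictionary, coherence='c_v', topn=topn)
    print(topn, cm.get_coherence())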
I am getting exactly the same error as @Kaotic-Kiwi when I try to calculate the coherence with c_v. Could somebody please help us or reopen the issue?
I'm wondering whether this comes from adding EPSILON to the numerator rather than the denominator, at lines 202-203 of topic_coherence/direct_confirmation_measure.py:
numerator = (co_occur_count / num_docs) + EPSILON
denominator = (w_prime_count / num_docs) * (w_star_count / num_docs)
m_lr_i = np.log(numerator / denominator)
Adding +EPSILON to the denominator removes the warning+NaN coherence result for me.
[EDIT]: it does remove the first warning, but not the second (RuntimeWarning: invalid value encountered in double_scalars). I'll look into this.
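For clarity, a standalone sketch of that experiment, with EPSILON added to the denominator as well (the function name and the EPSILON value here are placeholders, not gensim's code):
import numpy as np

EPSILON = 1e-12  # plays the same role as gensim's small smoothing constant

def log_ratio_with_eps(co_occur_count, w_prime_count, w_star_count, num_docs):
    # With EPSILON in the denominator too, a zero word-probability product
    # no longer triggers the divide-by-zero inside np.log.
    numerator = (co_occur_count / num_docs) + EPSILON
    denominator = (w_prime_count / num_docs) * (w_star_count / num_docs) + EPSILON
    return np.log(numerator / denominator)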
@HaukeT @aschoenauer-sebag the topic_coherence is a contributed module and its quality may be iffy.
If you're able to fix the issue and open a clean clear PR that'd be great.
Hello, I wrote some different code, and it seems to do the job for me. Apparently, using the corpus parameter instead of the dictionary parameter doesn't create any errors. I think coherence='c_v' doesn't like to be called with the dictionary parameter; I don't quite understand why.
def LdaPipeline(train_set, test_set, k):
    dictionary = gensim.corpora.Dictionary(train_set)
    corpus_train = [dictionary.doc2bow(doc) for doc in train_set]
    corpus_test = [dictionary.doc2bow(doc) for doc in test_set]
    # LDA
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus_train, id2word=dictionary,
                                                num_topics=k, passes=30, alpha='auto')
    # Perplexity
    perplexity = lda_model.log_perplexity(corpus_test)
    # Coherence
    coherence_model = gensim.models.coherencemodel.CoherenceModel(
        model=lda_model, corpus=corpus_train, texts=train_set, coherence='c_v')
    coherence = coherence_model.get_coherence()
    return [perplexity, coherence]
Thank you for reopening the issue and for the replies. The workaround from @Kaotic-Kiwi to only use the corpus parameter and avoid the dictionary parameter did not work for my data. I will try to find an error in my data.
In my case, this error happens when I pass my own prior eta to the model. My eta is a numpy.ndarray with shape (num_topics, num_terms). I initialize eta with the value 1/num_topics and transfer some prior values into the top n rows.
For example, with 3 topics, where the first row is my prior:
[[18, 63, 52, 5, 0, 145], [1/3, 1/3, 1/3, 1/3, 1/3, 1/3], [1/3, 1/3, 1/3, 1/3, 1/3, 1/3]]
The more prior rows I transfer, the more likely the topic coherence calculation is to produce nan (e.g. 30 topics, 10 transferred rows).
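For illustration, a minimal sketch of how such an eta might be built and passed to LdaModel (num_topics, num_terms, corpus and dictionary are placeholder values/names, mirroring the small example above):
import numpy as np
import gensim

num_topics, num_terms = 3, 6
# Uniform 1/num_topics everywhere, then copy the prior into the first row.
eta = np.full((num_topics, num_terms), 1.0 / num_topics)
eta[0] = [18, 63, 52, 5, 0, 145]

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=num_topics, eta=eta)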
Does not work for me either. I am using LdaMallet, and using the corpus parameter instead of the dictionary parameter as per @Kaotic-Kiwi's advice did not help to solve the issue, unfortunately.
I get this error when switching to the corpus parameter:
text_analysis.py in _ids_to_words(ids, dictionary)
55
56 """
---> 57 if not dictionary.id2token: # may not be initialized in the standard gensim.corpora.Dictionary
58 setattr(dictionary, 'id2token', {v: k for k, v in dictionary.token2id.items()})
59
AttributeError: 'dict' object has no attribute 'id2token'
Using u_mass solves the issue, although this is a different metric.
coherencemodel = CoherenceModel(model=model, texts=docs, corpus=corpus, coherence='u_mass')
@kdubovikov What is the full traceback?
I wonder if the dictionary in the code you show is allowed to be a plain dict, or whether it must be a gensim.corpora.Dictionary.
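If it really is a plain dict, one possible way to rebuild a proper gensim.corpora.Dictionary (a sketch, assuming a bag-of-words corpus and an {id: token} mapping named id2word are available) is:
from gensim.corpora import Dictionary

# Rebuild a full Dictionary object from the corpus and the plain id-to-token mapping;
# the result has the id2token attribute that _ids_to_words expects.
dictionary = Dictionary.from_corpus(corpus, id2word=id2word)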
Has anyone found a solution to this problem? I'm still in the dark here, using gensim version 3.8.3. Calculating the coherence value over the training data works fine, but over the test data it gives nan for about 50% of the topics, while the other topics are calculated properly.
In my case, the error is caused by certain topic words not appearing in the test dataset. There is no error after removing these words from the topic words.
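A small sketch (lda_model and test_texts are placeholder names) of how to spot topic words that never occur in the test documents:
# List topic words that are absent from the test texts, since those can
# produce zero counts and hence NaN coherence.
test_vocab = {word for doc in test_texts for word in doc}
for topic_id in range(lda_model.num_topics):
    top_words = [word for word, _ in lda_model.show_topic(topic_id, topn=20)]
    missing = [word for word in top_words if word not in test_vocab]
    if missing:
        print(topic_id, missing)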
Wouldn't that create unrepresentative coherence scores? @RayLei
In my case, the error was caused by several empty documents in the texts dataset (the texts parameter). So I cleaned up the texts, rebuilt the coherence model, and get_coherence() finally returned a coherence score.
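Roughly what that cleanup looks like (a sketch with placeholder names):
from gensim.models.coherencemodel import CoherenceModel

# Drop empty documents from texts before rebuilding the coherence model.
texts = [doc for doc in texts if doc]
coherence_model = CoherenceModel(model=lda_model, texts=texts,
                                 dictionary=dictionary, coherence='c_v')
print(coherence_model.get_coherence())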