
'nan' output for CoherenceModel when calculating 'c_v'

Open job-almekinders opened this issue 3 years ago • 5 comments

Problem description

I'm using LDA Multicore from gensim 3.8.3. I'm training on my train corpus and I'm able to evaluate the train corpus using the CoherenceModel within Gensim, to calculate the 'c_v' value. However, when I'm trying to calculate the 'c_v' over my test set, it throws the following warning:

/Users/xxx/env/lib/python3.7/site-packages/gensim/topic_coherence/direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
  m_lr_i = np.log(numerator / denominator)
/Users/xxx/lib/python3.7/site-packages/gensim/topic_coherence/indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
  return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))

Furthermore, the output value of the CoherenceModel is 'nan' for some of the topics and therefore I'm not able to evaluate my model on a heldout test set.

Steps/code/corpus to reproduce

I run the following code:

coherence_model_lda = models.CoherenceModel(model=lda_model,
                                            topics=topic_list,
                                            corpus=corpus,
                                            texts=texts,
                                            dictionary=train_dictionary,
                                            coherence='c_v',
                                            topn=20
                                            )

coherence_model_lda.get_coherence()  # returns nan (the aggregated c_v value)

coherence_model_lda.get_coherence_per_topic()  # returns the c_v value per topic: [0.4855137269180713, 0.3718866594914528, nan, nan, nan, 0.6782845928414825, 0.21638660621444444, 0.22337594485796397, 0.5975773184175942, 0.721341268732559, 0.5299883104816663, 0.5057903454344682, 0.5818051100304473, nan, nan, 0.30613393712342557, nan, 0.4104488627000527, nan, nan, 0.46028708148750963, nan, 0.394606654755219, 0.520685457293826, 0.5918440959767729, nan, nan, 0.4842068862650447, 0.9350644411891258, nan, nan, 0.7471151926054456, nan, nan, 0.5084926961568169, nan, nan, 0.4322957454944861, nan, nan, nan, 0.6460815758337844, 0.5810936860540964, 0.6636319471764807, nan, 0.6129884526648472, 0.48915614063099017, 0.4746167359622748, nan, 0.6826979166639224]
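Not a fix for the underlying zero counts, but as a stopgap for still getting an aggregate score, NumPy's `nanmean` can average only the topics that could be scored. A minimal sketch, using a shortened slice of the per-topic output above:

```python
import numpy as np

# First six per-topic values from the output above; nan marks topics whose
# top words never occur in the held-out corpus.
per_topic = [0.4855137269180713, 0.3718866594914528,
             float("nan"), float("nan"), float("nan"),
             0.6782845928414825]

# np.nanmean ignores the nan entries and averages the scorable topics only.
print(np.nanmean(per_topic))
```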

I've tried increasing the EPSILON value within gensim.topic_coherence.direct_confirmation_measure, but this has no effect.

Furthermore, I've tried changing the input arguments (e.g. excluding the dictionary argument), but this also has no effect. I suspect the error has to do with the fact that quite a large portion of the words in the test set does not occur in the train set; however, the EPSILON value should be able to handle this.
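One way to check this suspicion is to count, per topic, how many of its top words ever occur in the held-out texts; topics whose top words are entirely absent are the ones that come back as nan. A minimal, self-contained sketch (the `topic_list` and `test_texts` values here are made-up stand-ins, not from the issue):

```python
from collections import Counter

# Hypothetical top words per topic and held-out texts, for illustration only.
topic_list = [
    ["model", "data", "training"],       # topic whose words do occur below
    ["quantum", "entanglement", "flux"], # topic whose words never occur below
]
test_texts = [
    ["the", "model", "uses", "data"],
    ["training", "data", "and", "model"],
]

# Word frequencies over the held-out texts.
test_freq = Counter(word for doc in test_texts for word in doc)

for i, topic in enumerate(topic_list):
    missing = [w for w in topic if test_freq[w] == 0]
    print(f"topic {i}: {len(missing)}/{len(topic)} top words missing from test set: {missing}")
```

Topics like the second one, with every top word missing, are the candidates for nan coherence.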

Versions

python
Python 3.7.2 (default, Dec  2 2020, 09:47:26) 
[Clang 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
Darwin-18.7.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.7.2 (default, Dec  2 2020, 09:47:26) 
[Clang 9.0.0 (clang-900.0.39.2)]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.18.5
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.5.2
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.8.3
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

job-almekinders avatar Feb 04 '21 17:02 job-almekinders

I've found the source of the error. When one of the top words of a topic of the trained model has a frequency count of 0 in the test corpus, the CoherenceModel throws this warning and outputs a 'nan' value.

I've made a small addition to the gensim.topic_coherence.direct_confirmation_measure.log_ratio_measure function. You can place this function in a separate .py file, import it into your main module, and overwrite the regular log_ratio_measure function with the custom one.

The code to place in the .py file is the following:

import logging

import numpy as np
from gensim.topic_coherence import direct_confirmation_measure

log = logging.getLogger(__name__)

ADD_VALUE = 1


def custom_log_ratio_measure(segmented_topics, accumulator, normalize=False, with_std=False, with_support=False):
    topic_coherences = []
    num_docs = float(accumulator.num_docs)
    for s_i in segmented_topics:
        segment_sims = []
        for w_prime, w_star in s_i:
            w_prime_count = accumulator[w_prime]
            w_star_count = accumulator[w_star]
            co_occur_count = accumulator[w_prime, w_star]

            if normalize:
                # For normalized log ratio measure
                numerator = custom_log_ratio_measure([[(w_prime, w_star)]], accumulator)[0]
                co_doc_prob = co_occur_count / num_docs
                m_lr_i = numerator / (-np.log(co_doc_prob + direct_confirmation_measure.EPSILON))
            else:
                # For log ratio measure without normalization.
                # Custom addition: the following 6 lines prevent a division-by-zero error.
                if w_star_count == 0:
                    log.info(f"w_star_count of {w_star} == 0. Adding {ADD_VALUE} to the count to prevent an error.")
                    w_star_count += ADD_VALUE
                if w_prime_count == 0:
                    log.info(f"w_prime_count of {w_prime} == 0. Adding {ADD_VALUE} to the count to prevent an error.")
                    w_prime_count += ADD_VALUE
                numerator = (co_occur_count / num_docs) + direct_confirmation_measure.EPSILON
                denominator = (w_prime_count / num_docs) * (w_star_count / num_docs)
                m_lr_i = np.log(numerator / denominator)

            segment_sims.append(m_lr_i)

        topic_coherences.append(direct_confirmation_measure.aggregate_segment_sims(segment_sims, with_std, with_support))

    return topic_coherences

Then, you can overwrite the original function by doing the following:

from gensim.topic_coherence import direct_confirmation_measure
from my_custom_module import custom_log_ratio_measure

direct_confirmation_measure.log_ratio_measure = custom_log_ratio_measure

I've run a few tests to check whether the output values seem sensible, and in my opinion they do. However, I'm not 100% sure; if anyone could verify this, it would be greatly appreciated.

Looking forward to reading some replies!

job-almekinders avatar Feb 05 '21 12:02 job-almekinders

@Jobtimize Thank you for drawing our attention to this. Are you interested in making a pull request to fix the issue?

mpenkov avatar Feb 09 '21 02:02 mpenkov

Yes, I'll make a request somewhere in the next two weeks.

job-almekinders avatar Feb 09 '21 07:02 job-almekinders

I have also run into this issue when trying to calculate c_v coherence on held-out documents. I overcame it by adding the EPSILON value in cases where the denominator was 0, so that direct_confirmation_measure.log_ratio_measure looks like:

import numpy as np
from gensim.topic_coherence.direct_confirmation_measure import EPSILON, aggregate_segment_sims

def log_ratio_measure(segmented_topics, accumulator, normalize=False, with_std=False, with_support=False):
    topic_coherences = []
    num_docs = float(accumulator.num_docs)
    for s_i in segmented_topics:
        segment_sims = []
        for w_prime, w_star in s_i:
            w_prime_count = accumulator[w_prime]
            w_star_count = accumulator[w_star]
            co_occur_count = accumulator[w_prime, w_star]
            if normalize:
                # For normalized log ratio measure
                numerator = log_ratio_measure([[(w_prime, w_star)]], accumulator)[0]
                co_doc_prob = co_occur_count / num_docs
                m_lr_i = numerator / (-np.log(co_doc_prob + EPSILON))
            else:
                # For log ratio measure without normalization
                numerator = (co_occur_count / num_docs) + EPSILON
                denominator = (w_prime_count / num_docs) * (w_star_count / num_docs)
                # Check the value of denominator and adjust to epsilon to prevent divided by zero error
                if abs(denominator) < EPSILON:
                    denominator = denominator + EPSILON
                m_lr_i = np.log(numerator / denominator)
            segment_sims.append(m_lr_i)
        topic_coherences.append(aggregate_segment_sims(segment_sims, with_std, with_support))
    return topic_coherences

I wonder if the approach of @Jobtimize would work well in cases where the number of documents is low (e.g. fewer than 10, 20, or 100), since the denominator values after w_star_count += ADD_VALUE and w_prime_count += ADD_VALUE will be significantly larger than the expected approximately 0. Maybe it should be tested.
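A quick numeric sketch of how far the two fixes diverge when a count is zero, using made-up counts and gensim's EPSILON = 1e-12: adding EPSILON to the denominator makes numerator and denominator both equal EPSILON, so the ratio collapses to 1 and the log to 0, while bumping the zero count by ADD_VALUE = 1 yields a large negative log that also depends on num_docs:

```python
import numpy as np

EPSILON = 1e-12    # gensim's constant in direct_confirmation_measure
num_docs = 10.0    # a deliberately small held-out corpus
co_occur_count = 0
w_prime_count = 5
w_star_count = 0   # the problematic zero count

numerator = (co_occur_count / num_docs) + EPSILON

# Fix 1: bump the zero count by ADD_VALUE = 1 (the @Jobtimize approach).
denominator_add = (w_prime_count / num_docs) * ((w_star_count + 1) / num_docs)

# Fix 2: add EPSILON to the near-zero denominator (this comment's approach).
denominator_eps = (w_prime_count / num_docs) * (w_star_count / num_docs) + EPSILON

print(np.log(numerator / denominator_add))  # large negative value (~ -24.6 here)
print(np.log(numerator / denominator_eps))  # 0.0: EPSILON / EPSILON == 1
```

So the EPSILON fix scores the missing pair as neutral, while the ADD_VALUE fix scores it as strongly negative, and more so the smaller the corpus; the two are not interchangeable.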

Likewise, to prevent the divide-by-zero error in the indirect_confirmation_measure._cossim method, I added the EPSILON value to the product of the magnitudes:

from gensim.topic_coherence.direct_confirmation_measure import EPSILON
from gensim.topic_coherence.indirect_confirmation_measure import _magnitude

def _cossim(cv1, cv2):
    denominator = _magnitude(cv1) * _magnitude(cv2)
    if abs(denominator) < EPSILON:
        denominator = denominator + EPSILON
    return cv1.T.dot(cv2)[0, 0] / denominator

Although I'm also not sure this is the right approach to solving the division error. It would be good to test it under different held-out scenarios (a low number of documents, a large number of documents, out-of-topic documents, and the like), although I'm not sure how to proceed with this.

Thanks; I would appreciate any replies regarding the approach.

ccastroh89 avatar Apr 03 '21 19:04 ccastroh89

Any update on this issue? I am still facing it, and I tried @Jobtimize's answer, but it causes all my coherence scores to be nearly 1.0 regardless of the number of topics, which does not make sense.

yudhiesh avatar Aug 14 '21 15:08 yudhiesh