ETM
Topic Coherence Computation: Division by 45?
Why is the topic coherence, which is based on normalised PMI, divided by 45? The paper says so, but the computation in the code looks different to me.
[Image: Screen Shot 2020-12-10 at 16 38 23]
Hi mona-timmermann, the reason for the 45 is that there are 45 ways of picking 2 distinct words from a list of 10 words: C(10, 2) = 10 * 9 / 2 = 45. Equivalently, the sum in the TC equation above runs over 45 (i, j) index pairs. Dividing by 45 therefore gives the average PMI over all word pairs.
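To make the counting concrete, here is a minimal Python sketch. It is not the repository's code: the word list and the helper mean_pairwise_score are hypothetical, and the scoring function is a stand-in for whatever pairwise measure (for example normalised PMI) is actually used.

```python
from itertools import combinations
from math import comb

# Hypothetical placeholder for the top-10 words of one topic.
top_words = [f"word{i}" for i in range(10)]

# Every unordered pair of distinct words, i.e. all (i, j) with i < j.
pairs = list(combinations(top_words, 2))
n_pairs = len(pairs)


def mean_pairwise_score(words, score_fn):
    """Average a pairwise score over all unordered pairs of distinct words.

    score_fn is an assumed stand-in for a pairwise measure such as NPMI.
    """
    word_pairs = list(combinations(words, 2))
    total = sum(score_fn(a, b) for a, b in word_pairs)
    return total / len(word_pairs)


print(n_pairs, comb(10, 2))
```

With 10 words the denominator inside mean_pairwise_score is the same 45 that the paper divides by.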
If we run the 'eval' mode, the log file shows counter = 55. I think this is due to a small error in the get_topic_coherence() function: top_10 = list(beta[k].argsort()[-11:][::-1]) selects 11 words rather than 10, and 11 words form 55 pairs. It should instead be top_10 = list(beta[k].argsort()[-10:][::-1]). After this change, the counter equals 45.
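The effect of the slice can be checked in isolation with a small sketch. This is not the repository's get_topic_coherence(); the array beta_k is random stand-in data for one topic's word weights, and count_pairs is a hypothetical helper that only mimics counting (i, j) pairs with i < j.

```python
import numpy as np

# Hypothetical stand-in for one row of beta (word weights of a single topic).
rng = np.random.default_rng(0)
beta_k = rng.random(50)

# argsort() sorts ascending, so the last n indices are the n largest weights;
# [::-1] reverses them into descending order.
top_11 = list(beta_k.argsort()[-11:][::-1])  # slice quoted in the report: 11 words
top_10 = list(beta_k.argsort()[-10:][::-1])  # proposed fix: 10 words


def count_pairs(indices):
    """Count the (i, j) pairs with i < j, one increment per pair."""
    counter = 0
    for i in range(len(indices)):
        for j in range(i + 1, len(indices)):
            counter += 1
    return counter


print(len(top_11), count_pairs(top_11))
print(len(top_10), count_pairs(top_10))
```

The two slices agree on the first ten indices; the [-11:] version simply appends the eleventh-ranked word, which adds ten extra pairs.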