ETM

Topic Coherence Computation: Division by 45?

Open mona-timmermann opened this issue 4 years ago • 2 comments

Why do they divide by 45 when computing topic coherence based on normalised PMI? The paper describes the computation, but the code looks different to me.

[Image: Screen Shot 2020-12-10 at 16 38 23]

mona-timmermann, Dec 10 '20 15:12

Hi mona-timmermann, the reason for the 45 is that there are 45 ways of picking 2 distinct words from a list of 10 words: 10 * 9 / 2 = 45. Equivalently, there are 45 (i, j) summation indices used in the TC equation above. You divide by 45 so that you get the average normalised PMI over all word pairs.
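To make the counting argument concrete, here is a minimal sketch that enumerates the unordered pairs among 10 top words and averages a score over them. The word list and the `mean_pairwise` helper are hypothetical and only illustrate the arithmetic; they are not taken from the ETM code.

```python
from itertools import combinations
from math import comb

# Hypothetical placeholders standing in for a topic's 10 top words.
top_words = [f"w{i}" for i in range(10)]

# Every unordered pair (i < j), i.e. the summation indices of the TC equation.
pairs = list(combinations(top_words, 2))
n_pairs = len(pairs)

# Same number obtained from the binomial coefficient "10 choose 2".
assert n_pairs == comb(10, 2)


def mean_pairwise(words, score):
    """Average score(w_i, w_j) over all unordered pairs of words.

    `score` is any pairwise function; in the topic coherence setting it
    would be the normalised PMI of the two words.
    """
    word_pairs = list(combinations(words, 2))
    return sum(score(a, b) for a, b in word_pairs) / len(word_pairs)


print(n_pairs)
```

Dividing the summed pairwise scores by `n_pairs` is what turns the sum into an average, so the divisor has to match the number of words actually used.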

jfcann, Jan 21 '21 18:01

If we run the 'eval' mode, the log file shows counter = 55. I think this is due to a small off-by-one error in the get_topic_coherence() function: top_10 = list(beta[k].argsort()[-11:][::-1]) selects 11 words, and 11 words give 11 * 10 / 2 = 55 pairs. It should instead be top_10 = list(beta[k].argsort()[-10:][::-1]). After this change, the counter equals 45.
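The effect of the slice can be checked with a small sketch. It assumes the counter is incremented once per unordered word pair, which is consistent with the 55 seen in the log but is not copied from the ETM source. The `argsort` helper is a pure-Python stand-in for NumPy's `argsort`, and `beta_k` is a made-up row of topic-word weights.

```python
def argsort(values):
    """Indices that sort `values` ascending (stand-in for NumPy's argsort)."""
    return sorted(range(len(values)), key=values.__getitem__)


def top_indices(values, n):
    """Indices of the n largest entries, largest first."""
    return argsort(values)[-n:][::-1]


def count_pairs(n_words):
    """Count unordered pairs (i < j), assumed to mirror the logged counter."""
    counter = 0
    for i in range(n_words):
        for j in range(i + 1, n_words):
            counter += 1
    return counter


# Hypothetical topic-word weights for one topic over a 20-word vocabulary.
beta_k = [i / 100 for i in range(20)]

top_11 = top_indices(beta_k, 11)  # what the [-11:] slice selects
top_10 = top_indices(beta_k, 10)  # what the [-10:] slice selects

print(len(top_11), count_pairs(len(top_11)))
print(len(top_10), count_pairs(len(top_10)))
```

With the `[-11:]` slice the list holds 11 indices, so the pair count no longer matches a divisor of 45.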

yuyangstatistics, Nov 07 '21 15:11