Weird negative correlation between C_V and the other measures
Dear all,
I've tested the Palmetto measures on topics extracted from three different datasets and, every time, the C_V measure is clearly negatively correlated with the other measures (in particular, UCI and UMASS). The resulting rankings are totally different: a topic considered "good" by C_V is considered "bad" by the others.

Below are two examples extracted from 20 Newsgroups (10 top words). The values come from the Palmetto online app (http://palmetto.aksw.org/palmetto-webapp/).
- Really bad topic: `db bh mov mp cs si mf m4 mx mj` → C_V = 0.5880757872641666, UCI = -7.432080077660484, UMASS = -8.11957747860957
- Quite good topic: `president mr information 1993 national april states american year united` → C_V = 0.3779058142243831, UCI = -2.5019232533537896, UMASS = -3.443209220661342
What is weird is that I've checked Michael's paper, and all the measures are supposed to be maximized :-|. I'd be grateful if anyone can help me solve this intriguing issue...
Hi Velcin,
thanks for using Palmetto and for sharing these interesting examples. The bad topic seems to contain words that are very rare. Maybe this is a special case in which C_V behaves badly. However, that wouldn't explain the overall negative correlation :thinking:
Could you please give some more information on your preprocessing of the documents? From the quite good topic, I can already see that you kept numbers (e.g., 1993). Did you use lemmatization or stemming?
How many topics did you create?
What do you mean by "all the measures have to be maximized"? Do you mean that for all coherences a higher value is better?
My preprocessing: standard tokenization, lowercasing, removing punctuation (actually, I just kept letters and digits), and no lemmatization or stemming.
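For reference, a minimal Python sketch of the preprocessing described above (lowercase, keep only letters and digits, no stemming or lemmatization); the exact tokenizer used is an assumption:

```python
import re

def preprocess(text):
    """Lowercase, then keep only alphanumeric tokens (no stemming/lemmatization)."""
    # Replace every character that is not a lowercase letter or digit with a
    # space (this removes punctuation), then split on whitespace.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

print(preprocess("President Mr. Clinton, in April 1993, ..."))
# -> ['president', 'mr', 'clinton', 'in', 'april', '1993']
```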
For the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/), I set the number of topics to the true number of classes (here, 20). Below is the full list of topics (the probabilities are hidden):
topic 0 : article writes medical disease health patients cancer food doctor msg
topic 1 : space nasa writes article earth gov launch orbit shuttle system
topic 2 : president mr information 1993 national april states american year united
topic 3 : ax max g9v b8f a86 pl 145 1d9 0t 1t
topic 4 : image file ftp graphics files software pub data jpeg images
topic 5 : 0d cx w7 145 34u ah scx t7 uw lk
topic 6 : car writes article good cars power ve engine work time
topic 7 : 10 00 12 15 11 20 14 93 92 18
topic 8 : writes article sex homosexual cramer men moral gay objective morality
topic 9 : israel jews turkish armenian writes article israeli armenians jewish war
topic 10 : gun writes article government law guns state rights crime control
topic 11 : windows dos drive system writes card mac article apple problem
topic 12 : window file program server motif widget set display application output
topic 13 : people fire writes fbi children article koresh batf didn government
topic 14 : god jesus christian bible christ church christians writes faith life
topic 15 : key encryption chip government clipper writes keys article system security
topic 16 : db bh mov mp cs si mf m4 mx mj
topic 17 : writes year game team article games baseball players good league
topic 18 : mail university writes article ca internet email cs ac fax
topic 19 : don people time make good point things ve question fact
By "all the measures have to be maximized", I mean exactly that: for all coherences, a higher value is better, because they are based on variants of the PMI. That's what I understood from your paper ;).
I found similar negative correlations with both topic models I tested.

@provalis thanks for sharing your results. Can you please describe your preprocessing?
@Velcin from my point of view, the problem is caused by the difference between the preprocessing of your corpus and that of the Wikipedia documents I used for indexing. Since you didn't use lemmatization, your topics can contain word forms that cannot be found in the index. I wanted to take a deeper look into your example and check the counts of the single words and word pairs. Unfortunately, I haven't had the time to do that yet and most probably won't have it during this month. Sorry for that.
Note that the index used for calculating the coherence should always be created with exactly the same preprocessing as the topic modeling corpus. Otherwise the numbers might not be reliable.
Thank you for the answer, Michaël. By the way, I've created my own index with the exact same preprocessing as the one used for topic modeling... If you don't have the time, please tell me the easiest way to check the counts myself, to be sure there is no miscoding or anything else going wrong.
Did you use your own index for the numbers reported above? I thought you had used the webservice.
Sorry if I'm not being clear. The correlation matrix above was created based on my own index. But for the sake of comparison (and to be sure that my indexing wasn't wrong), I re-computed the measures for 2 topics using your demo app. The same negative correlation can be observed there. I acknowledge that I fed your index with unnormalized terms (e.g., states).
Finally, I had some time to take a deeper look into Velcin's two examples. The probabilities clearly support your impression that the bad topic should have a very low coherence while the good topic should have a higher value.
The pairs of the bad topic have much lower probabilities (in many cases they simply do not occur together):
| | bh | mov | mp | cs | si | mf | m4 | mx | mj |
|---|---|---|---|---|---|---|---|---|---|
| bh | 1.508E-4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mov | 0.0 | 5.652E-5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mp | 0.0 | 0.0 | 0.004 | 0.0 | 4.575E-6 | 4.199E-6 | 3.092E-6 | 0.0 | 2.213E-6 |
| cs | 0.0 | 0.0 | 0.0 | 2.636E-4 | 1.719E-6 | 1.267E-6 | 0.0 | 0.0 | 1.320E-6 |
| si | 0.0 | 0.0 | 4.575E-6 | 1.719E-6 | 0.001 | 1.390E-6 | 0.0 | 1.385E-6 | 1.954E-6 |
| mf | 0.0 | 0.0 | 4.199E-6 | 1.267E-6 | 1.390E-6 | 2.503E-4 | 0.0 | 0.0 | 2.028E-6 |
| m4 | 0.0 | 0.0 | 3.092E-6 | 0.0 | 0.0 | 0.0 | 3.033E-4 | 0.0 | 0.0 |
| mx | 0.0 | 0.0 | 0.0 | 0.0 | 1.385E-6 | 0.0 | 0.0 | 1.781E-4 | 0.0 |
| mj | 0.0 | 0.0 | 2.213E-6 | 1.320E-6 | 1.954E-6 | 2.028E-6 | 0.0 | 0.0 | 2.390E-4 |
The word pairs of the better topic have higher probabilities (apart from ("united", "mr") and ("states", "mr")):
| | mr | information | 1993 | national | april | states | american | year | united |
|---|---|---|---|---|---|---|---|---|---|
| mr | 0.015 | 4.158E-4 | 4.695E-4 | 0.001 | 9.616E-4 | 0.0 | 0.001 | 0.003 | 0.0 |
| information | 4.158E-4 | 0.031 | 6.499E-4 | 0.003 | 0.001 | 1.871E-6 | 0.002 | 0.005 | 2.145E-6 |
| 1993 | 4.695E-4 | 6.499E-4 | 0.026 | 0.003 | 0.001 | 0.0 | 0.002 | 0.006 | 9.183E-6 |
| national | 0.001 | 0.003 | 0.003 | 0.094 | 0.006 | 6.308E-6 | 0.011 | 0.025 | 2.712E-5 |
| april | 9.616E-4 | 0.001 | 0.001 | 0.006 | 0.061 | 3.307E-6 | 0.005 | 0.014 | 4.116E-5 |
| states | 0.0 | 1.871E-6 | 0.0 | 6.308E-6 | 3.307E-6 | 3.660E-5 | 9.422E-6 | 7.391E-6 | 0.0 |
| american | 0.001 | 0.002 | 0.002 | 0.011 | 0.005 | 9.422E-6 | 0.080 | 0.020 | 9.595E-6 |
| year | 0.003 | 0.005 | 0.006 | 0.025 | 0.014 | 7.391E-6 | 0.020 | 0.213 | 1.137E-4 |
| united | 0.0 | 2.145E-6 | 9.183E-6 | 2.712E-5 | 4.116E-5 | 0.0 | 9.595E-6 | 1.137E-4 | 3.536E-4 |
I think there is a numeric problem that leads to high NPMI values. I will dig deeper as soon as I have one or two free hours again :wink:
Thank you for your concern. Feel free to ask me if I can help you in any way.
Hi Velcin,
I think that the major problem of Palmetto's NPMI implementation is the epsilon (e). As described in the paper, the NPMI is implemented as
NPMI(W',W*) = log((P(W',W*) + e) / (P(W') * P(W*))) / (-log(P(W',W*) + e))
The problem with this implementation is that for very small probabilities the influence of e grows. This causes larger NPMI values for your "bad" topic.
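A small Python sketch illustrates the effect (the epsilon value and probabilities here are illustrative, not Palmetto's actual configuration, and the standard −log(·) normalization is assumed): for two pairs that never co-occur, the smoothed NPMI of the rare-word pair comes out closer to 0 than that of the frequent-word pair, so rare words are penalized less.

```python
import math

EPSILON = 1e-12  # assumed value for illustration

def npmi_smoothed(p_joint, p1, p2, e=EPSILON):
    """NPMI with additive smoothing, as in the formula above."""
    return math.log((p_joint + e) / (p1 * p2)) / -math.log(p_joint + e)

# Two word pairs that never co-occur (P(W',W*) = 0):
rare = npmi_smoothed(0.0, 1.5e-4, 5.7e-5)  # rare words, like the "bad" topic
common = npmi_smoothed(0.0, 1e-2, 1e-2)    # frequent words

# The rarer the words, the closer the smoothed NPMI gets to 0, although
# neither pair ever co-occurs.
print(rare, common)  # rare > common
```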
Unlike the PMI, the NPMI does not need the epsilon, so I tested an implementation without it:
NPMI(W',W*) = 0, if P(W',W*) = 0, P(W') = 0, or P(W*) = 0
NPMI(W',W*) = log(P(W',W*) / (P(W') * P(W*))) / (-log(P(W',W*))), otherwise
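A sketch of this epsilon-free variant (again assuming the standard −log normalization; the example probabilities are illustrative): non-co-occurring pairs now score exactly 0 instead of an epsilon-dependent value.

```python
import math

def npmi_unsmoothed(p_joint, p1, p2):
    """Epsilon-free NPMI: 0 when any probability is 0, per the piecewise definition above."""
    if p_joint == 0.0 or p1 == 0.0 or p2 == 0.0:
        return 0.0
    return math.log(p_joint / (p1 * p2)) / -math.log(p_joint)

# Non-co-occurring rare pair, like those in the "bad" topic:
print(npmi_unsmoothed(0.0, 1.5e-4, 5.7e-5))  # -> 0.0

# A pair that co-occurs 10x more often than independence would predict:
print(npmi_unsmoothed(1e-3, 1e-2, 1e-2))     # -> 0.333... (1/3, a positive association)
```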
Used in the C_V coherence (C_V*), it leads to the following results:
C_V*(db bh mov mp cs si mf m4 mx mj) = 0.3507252914369107
C_V*(president mr information 1993 national april states american year united) = 0.3387159339633883
Note that the lowest possible C_V* value for 10 top words is 0.3162277660168379, i.e., both topics are rated as bad. This can be seen by looking at the individual NPMI values:
| | bh | mov | mp | cs | si | mf | m4 | mx | mj |
|---|---|---|---|---|---|---|---|---|
| bh | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mov | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mp | 0.0 | 0.0 | 1.0 | 0.0 | 0.001 | 0.011 | 0.004 | 0.0 |
| cs | 0.0 | 0.0 | 0.0 | 1.0 | 0.010 | 0.047 | 0.0 | 0.0 |
| si | 0.0 | 0.0 | 0.001 | 0.010 | 1.0 | 0.007 | 0.0 | 0.012 |
| mf | 0.0 | 0.0 | 0.011 | 0.047 | 0.007 | 1.0 | 0.0 | 0.0 |
| m4 | 0.0 | 0.0 | 0.004 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| mx | 0.0 | 0.0 | 0.0 | 0.0 | 0.012 | 0.0 | 0.0 | 1.0 |
| mj | 0.0 | 0.0 | 0.003 | 0.050 | 0.014 | 0.072 | 0.0 | 0.0 |
Note that some of the word pairs have an NPMI > 0 because the words are used as abbreviations or top-level domains.
| | mr | information | 1993 | national | april | states | american | year | united |
|---|---|---|---|---|---|---|---|---|
| mr | 1.0 | 4.4E-4 | 2.3E-4 | 6.04E-4 | 2.0E-6 | 0.0 | 1.5E-4 | 3.2E-4 |
| information | 4.4E-4 | 1.0 | 0.001 | 0.001 | 8.4E-4 | 0.001 | 0.0010 | 0.002 |
| 1993 | 2.3E-4 | 0.001 | 1.0 | 0.001 | 6.6E-4 | 0.0 | 0.001 | 4.9E-4 |
| national | 6.0E-4 | 0.001 | 0.001 | 1.0 | 5.3E-4 | 0.002 | 0.006 | 0.003 |
| april | 2.0E-6 | 8.4E-4 | 6.6E-4 | 5.3E-4 | 1.0 | 9.6E-4 | 6.8E-5 | 7.1E-4 |
| states | 0.0 | 0.001 | 0.0 | 0.002 | 9.6E-4 | 1.0 | 0.010 | 2.3E-5 |
| american | 1.5E-4 | 0.001 | 0.001 | 0.006 | 6.8E-5 | 0.010 | 1.0 | 0.001 |
| year | 3.2E-4 | 0.002 | 4.9E-4 | 0.003 | 7.1E-4 | 2.3E-5 | 0.001 | 1.0 |
| united | 0.0 | 0.015 | 7.4E-6 | 3.8E-4 | 0.004 | 0.0 | 0.008 | 0.002 |
You can see that the NPMI values are mostly larger than 0 but still very low. From my point of view this is caused by the high probabilities of the single words: the word pairs do not occur much more often than would be expected if the words had no relation to each other.
Thus, C_V without the epsilon states that the first topic comprises unknown/rare words that are not very related to each other, while the second topic combines common words.
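The floor value of ~0.3162 for 10 top words can be reproduced with a toy calculation. This is a simplified sketch of the indirect confirmation idea behind C_V (cosine similarity between each word's vector of NPMI values and the sum of all such vectors), ignoring the sliding window and the actual aggregation Palmetto uses: when every off-diagonal NPMI is 0, each word's vector is one-hot, the sum is the all-ones vector, and every cosine equals 1/sqrt(10).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

N = 10  # number of top words
# Worst case: every off-diagonal NPMI is 0, so each word's context vector is
# one-hot (a word's NPMI with itself is 1).
vectors = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
total = [sum(col) for col in zip(*vectors)]  # the all-ones vector

# Average cosine of each one-hot vector against the sum vector.
coherence = sum(cosine(v, total) for v in vectors) / N
print(coherence)  # -> 0.31622776... = 1/sqrt(10)
```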
@provalis which corpus did you use? Can you please provide two example topics?
Thank you, I get it. It would be nice to test the new unsmoothed NPMI measure and recompute the correlations. Do you plan to update the code to integrate C_V*?
Yes, after making sure that the removal of the epsilon does not create other side effects, I would like to implement NPMI* and C_V* (maybe with different names) and recompute the values from the paper to make sure that their performance is not lowered by this change. However, my time for this project is very limited at the moment :disappointed:
When do you need the implementation and in which form do you need it? As java class or as command line parameter?
Please take all the time you need. In the meantime, we can still use the other measures. Actually, we've integrated Palmetto into our own Java project. I work with cgravier, who has upgraded Palmetto to work with a recent version of Lucene (see the discussion here: https://github.com/AKSW/Palmetto/issues/8).
According to my experimental results, solving #81 also solved this issue. Sorry for the very long time it took to find the cause.