Weird negative correlation between C_V and the other measures
Dear all,
I've tested the Palmetto measures on topics extracted from three different datasets and, every time, the C_V measure is clearly negatively correlated with the other measures (in particular, UCI and UMASS). The resulting rankings are totally different: a topic considered "good" by C_V is considered "bad" by the others.

Below are two examples extracted from 20 Newsgroups (10 top words). The values come from the Palmetto online app (http://palmetto.aksw.org/palmetto-webapp/).
- Really bad topic: `db bh mov mp cs si mf m4 mx mj` → C_V = 0.5880757872641666, UCI = -7.432080077660484, UMASS = -8.11957747860957
- Quite good topic: `president mr information 1993 national april states american year united` → C_V = 0.3779058142243831, UCI = -2.5019232533537896, UMASS = -3.443209220661342
What is weird is that I've checked Michael's paper, and all the measures are supposed to be maximized :-|. I'd be grateful if anyone can help me solve this intriguing issue...
Hi Velcin,
thanks for using Palmetto and for sharing these interesting examples. The bad topic seems to contain words that are very rare. Maybe this is a special case in which C_V behaves badly. However, that wouldn't explain the overall negative correlation :thinking:
Could you please give some more information on your preprocessing of the documents? From the quite good topic, I can already see that you kept numbers (e.g., 1993). Did you use lemmatization or stemming?
How many topics did you create?
What do you mean by "all the measures have to be maximized"? Do you mean that for all coherences a higher value is better?
My preprocessing: standard tokenization, lowercasing, removing punctuation (actually, I just kept letters and digits), and no lemmatization or stemming.
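For reference, a minimal Python sketch of the preprocessing described above (lowercase, keep only letters and digits, no stemming or lemmatization); the exact tokenizer used is an assumption:

```python
import re

def preprocess(text):
    """Lowercase, then keep only alphanumeric tokens (no stemming/lemmatization)."""
    # Replace every character that is not a lowercase letter or digit with a
    # space (this removes punctuation), then split on whitespace.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

print(preprocess("President Mr. Clinton, in April 1993, ..."))
# -> ['president', 'mr', 'clinton', 'in', 'april', '1993']
```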
For the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/), I set the number of topics to the true number of classes (here, 20). Below is the full list of topics (the probabilities are hidden):
topic 0 : article writes medical disease health patients cancer food doctor msg
topic 1 : space nasa writes article earth gov launch orbit shuttle system
topic 2 : president mr information 1993 national april states american year united
topic 3 : ax max g9v b8f a86 pl 145 1d9 0t 1t
topic 4 : image file ftp graphics files software pub data jpeg images
topic 5 : 0d cx w7 145 34u ah scx t7 uw lk
topic 6 : car writes article good cars power ve engine work time
topic 7 : 10 00 12 15 11 20 14 93 92 18
topic 8 : writes article sex homosexual cramer men moral gay objective morality
topic 9 : israel jews turkish armenian writes article israeli armenians jewish war
topic 10 : gun writes article government law guns state rights crime control
topic 11 : windows dos drive system writes card mac article apple problem
topic 12 : window file program server motif widget set display application output
topic 13 : people fire writes fbi children article koresh batf didn government
topic 14 : god jesus christian bible christ church christians writes faith life
topic 15 : key encryption chip government clipper writes keys article system security
topic 16 : db bh mov mp cs si mf m4 mx mj
topic 17 : writes year game team article games baseball players good league
topic 18 : mail university writes article ca internet email cs ac fax
topic 19 : don people time make good point things ve question fact
By "all the measures have to be maximized", I mean exactly that: for all coherences, a higher value is better, because they are based on variants of the PMI. That's what I understood from your paper ;).
I found similar negative correlations with both topic models I tested.

@provalis thanks for sharing your results. Can you please describe your preprocessing?
@Velcin from my point of view, the problem is caused by the difference between the preprocessing of your corpus and that of the Wikipedia documents I used for indexing. Since you didn't use lemmatization, your topics can contain word forms that cannot be found in the index. I wanted to take a deeper look into your example and check the counts of the single words and word pairs. Unfortunately, I haven't had the time to do that yet and most probably won't have it during this month. Sorry for that.
Note that the index used for calculating the coherence should always be created with exactly the same preprocessing as the topic modeling corpus. Otherwise the numbers might not be reliable.
Thank you for the answer, Michaël. By the way, I've created my own index with the exact same preprocessing as the one used for topic modeling... If you don't have the time, please tell me the easiest way to check the counts myself, to be sure there is no miscoding or anything else going wrong.
Did you use your own index for the numbers reported above? I thought you had used the webservice.
Sorry if I'm not being clear. The correlation matrix above was created based on my own index. But for the sake of comparison (and to be sure that my indexing wasn't wrong), I re-computed the measures for 2 topics using your demo app. The same negative correlation can be observed there. I acknowledge that I fed your index with unnormalized terms (e.g., states).
Finally, I had some time to take a deeper look into Velcin's two examples. The probabilities clearly support your impression that the bad topic should have a very low coherence while the good topic should have a higher value.
The pairs of the bad topic have much lower probabilities (in many cases they simply do not occur together):
| | bh | mov | mp | cs | si | mf | m4 | mx | mj |
|---|---|---|---|---|---|---|---|---|---|
| bh | 1.508E-4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mov | 0.0 | 5.652E-5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mp | 0.0 | 0.0 | 0.004 | 0.0 | 4.575E-6 | 4.199E-6 | 3.092E-6 | 0.0 | 2.213E-6 |
| cs | 0.0 | 0.0 | 0.0 | 2.636E-4 | 1.719E-6 | 1.267E-6 | 0.0 | 0.0 | 1.320E-6 |
| si | 0.0 | 0.0 | 4.575E-6 | 1.719E-6 | 0.001 | 1.390E-6 | 0.0 | 1.385E-6 | 1.954E-6 |
| mf | 0.0 | 0.0 | 4.199E-6 | 1.267E-6 | 1.390E-6 | 2.503E-4 | 0.0 | 0.0 | 2.028E-6 |
| m4 | 0.0 | 0.0 | 3.092E-6 | 0.0 | 0.0 | 0.0 | 3.033E-4 | 0.0 | 0.0 |
| mx | 0.0 | 0.0 | 0.0 | 0.0 | 1.385E-6 | 0.0 | 0.0 | 1.781E-4 | 0.0 |
| mj | 0.0 | 0.0 | 2.213E-6 | 1.320E-6 | 1.954E-6 | 2.028E-6 | 0.0 | 0.0 | 2.390E-4 |
The word pairs of the better topic have higher probabilities (apart from ("united", "mr") and ("states", "mr")):
| | mr | information | 1993 | national | april | states | american | year | united |
|---|---|---|---|---|---|---|---|---|---|
| mr | 0.015 | 4.158E-4 | 4.695E-4 | 0.001 | 9.616E-4 | 0.0 | 0.001 | 0.003 | 0.0 |
| information | 4.158E-4 | 0.031 | 6.499E-4 | 0.003 | 0.001 | 1.871E-6 | 0.002 | 0.005 | 2.145E-6 |
| 1993 | 4.695E-4 | 6.499E-4 | 0.026 | 0.003 | 0.001 | 0.0 | 0.002 | 0.006 | 9.183E-6 |
| national | 0.001 | 0.003 | 0.003 | 0.094 | 0.006 | 6.308E-6 | 0.011 | 0.025 | 2.712E-5 |
| april | 9.616E-4 | 0.001 | 0.001 | 0.006 | 0.061 | 3.307E-6 | 0.005 | 0.014 | 4.116E-5 |
| states | 0.0 | 1.871E-6 | 0.0 | 6.308E-6 | 3.307E-6 | 3.660E-5 | 9.422E-6 | 7.391E-6 | 0.0 |
| american | 0.001 | 0.002 | 0.002 | 0.011 | 0.005 | 9.422E-6 | 0.080 | 0.020 | 9.595E-6 |
| year | 0.003 | 0.005 | 0.006 | 0.025 | 0.014 | 7.391E-6 | 0.020 | 0.213 | 1.137E-4 |
| united | 0.0 | 2.145E-6 | 9.183E-6 | 2.712E-5 | 4.116E-5 | 0.0 | 9.595E-6 | 1.137E-4 | 3.536E-4 |
I think there is a numeric problem that leads to high NPMI values. I will dig deeper as soon as I have one or two free hours again :wink:
Thank you for your concern. Feel free to ask me if I can help you in any way.
Hi Velcin,
I think that the major problem of Palmetto's NPMI implementation is the epsilon (e). As described in the paper, the NPMI is implemented as
NPMI(W',W*) = log((P(W',W*) + e) / (P(W') * P(W*))) / (-log(P(W',W*) + e))
The problem with this implementation is that for very small probabilities the influence of e grows. This causes larger NPMI values for your "bad" topic.
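A small Python sketch illustrates the effect (the epsilon value and probabilities here are illustrative, not Palmetto's actual configuration, and the standard −log(·) normalization is assumed): for two pairs that never co-occur, the smoothed NPMI of the rare-word pair comes out closer to 0 than that of the frequent-word pair, so rare words are penalized less.

```python
import math

EPSILON = 1e-12  # assumed value for illustration

def npmi_smoothed(p_joint, p1, p2, e=EPSILON):
    """NPMI with additive smoothing, as in the formula above."""
    return math.log((p_joint + e) / (p1 * p2)) / -math.log(p_joint + e)

# Two word pairs that never co-occur (P(W',W*) = 0):
rare = npmi_smoothed(0.0, 1.5e-4, 5.7e-5)  # rare words, like the "bad" topic
common = npmi_smoothed(0.0, 1e-2, 1e-2)    # frequent words

# The rarer the words, the closer the smoothed NPMI gets to 0, although
# neither pair ever co-occurs.
print(rare, common)  # rare > common
```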
Unlike the PMI, the NPMI does not need the epsilon, so I tested an implementation without it:
NPMI(W',W*) = 0, if P(W',W*) = 0, P(W') = 0, or P(W*) = 0
NPMI(W',W*) = log(P(W',W*) / (P(W') * P(W*))) / (-log(P(W',W*))), otherwise
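A sketch of this epsilon-free variant (again assuming the standard −log normalization; the example probabilities are illustrative): non-co-occurring pairs now score exactly 0 instead of an epsilon-dependent value.

```python
import math

def npmi_unsmoothed(p_joint, p1, p2):
    """Epsilon-free NPMI: 0 when any probability is 0, per the piecewise definition above."""
    if p_joint == 0.0 or p1 == 0.0 or p2 == 0.0:
        return 0.0
    return math.log(p_joint / (p1 * p2)) / -math.log(p_joint)

# Non-co-occurring rare pair, like those in the "bad" topic:
print(npmi_unsmoothed(0.0, 1.5e-4, 5.7e-5))  # -> 0.0

# A pair that co-occurs 10x more often than independence would predict:
print(npmi_unsmoothed(1e-3, 1e-2, 1e-2))     # -> 0.333... (1/3, a positive association)
```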
Used in the C_V coherence (C_V*), it leads to the following results:
C_V*(db bh mov mp cs si mf m4 mx mj) = 0.3507252914369107
C_V*(president mr information 1993 national april states american year united) = 0.3387159339633883
Note that the lowest possible C_V* value for 10 top words is 0.3162277660168379, i.e., both topics are rated as bad. This can be seen by looking at the individual NPMI values:
| | bh | mov | mp | cs | si | mf | m4 | mx | mj |
|---|---|---|---|---|---|---|---|---|
| bh | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mov | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mp | 0.0 | 0.0 | 1.0 | 0.0 | 0.001 | 0.011 | 0.004 | 0.0 |
| cs | 0.0 | 0.0 | 0.0 | 1.0 | 0.010 | 0.047 | 0.0 | 0.0 |
| si | 0.0 | 0.0 | 0.001 | 0.010 | 1.0 | 0.007 | 0.0 | 0.012 |
| mf | 0.0 | 0.0 | 0.011 | 0.047 | 0.007 | 1.0 | 0.0 | 0.0 |
| m4 | 0.0 | 0.0 | 0.004 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| mx | 0.0 | 0.0 | 0.0 | 0.0 | 0.012 | 0.0 | 0.0 | 1.0 |
| mj | 0.0 | 0.0 | 0.003 | 0.050 | 0.014 | 0.072 | 0.0 | 0.0 |
Note that some of the word pairs have an NPMI > 0 because the words are used as abbreviations or top-level domains.
| | mr | information | 1993 | national | april | states | american | year | united |
|---|---|---|---|---|---|---|---|---|
| mr | 1.0 | 4.4E-4 | 2.3E-4 | 6.04E-4 | 2.0E-6 | 0.0 | 1.5E-4 | 3.2E-4 |
| information | 4.4E-4 | 1.0 | 0.001 | 0.001 | 8.4E-4 | 0.001 | 0.0010 | 0.002 |
| 1993 | 2.3E-4 | 0.001 | 1.0 | 0.001 | 6.6E-4 | 0.0 | 0.001 | 4.9E-4 |
| national | 6.0E-4 | 0.001 | 0.001 | 1.0 | 5.3E-4 | 0.002 | 0.006 | 0.003 |
| april | 2.0E-6 | 8.4E-4 | 6.6E-4 | 5.3E-4 | 1.0 | 9.6E-4 | 6.8E-5 | 7.1E-4 |
| states | 0.0 | 0.001 | 0.0 | 0.002 | 9.6E-4 | 1.0 | 0.010 | 2.3E-5 |
| american | 1.5E-4 | 0.001 | 0.001 | 0.006 | 6.8E-5 | 0.010 | 1.0 | 0.001 |
| year | 3.2E-4 | 0.002 | 4.9E-4 | 0.003 | 7.1E-4 | 2.3E-5 | 0.001 | 1.0 |
| united | 0.0 | 0.015 | 7.4E-6 | 3.8E-4 | 0.004 | 0.0 | 0.008 | 0.002 |
You can see that the NPMI values are mostly larger than 0 but still very low. From my point of view this is caused by the high probabilities of the single words: the word pairs do not occur much more often than would be expected if the words had no relation to each other.
Thus, C_V without the epsilon states that the first topic comprises unknown/rare words that are not very related to each other, while the second topic combines common words.
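The floor value of ~0.3162 for 10 top words can be reproduced with a toy calculation. This is a simplified sketch of the indirect confirmation idea behind C_V (cosine similarity between each word's vector of NPMI values and the sum of all such vectors), ignoring the sliding window and the actual aggregation Palmetto uses: when every off-diagonal NPMI is 0, each word's vector is one-hot, the sum is the all-ones vector, and every cosine equals 1/sqrt(10).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

N = 10  # number of top words
# Worst case: every off-diagonal NPMI is 0, so each word's context vector is
# one-hot (a word's NPMI with itself is 1).
vectors = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
total = [sum(col) for col in zip(*vectors)]  # the all-ones vector

# Average cosine of each one-hot vector against the sum vector.
coherence = sum(cosine(v, total) for v in vectors) / N
print(coherence)  # -> 0.31622776... = 1/sqrt(10)
```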
@provalis which corpus did you use? Can you please provide two example topics?
Thank you, I get it. It would be nice to test the new unsmoothed NPMI measure and recompute the correlations. Do you plan to update the code to integrate C_V*?
Yes, after making sure that the removal of the epsilon does not create other side effects, I would like to implement NPMI* and C_V* (maybe with different names) and recompute the values from the paper to make sure that their performance is not lowered by this change. However, my time for this project is very limited at the moment :disappointed:
When do you need the implementation and in which form do you need it? As java class or as command line parameter?
Please take all the time you need. In the meantime, we can still use the other measures. Actually, we've integrated Palmetto into our own Java project. I work with cgravier, who has upgraded Palmetto to work with a recent version of Lucene (see the discussion here: https://github.com/AKSW/Palmetto/issues/8).
According to my experimental results, solving #81 also solved this issue. Sorry for the very long time it took to find the cause.