
bug: c_v topic coherence depends on size of reference corpus

Open RutgerEttes opened this issue 3 years ago • 2 comments

I was calculating c_v topic coherence for several topic models with a reference corpus of 100,000 Wikipedia articles, getting fairly high coherence scores (around 0.65). For my final results I wanted to use more articles to avoid unnecessary bias from the sample, so I switched to 200,000 articles. When I did, the coherence score dropped to approximately 0.5. I figured this was the result of bias in the first 100,000 articles, so I calculated the topic coherence on the other 100,000 articles instead. This, however, also gave a coherence of approximately 0.65. I also tried 10,000 articles, which gave a score of around 0.75.

I tried calculating the topic coherence using Gensim's coherence functions, and their implementation does not show this defect. The c_v scores from Gensim are also much lower than those from tomotopy (around 0.35 vs. 0.65). I also checked whether the same effect exists for c_npmi coherence; it does not, and Gensim and tomotopy give similar (though not equal) c_npmi scores. I have not checked the other coherence measures.

All of this points to an issue with tomotopy's c_v topic coherence implementation.
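For context on why scores can shift with reference-corpus size at all: c_v is fairly involved (sliding-window counts plus cosine similarity over NPMI vectors), but the simpler c_npmi measure already shows the mechanism, since every probability in the NPMI formula is estimated from the reference sample. A minimal pure-Python sketch (not tomotopy's or Gensim's actual implementation) of document-level c_npmi:

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, docs):
    """Mean NPMI over all top-word pairs, with probabilities estimated
    from Boolean per-document co-occurrence in a reference corpus."""
    doc_sets = [set(d) for d in docs]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents containing all the given words.
        return sum(all(w in ds for w in words) for ds in doc_sets) / n

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)   # never co-occur: minimum NPMI
        elif p12 == 1.0:
            scores.append(1.0)    # always co-occur: maximum NPMI
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Toy reference corpus: "a" and "b" co-occur in 2 of 4 documents.
docs = [["a", "b"], ["a", "b"], ["a", "c"], ["d"]]
print(round(npmi_coherence(["a", "b"], docs), 3))  # ~0.415
```

Because NPMI normalizes PMI by -log p(w1, w2), its estimates are reasonably stable as the corpus grows, whereas a bug in how counts are pooled or normalized would surface exactly as a size-dependent score.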

RutgerEttes avatar Jun 07 '21 02:06 RutgerEttes

@RutgerEttes I'm surprised you were able to get tomotopy's coherence functions to work on a model/corpus with 100k articles; do you have any replication code? When I try to run coherence on 100k cleaned Wikipedia articles, it takes hours and sometimes silently fails after consuming all 64 GB of RAM on my machine...

Did you find that the coherence score for a set of topic words trained with tomotopy on a smaller corpus (10k articles) increased slightly under Gensim's c_v coherence when using a larger reference corpus (100k, then 200k)? I found the same trend for c_npmi: the coherence score increased with more articles. I find this behavior counterintuitive: why would topics trained on a smaller corpus appear more coherent when evaluated against a larger reference corpus? Shouldn't the additional articles have different word-usage patterns that dilute the topics' coherence? I do a great deal of preprocessing, like removing named entities and using a subset of Wikipedia articles filtered by popularity based on traffic stats, and I fit the LDA model using IDF term weighting (see the notebook link below). Could that explain why the Gensim coherence scores increase with reference corpus size?

In this notebook I train a tomotopy model on 10k articles drawn from a cleaned Wikipedia dataset, then evaluate coherence using Gensim's coherence functions on reference corpora of 10k, 100k, and 200k articles. I didn't fit Gensim's LDA model because I find it very slow. I wanted to run the same comparison with tomotopy's Corpus utility and its coherence functions, but I ran out of patience (and RAM? usage goes to 100%, then suddenly drops to ~0% while the notebook cell remains busy...). I also trained a tomotopy LDA model on 100k articles; it had higher Gensim c_v coherence on a 200k-article reference corpus than the model trained on 10k articles, but a lower coherence score on its own training set.

I'm hoping the sweet spot for the types of corpora I usually work with, which fit comfortably in memory, is to use tomotopy's models with Gensim's coherence functions. Translating between them can be annoying, but the code in the attached notebook provides some examples.
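The translation between the two libraries can be sketched roughly as below. This is a hedged sketch, not the notebook's exact code: it assumes a trained `tomotopy.LDAModel` named `mdl` and a tokenized reference corpus `texts` (a list of token lists), and relies on the standard public APIs `LDAModel.get_topic_words` and Gensim's `CoherenceModel`.

```python
def topics_from_tomotopy(mdl, top_n=10):
    """Top-n words per topic as plain string lists (dropping probabilities).
    `mdl.get_topic_words(k, top_n)` returns (word, prob) pairs for topic k."""
    return [[word for word, _ in mdl.get_topic_words(k, top_n=top_n)]
            for k in range(mdl.k)]

def gensim_cv_coherence(topics, texts):
    """Score the given topic word lists with Gensim's c_v implementation."""
    from gensim.corpora import Dictionary
    from gensim.models.coherencemodel import CoherenceModel
    dictionary = Dictionary(texts)
    cm = CoherenceModel(topics=topics, texts=texts,
                        dictionary=dictionary, coherence='c_v')
    return cm.get_coherence()

# Usage (assuming mdl and texts are already built):
# score = gensim_cv_coherence(topics_from_tomotopy(mdl), texts)
```

Passing pre-extracted `topics` to `CoherenceModel` avoids fitting a Gensim LDA model at all; only the word lists and the reference texts cross the library boundary.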

aaronjbecker avatar Dec 06 '21 16:12 aaronjbecker

Same question here... my coherence takes centuries to calculate...

lkcao avatar May 06 '22 04:05 lkcao