tomotopy

Question: Calculating Coherence. What words are expected as Targets?

Open hhagedorn opened this issue 3 years ago • 8 comments

Hello @bab2min,

I am trying to use your implementation of the C_v coherence measure to evaluate both topic models that are included in tomotopy and some that are not. To do this, I generated a tomotopy.utils.Corpus to initialise the Coherence class.

But I am a little confused by the targets parameter. Does it expect the whole vocabulary of the Corpus (or at least the vocabulary that is relevant for the coherence, e.g. all words from LDAModel.used_vocabs), or only the set of words that I later want to check for coherence (e.g. all words in my to-be-evaluated topics)?

I am not exactly sure how to understand the sentence "Only words that are provided as targets are included in probability estimation."

Thank you already in advance!

hhagedorn avatar May 18 '21 06:05 hhagedorn

Hi @hhagedorn, sorry for the confusion caused by the unclear documentation. For targets, the latter is correct: you just pass the set of words in the to-be-evaluated topics as targets.

targets is required for computational efficiency. Calculating co-occurrences for all words in LDAModel.used_vocabs consumes a lot of time and memory. If you know which words will be evaluated for coherence, Coherence can calculate co-occurrences for those words only instead of for the whole vocabulary. This is what the targets argument is for.
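A minimal sketch of what this looks like in practice, assuming the documented signature Coherence(corpus, coherence=..., targets=...) and get_score(words=...). The helper names collect_targets and score_topics are hypothetical, and the topics are made-up word lists standing in for the output of any topic model:

```python
def collect_targets(topics):
    """Return the sorted distinct words that appear in any topic."""
    return sorted({word for topic in topics for word in topic})


def score_topics(corpus, topics, coherence="c_v"):
    """Score each topic (a list of words) against a tomotopy.utils.Corpus.

    Hypothetical helper: it assumes Coherence accepts a Corpus plus an
    explicit `targets` iterable, as described in the answer above.
    """
    # Imported lazily so collect_targets works without tomotopy installed.
    from tomotopy.coherence import Coherence

    targets = collect_targets(topics)
    # Only the words in `targets` take part in the probability estimation.
    coh = Coherence(corpus, coherence=coherence, targets=targets)
    return [coh.get_score(words=topic) for topic in topics]


# Made-up topics; in real use these are the top words of each topic.
topics = [["cat", "dog", "pet"], ["stock", "market", "pet"]]
targets = collect_targets(topics)
print(targets)
```

Calling score_topics(corpus, topics) with a real Corpus would return one score per topic. Words shared between topics (here "pet") only need to be listed once in targets.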

I'll add this explanation to the documentation in the next update. Thank you for your good question!

bab2min avatar May 24 '21 09:05 bab2min

Hello @bab2min - thank you for the time you put into maintaining tomotopy!

I'm having some trouble that might be similar to @hhagedorn : I'm calculating the c_v coherence on a model that had earlier been trained and saved to disk, like this:

import tomotopy.coherence  # also binds the top-level tomotopy package
mdl = tomotopy.LDAModel.load("saved_model.bin")
coh = tomotopy.coherence.Coherence(mdl, coherence='c_v')

On the last line, I'm not specifying a targets value, only the model. I understand it might be slow because of the large number of targets (about 20000 unique tokens), but my concern is that it sometimes crashes or hangs, even with the same model on the same machine. If I specify u_mass, it calculates the coherence within a few minutes, but c_v stalls for hours. Sometimes it crashes with just "Killed" and sometimes I see bad_alloc, so I suppose the problem is deep inside the coherence code. I ran it under mprof (memory profiler) and it used only about 1.1GB, nowhere near the memory limit. I get different behavior at different times on the same model and the same machine.
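For a rough sense of scale, here is a back-of-the-envelope sketch of how many word pairs a pairwise co-occurrence table would have to cover. It takes the 20000-token figure above at face value; whether the library really treats the whole vocabulary as targets when only a model is passed is not confirmed in this thread, and the 20 topics x 10 top words figure is a hypothetical comparison:

```python
def unordered_pairs(n):
    """Number of distinct unordered word pairs among n target words."""
    return n * (n - 1) // 2


# Every unique token treated as a target (figure taken from the post).
full_vocab_pairs = unordered_pairs(20000)

# Hypothetical alternative: 20 topics x 10 top words, assuming no overlap,
# so this is an upper bound for that setup.
topic_word_pairs = unordered_pairs(20 * 10)

print(full_vocab_pairs)  # 199990000
print(topic_word_pairs)  # 19900
```

The pair count grows quadratically with the number of targets, which is why restricting targets to the words actually being evaluated matters so much.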

tomotopy.isa returns 'avx2' and I am using an Intel i7-11800H, Python 3.8.10, and Ubuntu 20.04 on WSL2 under Windows 11. I get similar behavior when running on GCP or AWS. What would you recommend here?

Thank you!

benreaves avatar Jan 27 '22 09:01 benreaves

Hi @benreaves, there appear to be some bugs in the current implementation of tomotopy.coherence. However, I could not reproduce a similar situation with my test set, so it is difficult to analyze the details. If possible, could you please share the saved_model.bin file that causes the crashes? It would be a great help in figuring out the cause of the bug.

bab2min avatar Feb 03 '22 16:02 bab2min

Yes I will send it later today. Thank you for investigating!

benreaves avatar Feb 03 '22 16:02 benreaves

Yes, here it is! [1] The zip file contains

  • the model file (in folder 20220126065224i0)
  • coherence_later.py, which I use for calculating the coherence on saved models (did I do it correctly?)
  • results.csv containing a list of saved models. Only the first one is included in this zipfile (but then that one does cause the hang).

[1] https://drive.google.com/file/d/1s9WBQ_dxHV55qpy-mzSyB1tGpPuX7mhG/view?usp=sharing

benreaves avatar Feb 03 '22 21:02 benreaves

BTW, it doesn't always give the same error: sometimes it's "bad_alloc", sometimes it just says "Killed" and exits with no traceback, and sometimes it just hangs for at least 8 hours. I really appreciate your looking into this!

benreaves avatar Feb 03 '22 21:02 benreaves

@benreaves Thank you for sharing the files and details. I'll look into them!

bab2min avatar Feb 04 '22 03:02 bab2min

This issue is no longer important. Reasons:

  1. c_npmi seems to work fine, so I can use that instead of c_v
  2. c_v should be avoided, according to this serious issue from 2018: https://github.com/dice-group/Palmetto/issues/13
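For readers unfamiliar with the measure, here is a minimal sketch of the standard NPMI definition that c_npmi is named after, computed from boolean document co-occurrence. This is an illustration only, not tomotopy's implementation: the library's windowing and smoothing may differ, and the toy documents and the eps value are assumptions made for the example:

```python
import math
from itertools import combinations


def npmi(docs, w1, w2, eps=1e-12):
    """NPMI of two words, using the fraction of documents containing them.

    Both words must occur in at least one document, otherwise the
    denominator p1 * p2 is zero.
    """
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    # eps keeps the logarithm finite when the words never co-occur.
    return math.log((p12 + eps) / (p1 * p2)) / -math.log(p12 + eps)


def mean_npmi(docs, words):
    """Average NPMI over all unordered pairs of the given words."""
    pairs = list(combinations(words, 2))
    return sum(npmi(docs, a, b) for a, b in pairs) / len(pairs)


# Toy corpus: each document is a set of words.
docs = [{"cat", "dog"}, {"cat", "dog"}, {"stock", "market"}, {"stock", "market"}]

together = npmi(docs, "cat", "dog")   # words that always co-occur
apart = npmi(docs, "cat", "stock")    # words that never co-occur
print(together, apart)
```

Values lie roughly between -1 (never together) and 1 (always together), so a topic whose top words tend to appear in the same documents gets a higher average.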

However, I am still having a numerical problem in add_doc() but it belongs in a new thread: #159

benreaves avatar Feb 09 '22 07:02 benreaves