
Questions about choosing coherence measures

Open juneMJ opened this issue 1 year ago • 2 comments

Hello, I'm trying several models with different coherence measures, and I have a few questions.

  1. Is the SLIDING_WINDOWS size fixed, or can I vary it within a range to compare which size works best?
  2. I'm modeling social media posts, whose lengths are either long or very short. In this case, which would be better for probability estimation: DOCUMENT or SLIDING_WINDOWS?
  3. For the Pachinko Allocation model, some of the per-topic C_V coherence values come out as nan. What could be the problem?

Thank you very much.

juneMJ, Jul 06 '22 14:07

Hi @juneMJ

The coherence measures are defined here: https://github.com/bab2min/tomotopy/blob/d30964ce0610a5e34d3645cfc8c26d99536cac03/tomotopy/coherence.py#L62-L67 The second value of each entry is the default sliding-window size. If you don't pass the window_size argument to coherence.Coherence(), these defaults are used. To find the best window_size, you would need to run experiments that measure how well each coherence score, computed with a given window_size, matches human evaluation. That is costly, so I recommend using the default values suggested in several papers.
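To make the effect of the window size more concrete, here is a toy, pure-Python sketch of the two probability estimation schemes (whole documents versus sliding windows) feeding an NPMI score. It is only an illustration under simplifying assumptions and is not tomotopy's implementation: the helper names `document_units`, `window_units` and `npmi`, the `eps` smoothing, and the rule that a document shorter than the window counts as a single unit are all hypothetical choices made for this example.

```python
import math

def document_units(docs):
    """Document-based estimation: every document is one counting unit."""
    return [set(doc) for doc in docs]

def window_units(docs, window_size):
    """Sliding-window estimation: every window of `window_size` tokens is
    one counting unit. Assumption for this sketch: a document no longer
    than the window contributes exactly one unit (the whole document)."""
    units = []
    for doc in docs:
        if len(doc) <= window_size:
            units.append(set(doc))
        else:
            for start in range(len(doc) - window_size + 1):
                units.append(set(doc[start:start + window_size]))
    return units

def npmi(units, w1, w2, eps=1e-12):
    """Normalized PMI of two words, estimated from boolean counting units."""
    n = len(units)
    p1 = sum(w1 in u for u in units) / n
    p2 = sum(w2 in u for u in units) / n
    p12 = sum((w1 in u) and (w2 in u) for u in units) / n
    if p1 == 0 or p2 == 0:
        raise ValueError("both words must occur at least once")
    pmi = math.log((p12 + eps) / (p1 * p2))
    return pmi / -math.log(p12 + eps)

# One long post and two very short ones, as in a social media corpus.
docs = [
    ["cat", "sat", "mat", "dog", "ran", "park", "bird", "flew", "sky", "fish"],
    ["cat", "dog"],
    ["bird", "sky"],
]

doc_score = npmi(document_units(docs), "cat", "fish")
print("DOCUMENT:", round(doc_score, 3))
for size in (3, 5, 10):
    score = npmi(window_units(docs, size), "cat", "fish")
    print("SLIDING_WINDOWS, window_size =", size, ":", round(score, 3))
```

In this toy corpus, "cat" and "fish" share the long post but are far apart in it, so the document-based estimate counts them as co-occurring while small windows do not. Once the window is at least as long as the longest post, the two schemes produce the same counting units here.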

I think it is enough to use the presets ('u_mass', 'c_uci', 'c_npmi') rather than specific combinations. 'c_v' is not recommended since it has some issues (#121, #126).

As for the PAModel, there seems to be a bug in the implementation of the Coherence module. I'll look into this further.

bab2min, Jul 09 '22 04:07

Thank you @bab2min for the clarifications!

juneMJ, Jul 09 '22 13:07