tomotopy
Questions about choosing coherence measures
Hello, I'm trying several models with different coherence measures, but I have some questions I need to understand.
- Is the value of SLIDING_WINDOWS fixed, or can I change it within a range so I can compare which size is best?
- I'm modeling social media posts, so the posts are either long or very short. In this case, which would be better for probability estimation: DOCUMENT or SLIDING_WINDOWS?
- For the Pachinko Allocation model, some of the per-topic C_V coherence values come out as nan. What could be the problem?
Thank you very much.
Hi @juneMJ
The coherence measures are actually defined as follows:
https://github.com/bab2min/tomotopy/blob/d30964ce0610a5e34d3645cfc8c26d99536cac03/tomotopy/coherence.py#L62-L67
The second value in each entry is the default sliding window size. If you don't provide the window_size argument to coherence.Coherence(), these defaults are used. To find the best window_size, you would need to run experiments that check how well each coherence score, at a given window_size, matches human evaluation. That is costly, so I recommend using the default values suggested in several papers.
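To make the difference between DOCUMENT and SLIDING_WINDOWS estimation concrete, here is a minimal pure-Python sketch. It is a simplified illustration of the usual boolean sliding window idea, not tomotopy's actual implementation; the function names `windows`, `estimate`, and `npmi` are made up for this example. Note that a post shorter than the window yields a single window, so for very short posts the two estimators behave the same.

```python
import math

def windows(tokens, size):
    """Yield the word sets used as counting units for one document.

    size=None means document-level estimation (the whole document is one unit).
    Otherwise every run of `size` consecutive tokens is one unit; a document
    no longer than `size` contributes a single unit.
    """
    if size is None or len(tokens) <= size:
        yield set(tokens)
        return
    for i in range(len(tokens) - size + 1):
        yield set(tokens[i:i + size])

def estimate(docs, w1, w2, size=None):
    """Return (P(w1), P(w2), P(w1, w2)) as fractions of counting units."""
    total = n1 = n2 = n12 = 0
    for doc in docs:
        for unit in windows(doc, size):
            total += 1
            has1, has2 = w1 in unit, w2 in unit
            n1 += has1
            n2 += has2
            n12 += has1 and has2
    return n1 / total, n2 / total, n12 / total

def npmi(docs, w1, w2, size=None, eps=1e-12):
    """Normalized PMI of a word pair; assumes both words occur at least once."""
    p1, p2, p12 = estimate(docs, w1, w2, size)
    return math.log((p12 + eps) / (p1 * p2)) / -math.log(p12 + eps)

# One very short post and one longer post (toy data).
docs = [["cat", "dog"], ["cat", "x", "x", "x", "dog"]]

doc_level = estimate(docs, "cat", "dog", size=None)
narrow = estimate(docs, "cat", "dog", size=2)
wide = estimate(docs, "cat", "dog", size=5)
```

With document-level counting, "cat" and "dog" co-occur in both posts. With a window of 2 they only co-occur in the short post, because in the longer post they are too far apart. Once the window is as long as the longest post, the result matches the document-level estimate again.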
I think it is enough to use the presets ('u_mass', 'c_uci', 'c_npmi') rather than specific combinations. 'c_v' isn't recommended since it has some issues (#121, #126).
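As a rough sketch of how the presets could be compared on a trained model, something like the following should work. It assumes the `Coherence(corpus, coherence=..., window_size=...)` constructor and `get_score()` method from tomotopy's coherence module; the helper name `score_presets` and its `coherence_cls` parameter are hypothetical and only there so the loop can be tried without tomotopy installed.

```python
def score_presets(mdl, presets=("u_mass", "c_uci", "c_npmi"),
                  window_size=None, coherence_cls=None):
    """Return {preset: average coherence} for a trained topic model.

    window_size=None leaves each preset's default window size in place.
    coherence_cls is a hypothetical hook; by default tomotopy's class is used.
    """
    if coherence_cls is None:
        # Imported lazily so the helper can be defined without tomotopy.
        from tomotopy.coherence import Coherence as coherence_cls

    # Only pass window_size when the caller explicitly overrides the default.
    kwargs = {} if window_size is None else {"window_size": window_size}

    scores = {}
    for preset in presets:
        coh = coherence_cls(mdl, coherence=preset, **kwargs)
        scores[preset] = coh.get_score()  # average over all topics
    return scores
```

For a trained model `mdl`, `score_presets(mdl)` uses the defaults, while `score_presets(mdl, presets=("c_npmi",), window_size=20)` tries a different window size for a single preset.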
As for the PAModel, there seems to be a bug in the implementation of the Coherence module. I'll look into this further.
Thank you @bab2min for the clarifications!