`thresh` might not be working right
I managed to make the language model work on my data. Basically, I get `-inf` all over the place if `thresh > 1` (I was running with `thresh=5`; setting `thresh=1` makes it work well).
The vocab items that end up with `p_corpus = -inf` when `thresh > 1` are hapax legomena (terms that occur only once) in the corpus.
I haven't dug into the code yet :blush:
As the docstring of `ParsimoniousLM.top` states:

> Get the top `k` terms of a document `d` and their log probabilities.
These `-inf`s are just zero probabilities on the logarithmic scale, so this is expected behavior. The transformation back to the linear scale is left up to the user (e.g. with `np.exp`).
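For example, with NumPy (the log probabilities here are made up for illustration):

```python
import numpy as np

log_probs = np.array([-1.2, -2.3, -np.inf])  # example top-k output
lin_probs = np.exp(log_probs)                # back to the linear scale
# np.exp(-inf) == 0.0, so -inf simply means "zero probability"
```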
Setting `thresh > 1` doesn't have significant performance advantages as far as I can tell, but it may be desirable to reserve more probability mass for frequent words by chopping off the long tail of rare terms.
(I don't expect the OP was still waiting for an answer more than 5 years after asking, but others might be :smile:)