weighwords

`thresh` might not be working right

Open vene opened this issue 12 years ago • 1 comment

I managed to make the language model work on my data. Basically I get -inf all over the place if thresh > 1 (I was running with thresh=5, setting thresh=1 makes it work well).

The vocab items that end up with p_corpus = -inf when thresh>1 are hapax legomena in the corpus.

I didn't dig into the code yet :blush:

vene · Aug 18 '13 14:08

As the docstring of ParsimoniousLM.top states:

> Get the top k terms of a document d and their log probabilities.

These -infs are just zero probabilities on the logarithmic scale, so this is expected behavior. The transformation to the linear scale is left up to the user (e.g. with np.exp).
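To make the log-scale point concrete, here is a minimal sketch using only NumPy. The array `log_probs` is made-up illustration data, not output from weighwords:

```python
import numpy as np

# Hypothetical log probabilities; the last entry stands for a term
# whose probability is exactly zero.
log_probs = np.array([np.log(0.5), np.log(0.25), -np.inf])

# Back to the linear scale: exp(-inf) is exactly 0.0.
probs = np.exp(log_probs)

# Optionally drop the zero-probability terms before further processing.
finite = np.isfinite(log_probs)
nonzero_probs = probs[finite]
```

So a `-inf` entry is not an error value; it simply maps back to a probability of 0.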

Setting thresh > 1 doesn't have significant performance advantages as far as I can tell, but it may be desirable to reserve more probability mass for frequent words by chopping off the long tail.
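For readers wondering how a frequency threshold can produce these `-inf` values, here is a rough sketch. It assumes that terms with a corpus count below `thresh` have their count set to zero before normalising; the helper `corpus_log_probs` is hypothetical and is not the actual weighwords implementation:

```python
import numpy as np

def corpus_log_probs(counts, thresh):
    """Hypothetical helper: log corpus probabilities after zeroing rare terms."""
    cf = np.asarray(counts, dtype=float)
    # Assumption: counts below the threshold are dropped to zero.
    cf = np.where(cf < thresh, 0.0, cf)
    # log(0) is -inf; silence the divide-by-zero warning NumPy would emit.
    with np.errstate(divide="ignore"):
        return np.log(cf) - np.log(cf.sum())

counts = [10, 5, 1]  # the last term is a hapax legomenon

lp_thresh_1 = corpus_log_probs(counts, thresh=1)  # all entries finite
lp_thresh_5 = corpus_log_probs(counts, thresh=5)  # hapax becomes -inf
```

Under this assumption, the remaining terms share the full probability mass when `thresh=5`, which is the "chopping off the long tail" effect described above.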

(I don't expect the OP was still waiting for an answer more than 5 years after asking, but others might be :smile:)

aolieman · May 10 '19 19:05