maxent LMs

Open danpovey opened this issue 8 years ago • 5 comments

Another issue for anyone who's watching this project: it would be nice, as an additional baseline for the paper, to try maxent LMs. Can someone figure out how to do this on, say, Switchboard or tedlium?

danpovey avatar Jun 04 '16 00:06 danpovey

... I think the latest version of SRILM supports them, and they're supposed to be a little better than regular Kneser-Ney LMs.

danpovey avatar Jun 04 '16 00:06 danpovey

FYI, on a news 1.5GB corpus, I get: Order 3 Order 4 srilm size ppl size ppl Unpruned 767,2 92,18 2071,7 66,86 Maxent 702,1 97,09 1952,8 70,33

not that good then

vince62s avatar Jun 30 '16 19:06 vince62s

I don't really understand what you are saying here, can you please format more clearly and use the English standard for decimals i.e. dot not comma?

I found the reason for the crash with 4-gram pruning that you found before: states with no counts were being discarded when we need to keep the discount amount. The fix is not a one-liner; I'll work on it today. It would affect even the unpruned perplexities.

Dan

danpovey avatar Jun 30 '16 19:06 danpovey

Yeah, sorry, copy-paste from Excel.

Order 3
srilm standard: size=767.2 MB, ppl=92.18
srilm maxent: size=702.1 MB, ppl=97.09

Order 4
srilm standard: size=2071.7 MB, ppl=66.86
srilm maxent: size=1952.8 MB, ppl=70.33

The corpus is "French news shuffle 2014", a text file of about 1.5 GB; I held out 10k sentences as a dev set. Just for info, the order-4 maxent run took 2.5 hours and up to 70 GB of RAM.
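To make the comparison easier to read, the relative differences implied by the figures reported above can be worked out with a few lines of Python. This is only a minimal sketch: the names `results`, `relative_change` and `summarize` are made up for illustration and are not part of SRILM or pocolm, and the numbers are copied from this comment, not re-measured.

```python
# Figures copied from the comment above: (model size in MB, dev-set perplexity).
# The dictionary layout is an illustrative assumption, not an SRILM output format.
results = {
    3: {"standard": (767.2, 92.18), "maxent": (702.1, 97.09)},
    4: {"standard": (2071.7, 66.86), "maxent": (1952.8, 70.33)},
}


def relative_change(new, old):
    """Return the change from old to new as a percentage of old."""
    return 100.0 * (new - old) / old


def summarize(results):
    """Map each n-gram order to (size change %, perplexity change %) for maxent vs. standard."""
    rows = {}
    for order, models in sorted(results.items()):
        std_size, std_ppl = models["standard"]
        me_size, me_ppl = models["maxent"]
        rows[order] = (
            relative_change(me_size, std_size),
            relative_change(me_ppl, std_ppl),
        )
    return rows


if __name__ == "__main__":
    for order, (d_size, d_ppl) in summarize(results).items():
        print(f"order {order}: size {d_size:+.1f}%, ppl {d_ppl:+.1f}%")
```

By these figures, the maxent models are roughly 6 to 8% smaller but about 5% worse in perplexity at both orders.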

vince62s avatar Jun 30 '16 19:06 vince62s

What I am trying to say here is that these results are somewhat surprising, because when I ran it on the cantab-tedlium text corpus (entropy filtered), maxent gave better results. But then I read Tanel's paper on maxent, and the improvements there were not so obvious.

vince62s avatar Jun 30 '16 19:06 vince62s