Vincent Nguyen

Results 123 comments of Vincent Nguyen

Btw, until recently I didn't know this, but it works fine. E.g. Poco and srilm are in line with the scenario out-of-domain = cantab text, in-domain =...

If I may suggest, a cleaning option would be nice, because right now it leaves behind a bunch of files that take up a lot of space.

Biggest contributor is definitely the work folder in optimize_vocabsize_order. If I am not mistaken it can be cleaned after the second call of Sorry I don't write in python...

If you need another programming subject, there is one thing that could be very useful. Remember, at the beginning of the project we were talking about target size for the...

Dan, I am sure you applied the min-counts to order 3 and above to replicate the SRILM behavior, but I really think also pruning the lower orders, i.e. unigrams and bigrams...

Well, my comment was about unigrams and bigrams... anyway, this can be done differently.

After last night's fix it's running fine now. I am running some perplexity and LM size measurements right now to compare various situations.

FYI, on a news 1.5 GB corpus, I get:
srilm      Order 3          Order 4
           size    ppl      size     ppl
Unpruned   767,2   92,18    2071,7   66,86
Maxent     702,1   97,09    1952,8   70,33
not that good...

Yeah, sorry, copy-paste from Excel.
Order 3: srilm standard size=767.2 MB, ppl=92.18; srilm maxent size=702.1 MB, ppl=97.09
Order 4: srilm standard size=2071.7 MB, ppl=66.86; srilm maxent...

What I am trying to say here is that these results are somewhat surprising, because when I ran it on the cantab-tedlium text corpus (entropy filtered), maxent gave better results....