training corpus stuck

Open CuriousDeepLearner opened this issue 5 years ago • 11 comments

I tried to train a language model on a corpus, but it seems to get stuck at the very beginning, and I couldn't find the cause. The command was:

bzcat clean_corpus.tar.bz2 | python process.py | kenlm/build/bin/lmplz -S 8G -o 5 > spanish_5gram.arpa

It gets stuck at this step:

=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal. Using slower read() instead of mmap(). No progress bar.
tcmalloc: large alloc 1511432192 bytes == 0x56104f802000 @ 0x7f70271e31e7 0x56104d4847e2 0x56104d419368 0x56104d3f81f6 0x56104d3e40d6 0x7f702537cb97 0x56104d3e5b1a
tcmalloc: large alloc 7053344768 bytes == 0x5610a996c000 @ 0x7f70271e31e7 0x56104d4847e2 0x56104d46f6ca 0x56104d4700e8 0x56104d3f8213 0x56104d3e40d6 0x7f702537cb97 0x56104d3e5b1a

CuriousDeepLearner avatar Mar 14 '19 13:03 CuriousDeepLearner

Does your python program terminate? If you replace python process.py with cat does it work?

kpu avatar Mar 14 '19 14:03 kpu
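
The question above matters because lmplz keeps reading until the upstream program closes its output. A minimal sketch of a filter that reads stdin line by line and exits at end of input could look like this. The real process.py is not shown in this thread, so `normalize` and `stream` are hypothetical names, and the lowercasing is only a stand-in for whatever preprocessing it actually does.

```python
import sys


def normalize(line):
    """Lowercase and collapse whitespace (placeholder for real preprocessing)."""
    return " ".join(line.lower().split())


def stream(lines):
    """Yield one cleaned line per non-empty input line, without buffering the corpus."""
    for line in lines:
        text = normalize(line)
        if text:
            yield text


# Skip the loop when nothing is piped in, so the script never sits waiting on a terminal.
if __name__ == "__main__" and not sys.stdin.isatty():
    for text in stream(sys.stdin):  # the loop ends at EOF, so the process exits
        sys.stdout.write(text + "\n")
```

Because the loop ends when stdin reaches end of file, the process exits and the next stage of the pipeline sees the end of its input.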

No, nothing changes. It still gets stuck at this point. I followed this tutorial: https://yidatao.github.io/2017-05-31/kenlm-ngram/

CuriousDeepLearner avatar Mar 15 '19 14:03 CuriousDeepLearner

@kpu For example, if I run it on a one-line corpus:

marines están habilitando un emplazamiento donde reagrupar a unos digito digito digito combatientes de al qaida susceptibles de rendirse o de caer prisioneros

The problem stays the same. I even tried on Colab and hit the same problem.

CuriousDeepLearner avatar Mar 15 '19 17:03 CuriousDeepLearner

Have you run

kenlm/build/bin/lmplz -S 8G -o 5 <README.md  >spanish_5gram.arpa

And how much RAM do you have?

kpu avatar Mar 15 '19 20:03 kpu
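
The `<` and `>` in the command above connect the input file to lmplz's standard input and the ARPA file to its standard output. A rough Python equivalent is sketched below. `run_redirected` and `lmplz_args` are hypothetical helper names, the binary path is the one used in this thread, and only the `-S` and `-o` flags already shown here are passed.

```python
import subprocess


def lmplz_args(binary="kenlm/build/bin/lmplz", order=5, memory="8G"):
    """Build the argument list used in this thread: lmplz -S <memory> -o <order>."""
    return [binary, "-S", memory, "-o", str(order)]


def run_redirected(args, input_path, output_path):
    """Run args with stdin read from input_path and stdout written to output_path.

    This mirrors `command <input_path >output_path` in the shell.
    """
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        # check=True raises CalledProcessError if the command exits non-zero.
        result = subprocess.run(args, stdin=src, stdout=dst, check=True)
    return result.returncode
```

Assuming the file names from this thread, a call would be `run_redirected(lmplz_args(), "README.md", "spanish_5gram.arpa")`.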

For a corpus in a plain text file, I think I found why it didn't work: the command was missing the < before text.txt and the > after it. So it must be:

kenlm/build/bin/lmplz -S 8G -o 5 <text.txt >spanish_5gram.arpa

But if the corpus is compressed, for example in a .tar file, I don't know how to fix it. @kpu, do you have any idea? How could we train the model without uncompressing the file first? This is the command I tried:

bzcat clean_corpus.tar.bz2 | python process.py | kenlm/build/bin/lmplz -S 8G -o 5 > spanish_5gram.arpa

CuriousDeepLearner avatar Mar 15 '19 22:03 CuriousDeepLearner
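
On the compressed case: bzcat only undoes the bzip2 layer, so for a .tar.bz2 file its output is still a tar archive (file headers and padding included) rather than plain text. One possible way to stream just the text without unpacking anything to disk, not confirmed by anyone in this thread, is Python's tarfile module in stream mode. `iter_tar_lines` is a hypothetical name, and UTF-8 is an assumption about the corpus encoding.

```python
import sys
import tarfile


def iter_tar_lines(path, encoding="utf-8"):
    """Yield decoded text lines from every regular file in a .tar.bz2 archive."""
    # "r|bz2" reads the archive as a forward-only stream: nothing is extracted
    # to disk, and members have to be consumed in the order they appear.
    with tarfile.open(path, mode="r|bz2") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            handle = archive.extractfile(member)
            for raw in handle:
                # Assumed encoding; undecodable bytes are replaced, not fatal.
                yield raw.decode(encoding, errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    for line in iter_tar_lines(sys.argv[1]):
        sys.stdout.write(line)
```

Saved under a hypothetical name such as tar_lines.py, it could take the place of bzcat in the pipeline: python tar_lines.py clean_corpus.tar.bz2 | python process.py | kenlm/build/bin/lmplz -S 8G -o 5 > spanish_5gram.arpa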

Does this work?

cat README.md |build/bin/lmplz --discount_fallback -o 5 >/dev/null

kpu avatar Mar 15 '19 22:03 kpu

@kpu No. It doesn't change anything...

CuriousDeepLearner avatar Mar 18 '19 08:03 CuriousDeepLearner

What does it print?

kpu avatar Mar 18 '19 09:03 kpu

It prints like before:

=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal. Using slower read() instead of mmap(). No progress bar

CuriousDeepLearner avatar Mar 18 '19 09:03 CuriousDeepLearner

Something doesn't smell right and I'm unable to reproduce this. Is this running on Windows or something?

kpu avatar Mar 28 '19 14:03 kpu

I just ran into the same problem. I'm running the program on Ubuntu.

GingerNg avatar Apr 03 '19 13:04 GingerNg