kenlm
training corpus stuck
I tried to train a language model on a corpus, but it seems to get stuck at the beginning. I couldn't work out the cause.
bzcat clean_corpus.tar.bz2 | python process.py | kenlm/build/bin/lmplz -S 8G -o 5 > spanish_5gram.arpa
It gets stuck at this step:
=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal. Using slower read() instead of mmap(). No progress bar.
tcmalloc: large alloc 1511432192 bytes == 0x56104f802000 @ 0x7f70271e31e7 0x56104d4847e2 0x56104d419368 0x56104d3f81f6 0x56104d3e40d6 0x7f702537cb97 0x56104d3e5b1a
tcmalloc: large alloc 7053344768 bytes == 0x5610a996c000 @ 0x7f70271e31e7 0x56104d4847e2 0x56104d46f6ca 0x56104d4700e8 0x56104d3f8213 0x56104d3e40d6 0x7f702537cb97 0x56104d3e5b1a
Does your python program terminate? If you replace python process.py with cat, does it work?
No, nothing changes. It still gets stuck at this point. I followed this tutorial: https://yidatao.github.io/2017-05-31/kenlm-ngram/
@kpu For example, if I run it on a corpus of one line:
marines están habilitando un emplazamiento donde reagrupar a unos digito digito digito combatientes de al qaida susceptibles de rendirse o de caer prisioneros
The problem stays the same. I even tried on Colab and hit the same problem.
Have you run
kenlm/build/bin/lmplz -S 8G -o 5 <README.md >spanish_5gram.arpa
And how much RAM do you have?
For the corpus in a plain text file, I think I found why it didn't work: the redirection operators were missing, < before text.txt and > after it. So the command must be kenlm/build/bin/lmplz -S 8G -o 5 <text.txt >spanish_5gram.arpa
But if the corpus is compressed, for example in a .tar file, I don't know how to fix it. @kpu, do you have any idea? How could we run the model without uncompressing the file?
bzcat clean_corpus.tar.bz2 | python process.py | kenlm/build/bin/lmplz -S 8G -o 5 > spanish_5gram.arpa
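One thing worth checking here: bzcat only removes the bzip2 layer, so for a .tar.bz2 what reaches process.py is still a tar archive (member headers and padding included), not plain text. Whether that is related to the hang is not established in this thread. Below is a minimal sketch of streaming the text out of the archive without unpacking it to disk, assuming the members are UTF-8 text files. The name stream_tar_text is a hypothetical helper, not part of kenlm.

```python
import tarfile


def stream_tar_text(archive, out, encoding="utf-8"):
    """Write the text of every regular file in a tar stream to `out`.

    `archive` is a binary file object (for example sys.stdin.buffer).
    Mode "r|*" reads the archive sequentially and detects the compression,
    so nothing is unpacked to disk. Returns the number of lines written.
    """
    lines = 0
    with tarfile.open(fileobj=archive, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other non-file entries
            extracted = tar.extractfile(member)
            for raw in extracted:
                text = raw.decode(encoding, errors="replace")
                if not text.endswith("\n"):
                    text += "\n"  # keep the last line of one file apart from the next file
                out.write(text)
                lines += 1
    return lines
```

Called as stream_tar_text(sys.stdin.buffer, sys.stdout) from a small script, this would take the place of bzcat at the front of the pipe. With GNU tar, tar -xjOf clean_corpus.tar.bz2 should give the same stream from the shell.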
Does this work?
cat README.md |build/bin/lmplz --discount_fallback -o 5 >/dev/null
@kpu No, it doesn't change anything.
What does it print?
It prints the same as before:
=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal. Using slower read() instead of mmap(). No progress bar
Something doesn't smell right and I'm unable to reproduce this. Is this running on Windows or something?
I just ran into the same problem. I am running the program on Ubuntu.