
RAM leak with KenLM

Open adamnsandle opened this issue 6 years ago • 14 comments

Hello! For some reason our 3 GB Russian KenLM ARPA model (binarized) uses ~50 GB of RAM during CTCBeamDecoder class initialization and decoding (beam width 100). When using the KenLM Python module with this model, everything is fine! The model was trained on a large Russian corpus (37 labels).
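For reference, the memory jump can be observed around the decoder construction itself (a minimal sketch; psutil, the model path, and the inline label list are illustrative assumptions, not our exact code):

    # Measure resident memory before and after constructing the decoder.
    import psutil
    from ctcdecode import CTCBeamDecoder

    labels = list('АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ2_ *')  # 37 labels

    proc = psutil.Process()
    rss_before = proc.memory_info().rss
    decoder = CTCBeamDecoder(labels, model_path='web_all_norm.arpa.bin',  # hypothetical path
                             alpha=0.3, beta=0.4, beam_width=100)
    rss_after = proc.memory_info().rss
    print(f'RSS grew by {(rss_after - rss_before) / 2**30:.1f} GiB during init')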

adamnsandle avatar Aug 04 '19 10:08 adamnsandle

@SeanNaren would be really great if you could help out with this

@adamnsandle Which tokens do you use, how much data do you use to train the model, how do you train the KenLM model, and how do you initialize the class?

snakers4 avatar Aug 04 '19 11:08 snakers4

What we tried to do:

  • cut long sentences from the KenLM training corpus (<100 / <200 / < inf)
  • cut number of sentences (100M, 200M, 800M)
  • lowercase/uppercase
  • different string endings ( '. \n', '\n')
  • different types of binarization (trie, probing)

Overall, RAM consumption drops with fewer or shorter sentences, but stays far too high: ~20 GB with a 100 MB model.

Labels used - 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ2_ * ' (the basic Russian alphabet, plus 2 as a special symbol for a repeated letter and * as a string-end symbol)

CTC class initialization:

from ctcdecode import CTCBeamDecoder
dcdr = CTCBeamDecoder(labels, lm_path, alpha=0.3, beta=0.4, cutoff_top_n=20, cutoff_prob=1.0, beam_width=100, num_processes=6, blank_id=labels.index('_'))

(we tried different num_processes and beam_width values; it did not help)
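For completeness, the decode call that follows this initialization looks roughly like this (a minimal sketch; the dummy input tensor and its sizes are illustrative assumptions):

    # decode() expects acoustic-model output of shape (batch, seq_len, num_labels).
    import torch

    batch, seq_len = 1, 50
    probs = torch.softmax(torch.randn(batch, seq_len, len(labels)), dim=-1)
    beam_results, beam_scores, timesteps, out_lens = dcdr.decode(probs)
    best = beam_results[0][0][:out_lens[0][0]]  # top beam of the first batch item
    print(''.join(labels[i] for i in best))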

adamnsandle avatar Aug 04 '19 19:08 adamnsandle

Did you guys try turning this ARPA file into a trie file? Check the examples here

nvm, just saw the above comment. Realistically you should always build a binary; using the raw ARPA is probably overkill.

Also, you should definitely look into pruning specific n-grams when building the LM if you guys haven't done so already. This can be seen here

SeanNaren avatar Aug 04 '19 20:08 SeanNaren

@adamnsandle Also which kenlm command did you use to train an LM?

@SeanNaren Which command do you use for your models?

snakers4 avatar Aug 05 '19 05:08 snakers4

@SeanNaren We used this command to train the model: bin/lmplz -o 4 -S 50% -T temp/ --prune 0 30 60 130 --discount_fallback < web_all_norm.txt > web_all_norm.arpa

And to binarize it, one of:

  • ./build_binary -S 5G trie web_all_norm.arpa web_all_norm.arpa.bin
  • ./build_binary -S 5G trie -q 8 web_all_norm.arpa web_all_norm.arpa.bin
  • ./build_binary web_all_norm.arpa web_all_norm.arpa.bin

adamnsandle avatar Aug 05 '19 07:08 adamnsandle

How big is the output trie?

SeanNaren avatar Aug 05 '19 11:08 SeanNaren

2.05 GB

When we try to use a smaller model (~200 MB trie), the RAM leak is still present (~30 GB).

adamnsandle avatar Aug 05 '19 12:08 adamnsandle

Hi, I want to know what the KenLM model is based on: is it word-based or character-based? Thank you very much! @adamnsandle @SeanNaren

CXiaoDing avatar Sep 09 '19 12:09 CXiaoDing

+1 this issue. Even using the DeepSpeech 1 LM binary causes massive RAM use.

SwapnilDreams100 avatar Nov 03 '19 07:11 SwapnilDreams100

I believe this is caused by an internal trie creation on model loading, which then stays in memory and consumes a lot of RAM. Mozilla in their version saves this trie to an external file, and doesn't generate it "on the fly".
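That would also square with the original report that the plain kenlm Python bindings handle the same file fine: they just memory-map the binary and stay close to its on-disk size. A quick sanity check along those lines (a minimal sketch; the model path and test sentence are assumptions):

    import kenlm

    # Memory-maps the binarized model instead of building a trie in RAM.
    lm = kenlm.Model('web_all_norm.arpa.bin')
    print(lm.score('ПРИВЕТ МИР', bos=True, eos=True))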

buriy avatar Nov 04 '19 06:11 buriy

Loading model...
Traceback (most recent call last):
  File "examples/demo-server.py", line 10, in <module>
    import beamdecode
  File "/home/pi/masr/examples/../beamdecode.py", line 29, in <module>
    blank_index,
  File "/home/pi/.local/lib/python3.7/site-packages/ctcdecode/__init__.py", line 18, in __init__
    self._num_labels)
RuntimeError: third_party/kenlm/util/mmap.cc:122 in void* util::MapOrThrow(std::size_t, bool, int, bool, int, uint64_t) threw ErrnoException because `(ret = mmap(__null, size, protect, flags, fd, offset)) == ((void *) -1)'.
Cannot allocate memory mmap failed for size 2953349384 at offset 0

baicaitongee avatar Apr 03 '20 04:04 baicaitongee

I have a similar problem, see issue #137

baicaitongee avatar Apr 03 '20 04:04 baicaitongee

+1 this issue.

jonatasgrosman avatar Jun 27 '21 16:06 jonatasgrosman

+1. Is there a fix?

tobiolatunji avatar Jan 17 '22 21:01 tobiolatunji