
RAM leak with KenLM

Open adamnsandle opened this issue 6 years ago • 14 comments

Hello! For some reason our 3 GB Russian KenLM ARPA model (binarized) uses ~50 GB of RAM during CTCBeamDecoder class initialization and decoding (beam width 100). When using the KenLM Python module with this model, everything is fine! The model was trained on a large Russian corpus (37 labels).
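For reference, the memory jump can be observed around the decoder construction itself (a minimal sketch; psutil, the model path, and the inline label list are illustrative assumptions, not our exact code):

    # Measure resident memory before and after constructing the decoder.
    import psutil
    from ctcdecode import CTCBeamDecoder

    labels = list('АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ2_ *')  # 37 labels

    proc = psutil.Process()
    rss_before = proc.memory_info().rss
    decoder = CTCBeamDecoder(labels, model_path='web_all_norm.arpa.bin',  # hypothetical path
                             alpha=0.3, beta=0.4, beam_width=100)
    rss_after = proc.memory_info().rss
    print(f'RSS grew by {(rss_after - rss_before) / 2**30:.1f} GiB during init')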

adamnsandle avatar Aug 04 '19 10:08 adamnsandle

@SeanNaren would be really great if you could help out with this

@adamnsandle Which tokens do you use, how much data do you use to train the model, how do you train the KenLM model, and how do you initialize the class?

snakers4 avatar Aug 04 '19 11:08 snakers4

What we tried to do:

  • cut long sentences from the KenLM training corpus (<100 / <200 / < inf)
  • cut number of sentences (100M, 200M, 800M)
  • lowercase/uppercase
  • different string endings ( '. \n', '\n')
  • different types of binarization (trie, probing)

Overall, RAM consumption drops with fewer or shorter sentences, but stays far too high: ~20 GB with a 100 MB model.

Labels used - 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ2_ * ' (the basic Russian alphabet, plus 2 as a special symbol for a repeated letter and * as a string-end symbol)

CTC class initialization:

from ctcdecode import CTCBeamDecoder
dcdr = CTCBeamDecoder(labels, lm_path, alpha=0.3, beta=0.4, cutoff_top_n=20, cutoff_prob=1.0, beam_width=100, num_processes=6, blank_id=labels.index('_'))

(we tried different num_processes and beam_width values; it did not help)
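For completeness, the decode call that follows this initialization looks roughly like this (a minimal sketch; the dummy input tensor and its sizes are illustrative assumptions):

    # decode() expects acoustic-model output of shape (batch, seq_len, num_labels).
    import torch

    batch, seq_len = 1, 50
    probs = torch.softmax(torch.randn(batch, seq_len, len(labels)), dim=-1)
    beam_results, beam_scores, timesteps, out_lens = dcdr.decode(probs)
    best = beam_results[0][0][:out_lens[0][0]]  # top beam of the first batch item
    print(''.join(labels[i] for i in best))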

adamnsandle avatar Aug 04 '19 19:08 adamnsandle

Did you guys try turning this ARPA file into a trie file? Check the examples here

nvm, just saw the above comment. Realistically you should always build a binary; using the raw ARPA is probably overkill.

Also, you should definitely look into pruning specific n-grams when building the LM if you guys haven't done so already. This can be seen here

SeanNaren avatar Aug 04 '19 20:08 SeanNaren

@adamnsandle Also which kenlm command did you use to train an LM?

@SeanNaren Which command do you use for your models?

snakers4 avatar Aug 05 '19 05:08 snakers4

@SeanNaren We used this command to train the model: bin/lmplz -o 4 -S 50% -T temp/ --prune 0 30 60 130 --discount_fallback < web_all_norm.txt > web_all_norm.arpa

And to binarize it, one of:

  • ./build_binary -S 5G trie web_all_norm.arpa web_all_norm.arpa.bin
  • ./build_binary -S 5G trie -q 8 web_all_norm.arpa web_all_norm.arpa.bin
  • ./build_binary web_all_norm.arpa web_all_norm.arpa.bin

adamnsandle avatar Aug 05 '19 07:08 adamnsandle

How big is the output trie?

SeanNaren avatar Aug 05 '19 11:08 SeanNaren

2.05 GB

When we try to use a smaller model (~200 MB trie), the RAM leak is still present (~30 GB).

adamnsandle avatar Aug 05 '19 12:08 adamnsandle

Hi, I want to know what the KenLM model is based on: is it word-based or character-based? Thank you very much! @adamnsandle @SeanNaren

CXiaoDing avatar Sep 09 '19 12:09 CXiaoDing

+1 this issue. Even using the DeepSpeech 1 LM binary causes massive RAM use.

SwapnilDreams100 avatar Nov 03 '19 07:11 SwapnilDreams100

I believe this is caused by an internal trie creation on model loading, which then stays in memory and consumes a lot of RAM. Mozilla in their version saves this trie to an external file, and doesn't generate it "on the fly".
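That would also square with the original report that the plain kenlm Python bindings handle the same file fine: they just memory-map the binary and stay close to its on-disk size. A quick sanity check along those lines (a minimal sketch; the model path and test sentence are assumptions):

    import kenlm

    # Memory-maps the binarized model instead of building a trie in RAM.
    lm = kenlm.Model('web_all_norm.arpa.bin')
    print(lm.score('ПРИВЕТ МИР', bos=True, eos=True))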

buriy avatar Nov 04 '19 06:11 buriy

Loading model...
Traceback (most recent call last):
  File "examples/demo-server.py", line 10, in <module>
    import beamdecode
  File "/home/pi/masr/examples/../beamdecode.py", line 29, in <module>
    blank_index,
  File "/home/pi/.local/lib/python3.7/site-packages/ctcdecode/__init__.py", line 18, in __init__
    self._num_labels)
RuntimeError: third_party/kenlm/util/mmap.cc:122 in void* util::MapOrThrow(std::size_t, bool, int, bool, int, uint64_t) threw ErrnoException because `(ret = mmap(__null, size, protect, flags, fd, offset)) == ((void *) -1)'.
Cannot allocate memory mmap failed for size 2953349384 at offset 0

baicaitongee avatar Apr 03 '20 04:04 baicaitongee

I have a similar problem, see issue #137

baicaitongee avatar Apr 03 '20 04:04 baicaitongee

+1 this issue.

jonatasgrosman avatar Jun 27 '21 16:06 jonatasgrosman

+1. Is there a fix?

tobiolatunji avatar Jan 17 '22 21:01 tobiolatunji