tcmalloc: large alloc 2818572288 bytes == 0x33daa000 @
```
[2019-12-06 00:18:07] [data] Loading vocabulary from JSON/Yaml file /191206/source_vocab.yml
[2019-12-06 00:18:08] [data] Setting vocabulary size for input 0 to 328116
[2019-12-06 00:18:08] [data] Loading vocabulary from JSON/Yaml file /191206/target_vocab.yml
[2019-12-06 00:18:09] [data] Setting vocabulary size for input 1 to 225581
[2019-12-06 00:18:10] [memory] Extending reserved space to 2048 MB (device gpu0)
[2019-12-06 00:18:10] Training started
[2019-12-06 00:18:10] [data] Shuffling files
[2019-12-06 00:18:10] [data] Done reading 1754741 sentences
[2019-12-06 00:18:14] [data] Done shuffling 1754741 sentences to temp files
[2019-12-06 00:18:14] [memory] Reserving 1615 MB, device gpu0
[2019-12-06 00:18:15] [memory] Reserving 1615 MB, device gpu0
tcmalloc: large alloc 2147483648 bytes == 0x33daa000 @
tcmalloc: large alloc 2281701376 bytes == 0x33daa000 @
tcmalloc: large alloc 2415919104 bytes == 0x33daa000 @
tcmalloc: large alloc 2550136832 bytes == 0x33daa000 @
tcmalloc: large alloc 2684354560 bytes == 0x33daa000 @
tcmalloc: large alloc 2818572288 bytes == 0x33daa000 @
tcmalloc: large alloc 2952790016 bytes == 0x33daa000 @
tcmalloc: large alloc 3087007744 bytes == 0x33daa000 @
tcmalloc: large alloc 3221225472 bytes == 0x33daa000 @
tcmalloc: large alloc 3355443200 bytes == 0x33daa000 @
tcmalloc: large alloc 3489660928 bytes == 0x33daa000 @
[2019-12-06 00:18:34] [memory] Reserving 3231 MB, device gpu0
tcmalloc: large alloc 4026531840 bytes == 0x33daa000 @
tcmalloc: large alloc 4429185024 bytes == 0x33daa000 @
[2019-12-06 00:18:54] Error: CUDA error 2 'out of memory' - /marian/src/tensors/gpu/device.cu:32: cudaMalloc(&data_, size)
[2019-12-06 00:18:54] Error: Aborted from virtual void marian::gpu::Device::reserve(size_t) in /marian/src/tensors/gpu/device.cu:32
[CALL STACK]
[0xb70bb7]
[0x5d028c]
[0x66a074]
```
This is a French model I trained. Memory usage increases significantly during training, resulting in out-of-memory errors. A German model trained on the same amount of data does not have this problem. What is the cause of this problem?
Thanks
Could you provide the command/config you use? More details would be helpful, e.g. what model do you use and how large is it? What is your workspace? Do you train with `--mini-batch-fit`?
Is this the only process running on the GPU?
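For reference, a minimal sketch of those two options (the file names and the workspace size are only placeholders, not a recommendation for your setup):

```
# --workspace is the memory (in MB) Marian pre-allocates per device;
# --mini-batch-fit sizes mini-batches dynamically so they fit into that workspace.
./build/marian \
  --train-sets corpus.en corpus.de \
  --vocabs vocab.en.yml vocab.de.yml \
  --model model.npz \
  --devices 0 \
  --workspace 9000 \
  --mini-batch-fit
```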
Hi snukky,
Thank you for your support.
```
./build/marian --train-sets /Marian/1_ForTrain/TagRemoved_Source_Train_2560035.tok.en /Marian/1_ForTrain/TagRemoved_Target_Train_2560035.tok.de --vocabs /Marian/source_vocab.yml /Marian/target_vocab.yml --model /Marian/pre-train_model.npz --devices 0 --dim-emb 500 --after-epochs 13 --max-length 70 --max-length-crop
```
The training file has 2560035 lines. I used a GTX 1080 Ti to train this engine.
Your vocabularies are huge, is that planned? Normally we would use something 10x smaller. This explains your model size, due to the embedding matrices (rough numbers below):
```
[2019-12-06 00:18:08] [data] Setting vocabulary size for input 0 to 328116
[2019-12-06 00:18:09] [data] Setting vocabulary size for input 1 to 225581
```
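As a rough back-of-envelope check (my own numbers, assuming fp32 parameters and the `--dim-emb 500` from your command), each embedding matrix alone is already several hundred MB, before gradients and the Adam moments multiply that further:

```
# size of one fp32 embedding matrix: vocab_size * dim_emb * 4 bytes, shown in MB
echo "source embeddings: $(( 328116 * 500 * 4 / 1024 / 1024 )) MB"   # ~625 MB
echo "target embeddings: $(( 225581 * 500 * 4 / 1024 / 1024 )) MB"   # ~430 MB
```

The output layer typically adds another matrix of the target-vocabulary size on top of that.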
Hi emjotde,
I built the vocabularies with the `./build/marian-vocab` command. Commas and periods are not split off from the words.
You need to tokenize your data first; I also recommend using subword segmentation. Look at these examples (and the minimal BPE sketch below the links):
- https://github.com/marian-nmt/marian-examples/tree/336740065d9c23e53e912a1befff18981d9d27ab/training-basics
- https://github.com/marian-nmt/marian-examples/tree/336740065d9c23e53e912a1befff18981d9d27ab/training-basics-sentencepiece
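For illustration, a minimal BPE sketch roughly along the lines of the training-basics example above; the file names and the merge count (32000) are placeholders, not your actual data:

```
pip install subword-nmt
# learn a joint BPE model on the tokenized training data
cat corpus.tok.en corpus.tok.de | subword-nmt learn-bpe -s 32000 > bpe.codes
# apply it to both sides, then build the vocabularies from the segmented text
subword-nmt apply-bpe -c bpe.codes < corpus.tok.en > corpus.bpe.en
subword-nmt apply-bpe -c bpe.codes < corpus.tok.de > corpus.bpe.de
./build/marian-vocab < corpus.bpe.en > source_vocab.yml
./build/marian-vocab < corpus.bpe.de > target_vocab.yml
```

With subword units the vocabularies end up with tens of thousands of entries instead of hundreds of thousands.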
Hi emjotde,
As you said, I have already tokenized the data. Via the command `./build/marian-vocab < /Marian/1_ForTrain/TagRemoved_Source_Train_3318741.tok.en > /Marian/source_vocab.yml`, the resulting vocabulary still contains the symbols attached to the words.
Subword segmentation is your best bet here. See the provided examples for either BPE or SentencePiece. Also, your tokenizer doesn't seem to be particularly good if it kept those words together.
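If it helps, a typical tokenization step (similar to what the linked examples do with the Moses scripts) looks roughly like this; the path to mosesdecoder is a placeholder:

```
# normalize punctuation and tokenize with the Moses scripts
cat raw.en \
  | perl mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l en \
  | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
  > corpus.tok.en
```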
To complement Marcin's response, replacing
`--vocabs /Marian/source_vocab.yml /Marian/target_vocab.yml`
with
`--vocabs /Marian/source_vocab.spm /Marian/target_vocab.spm`
should solve the issue, but following the examples mentioned above will allow for better understanding of data pre-processing for NMT.
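A sketch of what that could look like in the training command, assuming Marian was built with SentencePiece support (`-DUSE_SENTENCEPIECE=on`); the `.spm` models are then trained from the data automatically if they do not exist yet, and the vocabulary sizes below are only illustrative:

```
./build/marian \
  --train-sets /Marian/1_ForTrain/TagRemoved_Source_Train_2560035.tok.en \
               /Marian/1_ForTrain/TagRemoved_Target_Train_2560035.tok.de \
  --vocabs /Marian/source_vocab.spm /Marian/target_vocab.spm \
  --dim-vocabs 32000 32000 \
  --model /Marian/model.npz \
  --devices 0 --dim-emb 500 --max-length 70 --max-length-crop
```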
I personally fixed the `tcmalloc: large alloc ...` messages by updating CUDA.
Make sure to completely remove previous installations.
Installation instructions can be found here: https://askubuntu.com/questions/799184/how-can-i-install-cuda-on-ubuntu-16-04
Hm. The `tcmalloc: large alloc ...` thing isn't really anything that needs to be fixed. It is just an unnecessary log message by Google's libtcmalloc whenever it allocates a larger (actually not that large) chunk of memory. It can be relatively safely ignored. It should also not go away from updating CUDA; these are rather unrelated.
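If the log lines themselves are distracting, gperftools' tcmalloc has, as far as I remember, an environment variable that raises the reporting threshold, e.g.:

```
# report large allocations only above ~8 GB; this only silences the messages,
# it does not change any allocation behaviour
TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=8589934592 ./build/marian -c config.yml
```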
In my case, it's not the log text that bothers me. I had a memory crash during training: memory usage increases suddenly between epochs while I have enough GPU memory available.
I use a GTX 1080 Ti with 11 GB of memory. I allocate a 1 GB workspace and it still crashes. Before crashing it shows me the `tcmalloc: large alloc ...` messages.
The only fix that worked for me was updating CUDA.
Hi! I've run into this problem. I've observed that decreasing `--mini-batch` kind of mitigates the problem. But why does this problem happen? Why doesn't the memory usage stop increasing? Does Marian apply some kind of cache, or is the problem just related to the fr-en model?
My command is:
```
/home/cgarcia/Documentos/experiment_crawling/marian/marian-dev/build/marian-decoder \
  -c /home/cgarcia/Documentos/experiment_crawling/marian/students/fren/fren.student.tiny11/config.intgemm8bitalpha.yml \
  --quiet --max-length-crop --cpu-threads 64 --mini-batch 8
```
UPDATE:
It seems that if `--cpu-threads` is decreased, it kind of mitigates the problem too.
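For what it's worth, a sketch combining both mitigations on top of the command above (the values are only what I would try first, not tested numbers):

```
/home/cgarcia/Documentos/experiment_crawling/marian/marian-dev/build/marian-decoder \
  -c /home/cgarcia/Documentos/experiment_crawling/marian/students/fren/fren.student.tiny11/config.intgemm8bitalpha.yml \
  --quiet --max-length-crop --cpu-threads 16 --mini-batch 8
```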