marian-dev
CUDA OOM Error 2 during validation when using --workspace flag
Bug description
I'm trying to train a model with a corpus of 55M sentences and a devset of 6k sentences. I use a shared sentencepiece vocabulary with size 60k.
The process crashes with Error: CUDA error 2 'out of memory'
A large amount of time is spent shuffling and reading sentences from disk, which I expect since that step is limited by I/O. Training starts fine and can run for several updates, but it crashes during validation. Marian correctly calculates the ce-mean-words metric on the devset but fails with bleu-detok.
When it tries to perform validation, I observe several large consecutive tcmalloc allocations, each of which increases the memory allocated to the marian process. htop shows the allocated memory rising to 95-98%. After this phase, the physical memory allocated falls to zero, while the virtual memory for the process goes up. Then another large tcmalloc allocation starts and the physical memory usage increases again. This cycle repeats until the virtual memory used by the process is around 250GB and the process dies.
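For anyone who wants to log this pattern rather than watch it in htop, here is a minimal sketch that polls physical (VmRSS) and virtual (VmSize) memory for a process. It assumes Linux with /proc available; the function names parse_status and monitor are hypothetical helpers, not part of Marian.

```python
import re
import time


def parse_status(text):
    """Extract VmRSS and VmSize (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"(VmRSS|VmSize):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def monitor(pid, interval=1.0):
    """Print resident and virtual memory of `pid` until the process exits."""
    path = f"/proc/{pid}/status"  # Linux-only assumption
    while True:
        try:
            with open(path) as handle:
                mem = parse_status(handle.read())
        except (FileNotFoundError, ProcessLookupError):
            print("process exited")
            break
        rss_gib = mem.get("VmRSS", 0) / 1024 ** 2   # kB -> GiB
        virt_gib = mem.get("VmSize", 0) / 1024 ** 2  # kB -> GiB
        print(f"{time.strftime('%H:%M:%S')} rss={rss_gib:.1f}GiB virt={virt_gib:.1f}GiB")
        time.sleep(interval)
```

Calling monitor with the PID of the marian process while validation runs should show whether resident memory drops while virtual memory keeps growing, as described above.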
I tried keeping the corpus size constant and reducing the workspace memory step by step.
Even with -w 8000 the process crashes.
Context
- Marian version:
Marian v1.9.61 d490d461 2021-01-26 02:40:03 -0800
- GPU: Nvidia Tesla T4 16GB
- Log file: train.log
Thanks in advance for the help!