deepspeech
Training process is killed because of OOM
I have been training the deepspeech model with neon on the LibriSpeech data, but the process is always killed because of OOM. My machine has 24 GB of RAM and a GeForce GTX 1070 with 8 GB of GPU memory.
I found this message with dmesg:

```
[3017506.733819] Out of memory: Kill process 25635 (python) score 974 or sacrifice child
[3017506.736861] Killed process 25635 (python) total-vm:55518724kB, anon-rss:23902876kB, file-rss:154436kB
```
Is neon leaking memory, or does it simply require more memory to train?
The command I run is:

```
python train.py --manifest train:/bigdata/lili/deepspeech/librispeech/train-clean-100/train-manifest.csv --manifest val:/bigdata/lili/deepspeech/librispeech/train-clean-100/val-manifest.csv -e 20 -z 16 -s models -b gpu
```
Try reducing the batch size.
I changed the batch size to 8, but the process is still killed:

```
[3256824.391743] Killed process 9666 (python) total-vm:53893188kB, anon-rss:23892380kB, file-rss:152808kB
```

It still uses too much memory.
I suspect the source of the problem is unrelated to the model size. With the default parameters and the command you posted above, I get the following GPU memory footprints:
| batch size | GPU memory footprint |
|---|---|
| 32 | 6949 MB |
| 16 | 3915 MB |
| 8 | 2415 MB |
So your 8GB GPU has the capacity to handle a batch size of up to 32.
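If you want to verify the footprint on your own card, polling `nvidia-smi` from a separate process while training runs is a simple way to do it. This is just a sketch, not part of neon; the one-second polling interval and the single-GPU assumption are mine:

```python
import subprocess
import time

# Poll overall GPU memory usage once a second while train.py runs in another shell.
# Assumes a single GPU; memory.used covers every process on the device.
while True:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"])
    used_mb, total_mb = [int(x) for x in out.decode().strip().splitlines()[0].split(",")]
    print("GPU memory used: %d MB / %d MB" % (used_mb, total_mb))
    time.sleep(1)
```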
So what's wrong? From /var/log (and the dmesg output above), it seems this python process used 23892380 kB (~23 GB) of CPU memory, not GPU memory:

```
[3256824.391743] Killed process 9666 (python) total-vm:53893188kB, anon-rss:23892380kB, file-rss:152808kB
```
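To see whether host-side memory grows steadily over the course of training (which would point at a leak in the data loading pipeline rather than at the model itself), you could watch the process's resident set size with something like the sketch below. It assumes `psutil` is installed and that you pass the PID of the running train.py process; the 30-second interval is arbitrary:

```python
import sys
import time

import psutil  # assumed installed: pip install psutil

# Usage: python watch_rss.py <pid-of-train.py>
pid = int(sys.argv[1])
proc = psutil.Process(pid)

# Log the resident set size (host RAM actually in use) until the process exits.
while proc.is_running():
    rss_gb = proc.memory_info().rss / 1024.0 ** 3
    print("host RSS: %.2f GB" % rss_gb)
    time.sleep(30)
```

If the RSS climbs steadily from epoch to epoch instead of plateauing, that would be consistent with a leak on the host side.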