wenet

training error

Open wangjg19 opened this issue 3 years ago • 5 comments

I installed WeNet following the documentation. When training the Conformer model, CUDA memory usage keeps growing as more batches are processed, for example from 18000 MB to 29000 MB after an hour of training. Shortly afterwards, training exits with an out-of-memory error. Has anyone else had this problem?

wangjg19 avatar Nov 05 '21 09:11 wangjg19

You could try reducing the `batch_size` parameter in the configuration file (conf/train_.yaml) and restarting the training.

Freddy-pp avatar Nov 05 '21 12:11 Freddy-pp

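Solutions you can try:

  1. reduce `batch_size` [link](https://github.com/wenet-e2e/wenet/blob/main/examples/aishell/s0/conf/train_conformer.yaml#L65)
  2. reduce `max_length` [link](https://github.com/wenet-e2e/wenet/blob/main/examples/aishell/s0/conf/train_conformer.yaml#L39)
  3. use `dynamic` batch_type [link](https://github.com/wenet-e2e/wenet/blob/main/examples/aishell/s0/conf/train_conformer.yaml#L64)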

xingchensong avatar Nov 06 '21 03:11 xingchensong
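
To illustrate why the third suggestion can help, here is a minimal sketch of static versus dynamic batching over utterance lengths. It is not WeNet's implementation: the function names and the `max_frames` parameter are hypothetical, and it assumes every utterance in a batch is padded to the longest one, so the padded size is `longest * count`.

```python
from typing import Iterable, Iterator, List


def static_batches(lengths: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a fixed number of utterances per batch, regardless of their length."""
    buf: List[int] = []
    for n in lengths:
        buf.append(n)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf:
        yield buf


def dynamic_batches(lengths: Iterable[int], max_frames: int) -> Iterator[List[int]]:
    """Group utterances so the padded size (longest * count) stays within max_frames.

    `max_frames` is a hypothetical name for the per-batch frame budget.
    An utterance longer than the budget still forms a batch on its own.
    """
    buf: List[int] = []
    longest = 0
    for n in lengths:
        new_longest = max(longest, n)
        if buf and new_longest * (len(buf) + 1) > max_frames:
            # Adding this utterance would exceed the budget: emit and start over.
            yield buf
            buf, longest = [n], n
        else:
            buf.append(n)
            longest = new_longest
    if buf:
        yield buf


def padded_frames(batch: List[int]) -> int:
    """Frames actually held in memory once the batch is padded to its longest item."""
    return max(batch) * len(batch)


lengths = [300, 320, 1500, 1600, 200, 2000]  # made-up utterance lengths in frames
static = list(static_batches(lengths, batch_size=3))
dynamic = list(dynamic_batches(lengths, max_frames=3200))

print("static :", [padded_frames(b) for b in static])
print("dynamic:", [padded_frames(b) for b in dynamic])
```

With a fixed batch size, the padded size of a batch depends on whichever long utterance happens to land in it, so peak memory can jump late in training. With a frame budget, long utterances simply end up in smaller batches.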

> Solutions you can try:
>
> 1. reduce `batch_size` [link](https://github.com/wenet-e2e/wenet/blob/main/examples/aishell/s0/conf/train_conformer.yaml#L65)
> 2. reduce `max_length` [link](https://github.com/wenet-e2e/wenet/blob/main/examples/aishell/s0/conf/train_conformer.yaml#L39)
> 3. use `dynamic` batch_type [link](https://github.com/wenet-e2e/wenet/blob/main/examples/aishell/s0/conf/train_conformer.yaml#L64)

Thanks. My current settings are: `batch_size = 16`, `max_length = 20480`, `batch_type = 'dynamic'`.

The problem still remains.

wangjg19 avatar Nov 06 '21 09:11 wangjg19

Try setting `batch_size=8` or `max_length=10240`.

YuLong-Liang avatar Feb 26 '22 08:02 YuLong-Liang
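
As a rough way to compare these two suggestions, the sketch below computes worst-case sizes for the settings mentioned in this thread. It rests on assumptions not confirmed here: `batch_size` is the number of utterances per batch, `max_length` is counted in feature frames, every utterance is padded to the longest, and self-attention scores grow with the square of sequence length (subsampling and the number of heads are ignored as constant factors). The helper names are hypothetical.

```python
def worst_case_padded_frames(batch_size: int, max_length: int) -> int:
    """Upper bound on frames in one padded batch: every utterance at max_length."""
    return batch_size * max_length


def worst_case_attention_entries(batch_size: int, max_length: int) -> int:
    """Upper bound on attention score entries, which scale with length squared."""
    return batch_size * max_length ** 2


settings = {
    "reported (16, 20480)": (16, 20480),
    "batch_size=8": (8, 20480),
    "max_length=10240": (16, 10240),
}

for name, (bs, ml) in settings.items():
    frames = worst_case_padded_frames(bs, ml)
    attn = worst_case_attention_entries(bs, ml)
    print(f"{name:22s} frames={frames:>8d} attention_entries={attn:>12d}")
```

Under these assumptions both changes halve the worst-case number of padded frames, but halving `max_length` cuts the length-squared attention term to a quarter, while halving `batch_size` only halves it.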

@xingchensong Hello, if I train with 8 kHz data, `resample_rate` should be set to 8000, right? And should `max_length` also be smaller to avoid OOM?

kli017 avatar Jun 17 '22 06:06 kli017

Yes.

xingchensong avatar Feb 21 '23 05:02 xingchensong