rnnt-speech-recognition

RAM OOM Problem

Open kjh21212 opened this issue 4 years ago • 11 comments

When I run your code I get a RAM OOM during the eval part, and I don't know why this happens. My desktop has 128GB of RAM and I'm using 4 GPUs. Memory usage increases with every eval batch, and the 4-GPU batch processing is also slower than a single GPU.

kjh21212 avatar Apr 14 '20 01:04 kjh21212

@kjh21212 I'm facing the same RAM issue, were you able to solve it?

prajwaljpj avatar Apr 19 '20 12:04 prajwaljpj

I have the same issue. My system is:

RAM: 128GB
GPU: GTX 1080ti * 4
OS: Ubuntu 18.04
NVIDIA Driver: 440.82
CUDA: 10.1
CUDNN: 7.6.5
Python: 3.6.9
tensorflow & tensorflow-gpu: 2.1.0

(And I did not change any param in run_common_voice.py.)

When I run the run_common_voice.py code, this is what happens:

  1. At the 0th epoch, Eval_step runs with a retracing warning, and then I get the OOM error.

  2. With evaluation disabled at the 0th epoch:
     2-1. When there is a retracing warning (slow): Epoch: 0, Batch: 60, Global Step: 60, Step Time: 26.0310, Loss: 165.6244
     2-2. When there is no retracing warning (fast): Epoch: 0, Batch: 62, Global Step: 62, Step Time: 6.3741, Loss: 164.6387

    Then I get the OOM error after this line: Epoch: 0, Batch: 226, Global Step: 226, Step Time: 5.9092, Loss: 142.7257 ...

I think something related to tf.function is affecting the training speed.

Does the retracing warning have a connection with the OOM error? --> If so, how can I solve the retracing warning? --> If not, how can I solve the OOM error?
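
For context: as far as I understand, tf.function retraces whenever it is called with a new input shape or new Python arguments, which variable-length audio batches trigger constantly, and every retrace builds and keeps another graph in host memory. Below is a minimal, self-contained sketch (toy stand-in model and shapes, not the repo's actual Eval_step) of pinning an input_signature so only one trace is ever built:

import tensorflow as tf

# Toy stand-ins just so the sketch runs on its own -- not the real RNN-T model.
model = tf.keras.Sequential([tf.keras.layers.Dense(30)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Declaring the variable batch/time dims up front means every call reuses
# the same traced graph instead of retracing per batch.
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, None, 80], dtype=tf.float32),  # log-mel features
    tf.TensorSpec(shape=[None, None], dtype=tf.int32),        # framewise labels (toy)
])
def eval_step(features, labels):
    logits = model(features, training=False)
    return loss_fn(labels, logits)

# Two differently shaped batches -> still only a single trace, no warning.
print(eval_step(tf.random.normal([4, 100, 80]), tf.zeros([4, 100], tf.int32)))
print(eval_step(tf.random.normal([4, 150, 80]), tf.zeros([4, 150], tf.int32)))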

NAM-hj avatar Apr 23 '20 01:04 NAM-hj

@nambee Seems like there's something going on with GradientTape, RNN layers or TFRecords. I implemented DeepSpeech2 with a TFRecord dataset in Keras: when I trained it using the .fit function there was no OOM error, but when I trained it with GradientTape, the memory kept going up until OOM. However, when I trained SEGAN (no recurrent network, only Conv) with a generator dataset using GradientTape, it worked fine.
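
For context, the kind of custom loop I mean is roughly the following (a minimal, self-contained sketch with toy stand-ins, not the actual DeepSpeech2 code): a tf.function train step driven by a plain Python for loop over the dataset.

import tensorflow as tf

# Toy stand-ins so the sketch runs on its own -- not the real DeepSpeech2 model.
model = tf.keras.Sequential([tf.keras.layers.LSTM(64, return_sequences=True),
                             tf.keras.layers.Dense(30)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([8, 50, 80]), tf.zeros([8, 50], tf.int32))).batch(4)

@tf.function
def train_step(features, labels):
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# The plain Python loop over the dataset -- the pattern where host RAM kept
# growing for me with a recurrent model, unlike model.fit().
for features, labels in train_dataset:
    train_step(features, labels)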

nglehuy avatar May 10 '20 13:05 nglehuy

Please try again with the latest commit. I have updated it to use TensorFlow 2.2.0 and solved the retracing issue.

noahchalifour avatar May 14 '20 18:05 noahchalifour

@noahchalifour I just executed the current repository code with one GPU. I am also running into the OOM error, using a GeForce GTX 1080 Ti card.

stefan-falk avatar May 26 '20 07:05 stefan-falk

I have figured out that if we use tf.data.TFRecordDataset, then wrapping the whole dataset loop with @tf.function can avoid the RAM OOM (and also train faster), like:

@tf.function
def train():
    for batch in train_dataset:
        train_step(batch)

The downside of this trick is that we can't use native Python functions or tf functions that are unimplemented in graph mode (like tf.train.Checkpoint.save()). However, we can use tf.py_function or tf.numpy_function to run them, but then we have to run tf.distribute.Server if we want to train on multiple GPUs; this limitation is mentioned here: https://www.tensorflow.org/api_docs/python/tf/numpy_function?hl=en
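
For example, a checkpoint save can be pushed back out to Python from inside the traced loop like this (a rough, self-contained sketch with toy stand-ins; the callback returns a dummy value so tf.numpy_function has an output dtype):

import numpy as np
import tensorflow as tf

# Toy stand-ins so this runs on its own; swap in the real model and dataset.
model = tf.keras.Sequential([tf.keras.layers.Dense(30)])
model.build([None, 80])  # create the variables eagerly, before tracing
optimizer = tf.keras.optimizers.Adam()
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
train_dataset = tf.data.Dataset.from_tensor_slices(tf.random.normal([8, 80])).batch(4)

def save_ckpt():
    # Runs eagerly (outside the graph), so Checkpoint.save() is fine here.
    ckpt.save("./checkpoints/ckpt")
    return np.int64(0)  # dummy return so numpy_function has a defined output

@tf.function
def train():
    for batch in train_dataset:
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(model(batch)))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Escape back to Python for the eager-only save.
    tf.numpy_function(save_ckpt, [], tf.int64)

train()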

nglehuy avatar Jun 15 '20 09:06 nglehuy

@usimarit Are you able to train/use the model? I can only afford a very small batch size (4-8 samples) when running on a single GeForce 1080 Ti (~11 GB RAM) and I am not even sure if it's working.

How long did you have to train your model?

stefan-falk avatar Jul 15 '20 10:07 stefan-falk

@usimarit Are you able to train/use the model? I can only afford a very small batch size (4-8 samples) when running on a single GeForce 1080 Ti (~11 GB RAM) and I am not even sure if it's working.

How long did you have to train your model?

I guess a small batch size is normal for ASR models. I trained a CTC model on an RTX 2080 Ti (11G) with a ~300-hour dataset, and it took 3 days for 12 epochs with batch size 4. But this issue is about RAM OOM, not GPU VRAM OOM :)) I've tested multiple times using TFRecordDataset, and it seems like there are some bugs when iterating over it with a Python for loop.

nglehuy avatar Jul 15 '20 13:07 nglehuy

@usimarit Oh, I misinterpreted the issue then.

Yeah, that's the batch size I am using too. I didn't expect such a small batch size to work out :)

stefan-falk avatar Jul 16 '20 10:07 stefan-falk

Please try again with the latest commit. I have updated it to use TensorFlow 2.2.0 and solved the retracing issue.

@noahchalifour But I'm also facing the problem, even when using TensorFlow 2.2.0 and the latest commit.

malixian avatar Nov 14 '20 08:11 malixian

I have figured out that if we use tf.data.TFRecordDataset, then wrapping the whole dataset loop with @tf.function can avoid the RAM OOM (and also train faster), like:

@tf.function
def train():
    for batch in train_dataset:
        train_step(batch)

The downside of this trick is that we can't use native Python functions or tf functions that are unimplemented in graph mode (like tf.train.Checkpoint.save()). However, we can use tf.py_function or tf.numpy_function to run them, but then we have to run tf.distribute.Server if we want to train on multiple GPUs; this limitation is mentioned here: https://www.tensorflow.org/api_docs/python/tf/numpy_function?hl=en

@usimarit I have tried it, but it still doesn't work

malixian avatar Nov 14 '20 08:11 malixian