STT Bug: GPU memory (quickly) full error

Open SuperKogito opened this issue 2 years ago • 7 comments

Describe the bug After following the Training-quickstart-docu, the installation verification fails even though my system fulfills the requirements. My question: how much GPU RAM do I need to get started? 8 GB doesn't seem to be enough. Or maybe someone has had a similar issue and can point me toward the cause.

To Reproduce Steps to reproduce the behavior:

Manual Setup installation
Run $ ./bin/run-ldc93s1.sh
My error is the following:

(coqui-stt-train-venv) am@cuda:/trainingdata/DeepSpeech/STT$ ./bin/run-ldc93s1.sh
+ [ ! -f train.py ]
+ [ ! -f data/smoke_test/ldc93s1.csv ]
+ checkpoint_dir=/home/am/.local/share/stt/ldc93s1
+ export CUDA_VISIBLE_DEVICES=1
+ python3 -m coqui_stt_training.train --alphabet_config_path data/alphabet.txt --train_cudnn true --show_progressbar false --train_files data/smoke_test/ldc93s1.csv --test_files data/smoke_test/ldc93s1.csv --train_batch_size 1 --test_batch_size 1 --n_hidden 100 --epochs 200 --checkpoint_dir /home/am/.local/share/stt/ldc93s1
I Performing dummy training to check for memory problems.
I If the following process crashes, you likely have batch sizes that are too big for your available system memory (or GPU memory).
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
Traceback (most recent call last):
  File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
	 [[{{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 724, in <module>
    main()
  File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 694, in main
    train()
  File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 330, in train
    train_impl(epochs=1, reverse=True, limit=Config.train_batch_size * 3, write=False)
  File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 445, in train_impl
    load_or_init_graph_for_training(session, silent=silent_load)
  File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/util/checkpoints.py", line 219, in load_or_init_graph_for_training
    _load_or_init_impl(session, methods, allow_drop_layers=True, silent=silent)
  File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/util/checkpoints.py", line 194, in _load_or_init_impl
    return _initialize_all_variables(session)
  File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/util/checkpoints.py", line 110, in _initialize_all_variables
    session.run(v.initializer)
  File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
	 [[node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams':
  File "usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 724, in <module>
    main()
  File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 694, in main
    train()
  File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 330, in train
    train_impl(epochs=1, reverse=True, limit=Config.train_batch_size * 3, write=False)
  File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 391, in train_impl
    iterator, optimizer, dropout_rates
  File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 173, in get_tower_results
    iterator, dropout_rates, reuse=i > 0
  File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 91, in calculate_mean_edit_distance_and_loss
    batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl
  File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/deepspeech_model.py", line 232, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/deepspeech_model.py", line 135, in rnn_impl_cudnn_rnn
    inputs=x, sequence_lengths=seq_length
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 824, in __call__
    self._maybe_build(inputs)
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2146, in _maybe_build
    self.build(input_shapes)
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 355, in build
    opaque_params_t = self._canonical_to_opaque(weights, biases)
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 502, in _canonical_to_opaque
    direction=self._direction)
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1582, in cudnn_rnn_canonical_to_opaque_params
    name=name)
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 930, in cudnn_rnn_canonical_to_params
    seed=seed, seed2=seed2, name=name)
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Expected behavior To run the verification error free.

Environment (please complete the following information):

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux cuda 4.15.0-171-generic #180-Ubuntu SMP Wed Mar 2 17:25:05 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
TensorFlow installed from (our builds, or upstream TensorFlow): pip
TensorFlow version (use command below): tensorflow-gpu 1.15.4
Python version: Python 3.6.9 (default, Dec 8 2021, 21:08:43)
Bazel version (if compiling from source): no
GCC/Compiler version (if compiling from source): no
CUDA/cuDNN version: CUDA Version 10.0.130 / cuDNN 7.6.5
GPU model and memory: GeForce RTX 2070 with 8 GB
Exact command to reproduce: $ ./bin/run-ldc93s1.sh

Additional context I use almost the same Python environment setup on the same machine with DeepSpeech v0.9.3, and it works.

Mar 21 '22 13:03 SuperKogito

The error message indicates you don't have a working setup of CUDA and/or cuDNN:

tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
	 [[{{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}]]
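One quick way to narrow this down is to check whether the CUDA and cuDNN shared libraries can be loaded at all from the same Python environment. The following is a minimal sketch, not part of Coqui STT: the function name `check_libraries` is hypothetical, and the sonames are assumptions based on the versions reported above (CUDA 10.0, cuDNN 7.x) on Linux.

```python
import ctypes

# Sonames assumed from the versions reported in this issue
# (CUDA 10.0, cuDNN 7.x); adjust them for other installations.
LIBRARIES = ("libcudart.so.10.0", "libcublas.so.10.0", "libcudnn.so.7")


def check_libraries(names=LIBRARIES):
    """Return {soname: None if it loads, else the loader's error message}."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the same search path as the dynamic loader
            results[name] = None
        except OSError as exc:
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    for name, error in check_libraries().items():
        status = "ok" if error is None else "FAILED ({})".format(error)
        print("{}: {}".format(name, status))
```

A library that fails to load here points at the installation or at `LD_LIBRARY_PATH`. A library that loads can still fail to initialize at runtime, so a clean result does not rule out the CUDA setup.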

Mar 23 '22 10:03 reuben

Thank you for your response :) This is a little confusing because it definitely works with DeepSpeech. Is a fresh install the only solution here (I am trying to avoid messing with my CUDA setup)?

Mar 23 '22 10:03 SuperKogito

Can you try with the training Docker image to double check it's not related to the CUDA setup?

Mar 23 '22 19:03 reuben

I just tried to run the verification with the Docker version, and the results seem to confirm that it is related to my CUDA setup. The code runs correctly, but my GPU is not used at all (nvidia-smi shows zeroes); it runs on my CPU instead. I tried replacing tensorflow with tensorflow-gpu inside the container using pip uninstall tensorflow; pip install tensorflow-gpu, and that didn't change anything. (I don't work with Docker often, so I may not have done it correctly. Was this the right approach?)

Mar 24 '22 17:03 SuperKogito

@SuperKogito In this case I think you forgot to set up Docker to work with your NVIDIA GPU. The official Docker image of Coqui-STT already comes with everything you need to train with your GPU; there is no need to install tensorflow-gpu, it is already installed. See https://docs.docker.com/config/containers/resource_constraints/#gpu to access your GPU with Docker using nvidia-container-runtime.
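To confirm whether the container can see the GPU at all, the nvidia-smi readings mentioned above can be collected from inside the container while training runs. This is a minimal sketch under assumptions: the helper names `parse_gpu_rows` and `query_gpus` are hypothetical, and it relies only on nvidia-smi's standard `--query-gpu` and `--format` options.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def _to_int(text):
    """Convert a numeric field to int; return None for values like '[N/A]'."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_gpu_rows(output):
    """Parse nvidia-smi CSV output into a list of dicts, one per GPU."""
    gpus = []
    for line in output.strip().splitlines():
        # Split from the right so a comma in the GPU name cannot shift fields.
        name, used, total, util = [f.strip() for f in line.rsplit(",", 3)]
        gpus.append({
            "name": name,
            "memory_used_mib": _to_int(used),
            "memory_total_mib": _to_int(total),
            "utilization_pct": _to_int(util),
        })
    return gpus


def query_gpus():
    """Return the parsed GPU list, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    proc = subprocess.run(QUERY, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, universal_newlines=True)
    if proc.returncode != 0:
        return None
    return parse_gpu_rows(proc.stdout)


if __name__ == "__main__":
    print(query_gpus())
```

If `query_gpus()` returns `None` inside the container, the container was probably started without GPU access (for example, without Docker's `--gpus` flag described at the link above). If it lists the GPU but memory use and utilization stay at zero during training, the process is likely falling back to the CPU.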

Apr 25 '22 17:04 wasertech

Thank you for your answer @wasertech. This was a great point to raise, but unfortunately, when running the tests my GPUs are still showing zeros (nvidia-smi) and are not being used.

Apr 26 '22 13:04 SuperKogito

Update: After a system upgrade and a fresh installation of CUDA & cuDNN, I managed to run the test via Docker. It detected my GPU and worked as expected. However, the issue with building and running from source still persists. I suggest closing this for now since I seem to be the only one getting this issue at the moment.

Jun 23 '22 15:06 SuperKogito
