Bug: GPU memory (quickly) full error
Describe the bug After following the training quickstart documentation, the installation verification fails even though my system fulfills the requirements. My question: how much GPU RAM do I need to get started? 8 GB doesn't seem to be enough. Or maybe someone has had a similar issue and can point me toward the cause.
To Reproduce Steps to reproduce the behavior:
- Manual Setup installation
- Run
$ ./bin/run-ldc93s1.sh
- My error is the following:
(coqui-stt-train-venv) am@cuda:/trainingdata/DeepSpeech/STT$ ./bin/run-ldc93s1.sh
+ [ ! -f train.py ]
+ [ ! -f data/smoke_test/ldc93s1.csv ]
+ checkpoint_dir=/home/am/.local/share/stt/ldc93s1
+ export CUDA_VISIBLE_DEVICES=1
+ python3 -m coqui_stt_training.train --alphabet_config_path data/alphabet.txt --train_cudnn true --show_progressbar false --train_files data/smoke_test/ldc93s1.csv --test_files data/smoke_test/ldc93s1.csv --train_batch_size 1 --test_batch_size 1 --n_hidden 100 --epochs 200 --checkpoint_dir /home/am/.local/share/stt/ldc93s1
I Performing dummy training to check for memory problems.
I If the following process crashes, you likely have batch sizes that are too big for your available system memory (or GPU memory).
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
Traceback (most recent call last):
File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
[[{{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 724, in <module>
main()
File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 694, in main
train()
File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 330, in train
train_impl(epochs=1, reverse=True, limit=Config.train_batch_size * 3, write=False)
File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 445, in train_impl
load_or_init_graph_for_training(session, silent=silent_load)
File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/util/checkpoints.py", line 219, in load_or_init_graph_for_training
_load_or_init_impl(session, methods, allow_drop_layers=True, silent=silent)
File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/util/checkpoints.py", line 194, in _load_or_init_impl
return _initialize_all_variables(session)
File "/trainingdata/DeepSpeech/STT/training/coqui_stt_training/util/checkpoints.py", line 110, in _initialize_all_variables
session.run(v.initializer)
File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
[[node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for 'tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams':
File "usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 724, in <module>
main()
File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 694, in main
train()
File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 330, in train
train_impl(epochs=1, reverse=True, limit=Config.train_batch_size * 3, write=False)
File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 391, in train_impl
iterator, optimizer, dropout_rates
File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 173, in get_tower_results
iterator, dropout_rates, reuse=i > 0
File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/train.py", line 91, in calculate_mean_edit_distance_and_loss
batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl
File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/deepspeech_model.py", line 232, in create_model
output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
File "trainingdata/DeepSpeech/STT/training/coqui_stt_training/deepspeech_model.py", line 135, in rnn_impl_cudnn_rnn
inputs=x, sequence_lengths=seq_length
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 824, in __call__
self._maybe_build(inputs)
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2146, in _maybe_build
self.build(input_shapes)
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 355, in build
opaque_params_t = self._canonical_to_opaque(weights, biases)
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 502, in _canonical_to_opaque
direction=self._direction)
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1582, in cudnn_rnn_canonical_to_opaque_params
name=name)
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 930, in cudnn_rnn_canonical_to_params
seed=seed, seed2=seed2, name=name)
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "trainingdata/DeepSpeech/coqui-stt-train-venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
Expected behavior The verification should run without errors.
Environment (please complete the following information):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux cuda 4.15.0-171-generic #180-Ubuntu SMP Wed Mar 2 17:25:05 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- TensorFlow installed from (our builds, or upstream TensorFlow): pip
- TensorFlow version (use command below): tensorflow-gpu 1.15.4
- Python version: Python 3.6.9 (default, Dec 8 2021, 21:08:43)
- Bazel version (if compiling from source): no
- GCC/Compiler version (if compiling from source): no
- CUDA/cuDNN version: CUDA Version 10.0.130 / cuDNN 7.6.5
- GPU model and memory: GeForce RTX 2070 with 8 GB
- Exact command to reproduce:
$ ./bin/run-ldc93s1.sh
Additional context I use almost the same Python environment setup on the same machine with DeepSpeech v0.9.3, and it works.
The error message indicates you don't have a working setup of CUDA and/or cuDNN:
tensorflow.python.framework.errors_impl.UnknownError: Fail to find the dnn implementation.
[[{{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}]]
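One quick way to narrow this down is to check whether the dynamic loader can open cuDNN at all from inside the training virtualenv. The sketch below is only a rough check, not part of Coqui STT: it assumes the library is named libcudnn.so.7 (the soname used by cuDNN 7.x on Linux) and the helper names cudnn_version and describe are made up for illustration. A successful load only shows that the library is findable; it does not prove TensorFlow can initialise it on the GPU.

```python
import ctypes


def cudnn_version(library_name="libcudnn.so.7"):
    """Return the integer version reported by cuDNN, or None if it cannot be loaded.

    Assumption: cuDNN 7.x on Linux, whose shared library is libcudnn.so.7.
    """
    try:
        lib = ctypes.CDLL(library_name)
    except OSError:
        # The dynamic loader could not find or open the library.
        return None
    # cudnnGetVersion() takes no arguments and returns a size_t.
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return int(lib.cudnnGetVersion())


def describe(version):
    """Turn cuDNN 7's integer version (major*1000 + minor*100 + patch) into text."""
    if version is None:
        return "cuDNN not loadable"
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return f"cuDNN {major}.{minor}.{patch}"


if __name__ == "__main__":
    print(describe(cudnn_version()))
```

If this reports that cuDNN is not loadable, the library path (for example LD_LIBRARY_PATH) is a likely place to look; if it loads and reports a version, the problem is more likely in how TensorFlow initialises the GPU.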
Thank you for your response :) This is a little confusing because it definitely works with DeepSpeech. Is a fresh install the only solution here (I am trying to avoid messing with my CUDA setup)?
Can you try with the training Docker image to double check it's not related to the CUDA setup?
I just tried to run the verification with the Docker version, and the results seem to confirm that it is related to my CUDA setup.
Well, the code runs correctly, but my GPU is not used at all (nvidia-smi shows zeroes) and it is using my CPU. I tried to replace tensorflow with tensorflow-gpu inside the Docker container using pip uninstall tensorflow; pip install tensorflow-gpu, and that didn't change a thing. I don't work with Docker often, so I may not have done it correctly. Was this the right approach?
@SuperKogito In this case, I think you forgot to set up Docker to work with your NVIDIA GPU.
The official Coqui STT Docker image already comes with everything you need to train on your GPU. There is no need to install tensorflow-gpu; it is already installed.
See https://docs.docker.com/config/containers/resource_constraints/#gpu to access your GPU with Docker using nvidia-container-runtime.
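As a rough illustration of what that looks like in practice, the sketch below builds a docker run command that passes the host GPUs into the container with the --gpus flag (available since Docker 19.03, and it needs the NVIDIA container runtime installed on the host). The helper docker_gpu_command is hypothetical, and the image name is an assumption; substitute whichever training image you actually pulled.

```python
import shlex


def docker_gpu_command(image, inner_command, gpus="all"):
    """Build a `docker run` argument list that exposes host GPUs to the container."""
    return ["docker", "run", "--rm", "--gpus", gpus, image] + list(inner_command)


if __name__ == "__main__":
    # Assumption: the image name below; replace it with the image you use.
    cmd = docker_gpu_command("ghcr.io/coqui-ai/stt-train:latest", ["nvidia-smi"])
    # Print the command so it can be reviewed before running it by hand.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Running nvidia-smi inside the container first is a cheap sanity check: if it cannot see the GPU there, the training script will not see it either.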
Thank you for your answer @wasertech.
This was a great point to raise, but unfortunately, when running the tests, my GPUs are still showing zeros (nvidia-smi) and are not being used.
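For anyone who wants to watch this while the test runs, here is a minimal sketch that polls nvidia-smi through its CSV query mode and prints per-GPU memory and utilisation. The function names parse_gpu_rows and query_gpus are made up for illustration, and the sketch assumes nvidia-smi is on the PATH.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["index", "name", "memory.used", "memory.total", "utilization.gpu"]


def parse_gpu_rows(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts of strings."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # Skip lines that do not match the requested fields.
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(gpu)
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        print("nvidia-smi is not usable here:", error)
```

Memory and utilisation values that stay at zero for the whole run would suggest the process never attached to that GPU.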
Update: After a system upgrade and a fresh installation of CUDA and cuDNN, I managed to run the test via Docker. It detected my GPU and worked as expected. However, the issue with building and running from source still persists. I suggest closing this for now, since it seems I am the only one hitting this issue at the moment.