[NeMo 2.6] RNNT ASR inference fails on A100 (CUDA 12.8, PyTorch 2.9) with CUDA Graphs error `CUDA failure! 35`
Describe the bug
Running RNNT ASR inference with NeMo 2.6.0 on an NVIDIA A100 (CUDA 12.8, PyTorch 2.9.1+cu128) fails during decoding due to a CUDA Graphs initialization error:
```
Transcribing: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/scratch/amlt_code/empty_template.py", line 3, in <module>
    output = asr_model.transcribe(['2086-149220-0033.wav'])
  File "/home/aiscuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/models/rnnt_models.py", line 306, in transcribe
    return super().transcribe(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 270, in transcribe
    for processed_outputs in generator:
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 370, in transcribe_generator
    processed_outputs = self._transcribe_output_processing(model_outputs, transcribe_cfg)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/models/rnnt_models.py", line 944, in _transcribe_output_processing
    hyp = self.decoding.rnnt_decoder_predictions_tensor(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 717, in rnnt_decoder_predictions_tensor
    hypotheses_list = self.decoding(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 201, in __call__
    return self.forward(*args, **kwargs)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/core/classes/common.py", line 1204, in wrapped_call
    outputs = wrapped(*args, **kwargs)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 760, in forward
    hypotheses = self._greedy_decode(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 791, in _greedy_decode_blank_as_pad_loop_labels
    batched_hyps, alignments, batched_state = self.decoding_computer(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/label_looping_base.py", line 217, in __call__
    return self.cuda_graphs_impl(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 686, in cuda_graphs_impl
    self._graph_reinitialize(encoder_output, encoder_output_length)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 863, in _graph_reinitialize
    self._full_graph_compile()
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 946, in _full_graph_compile
    capture_status, _, graph, _, _, _ = cu_call(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/core/utils/cuda_python_utils.py", line 101, in cu_call
    raise Exception(f"CUDA failure! {error}")
Exception: CUDA failure! 35
```
Steps/Code to reproduce bug
Reproduction Code
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_conformer_transducer_large")
output = asr_model.transcribe(["2086-149220-0033.wav"])
print(output)
```
Environment overview
| Component | Version |
|---|---|
| GPU | NVIDIA A100 80GB PCIe |
| Driver | 570.133.20 |
| CUDA Runtime | 12.8 |
| PyTorch | 2.9.1+cu128 |
| Torchaudio | 2.3.1 |
| NeMo | 2.6.0 |
| Python | 3.10.14 |
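For reference, most of the table above was collected with a quick snippet like the one below. This is only a rough sketch: the driver and CUDA runtime versions came from `nvidia-smi`, and I am assuming `nemo.__version__` is exposed as in recent releases.

```python
import platform

import torch
import torchaudio
import nemo

# Print the Python-visible parts of the environment table.
# (Driver and CUDA runtime versions were taken from `nvidia-smi` instead.)
print("Python    :", platform.python_version())
print("PyTorch   :", torch.__version__)
print("Torch CUDA:", torch.version.cuda)
print("Torchaudio:", torchaudio.__version__)
print("NeMo      :", nemo.__version__)
print("GPU       :", torch.cuda.get_device_name(0))
```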
FYI, I am also getting a very similar error with a similar (but not quite identical) setup.
Environment overview
| Component | Version |
|---|---|
| GPU | NVIDIA A100-SXM4-80GB |
| Driver | 570.148.08 |
| CUDA Runtime | 12.8 |
| torch | 2.8.0+cu128 |
| torchaudio | 2.8.0+cu128 |
| NeMo | v2.6.0 tag |
| Python | 3.10.12 |
Reproducing code
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
WAV_FILE = 'notebooks/test_data/test_file.wav'
asr_output = asr_model.transcribe([WAV_FILE])
```
Stack trace
```
Traceback (most recent call last):
  File "./gputest.py", line 6, in <module>
    asr_output = asr_model.transcribe([WAV_FILE])
  File "/opt/conda/envs/nemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/models/rnnt_models.py", line 306, in transcribe
    return super().transcribe(
  File "/opt/conda/envs/nemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/mixins/transcription.py", line 270, in transcribe
    for processed_outputs in generator:
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/mixins/transcription.py", line 370, in transcribe_generator
    processed_outputs = self._transcribe_output_processing(model_outputs, transcribe_cfg)
  File "/home/jovyan/NeMo/nemo/collections/asr/models/rnnt_models.py", line 944, in _transcribe_output_processing
    hyp = self.decoding.rnnt_decoder_predictions_tensor(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 717, in rnnt_decoder_predictions_tensor
    hypotheses_list = self.decoding(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 201, in __call__
    return self.forward(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/core/classes/common.py", line 1204, in wrapped_call
    outputs = wrapped(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 760, in forward
    hypotheses = self._greedy_decode(
  File "/opt/conda/envs/nemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 791, in _greedy_decode_blank_as_pad_loop_labels
    batched_hyps, alignments, batched_state = self.decoding_computer(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/label_looping_base.py", line 217, in __call__
    return self.cuda_graphs_impl(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 686, in cuda_graphs_impl
    self._graph_reinitialize(encoder_output, encoder_output_length)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 863, in _graph_reinitialize
    self._full_graph_compile()
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 946, in _full_graph_compile
    capture_status, _, graph, _, _, _ = cu_call(
  File "/home/jovyan/NeMo/nemo/core/utils/cuda_python_utils.py", line 101, in cu_call
    raise Exception(f"CUDA failure! {error}")
Exception: CUDA failure! 35
```
Workarounds

- Yes, downgrading NeMo to v2.5.3 fixes the problem. Unfortunately, since I want to work with the latest multitalker models, which were added in v2.6.0, that is not an option for me.
- Adding

  ```python
  asr_model.decoding.decoding.decoding_computer.disable_cuda_graphs()
  ```

  right before `transcribe` also avoids the problem (see the full sketch below), presumably at the expense of not getting the speedups from https://arxiv.org/abs/2406.06220 😢
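For completeness, the second workaround as a self-contained script. This is only a minimal sketch of what I ran; the attribute chain `decoding.decoding.decoding_computer` is the one from greedy batched RNNT decoding and may differ for other decoding configurations.

```python
import nemo.collections.asr as nemo_asr

# Load the same pretrained RNNT model used in the reproduction above.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_conformer_transducer_large")

# Workaround 2: turn off CUDA Graphs in the greedy label-looping decoder before
# calling transcribe(). This avoids the failing graph capture at the cost of
# slower decoding.
asr_model.decoding.decoding.decoding_computer.disable_cuda_graphs()

output = asr_model.transcribe(["2086-149220-0033.wav"])
print(output)
```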
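A config-level variant of the same workaround that avoids reaching into private attributes is sketched below. Note this is only an assumption on my part: `use_cuda_graph_decoder` is what I believe the greedy decoding config field is called, and it may differ between NeMo versions.

```python
import copy

import nemo.collections.asr as nemo_asr
from omegaconf import open_dict

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_conformer_transducer_large")

# Rebuild the decoding strategy with the CUDA-graph decoder switched off.
# NOTE: `use_cuda_graph_decoder` is an assumed field name; check the greedy
# decoding config of your NeMo version before relying on it.
decoding_cfg = copy.deepcopy(asr_model.cfg.decoding)
with open_dict(decoding_cfg):
    decoding_cfg.greedy.use_cuda_graph_decoder = False
asr_model.change_decoding_strategy(decoding_cfg)

output = asr_model.transcribe(["2086-149220-0033.wav"])
print(output)
```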
Thank you so much! Downgrading NeMo to v2.5.3 solved the issue for me.
@dorispei, @utunga Please check https://github.com/NVIDIA-NeMo/NeMo/pull/15173. That PR adds a fallback option to use native PyTorch CUDA graphs if full graph compilation fails. It should be a bit slower than the default full CUDA graphs, but still preserves most of the speed (unlike the workaround of disabling CUDA graphs entirely).
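For anyone unfamiliar with the terms: "native PyTorch CUDA graphs" refers to capture and replay through `torch.cuda.CUDAGraph`, as opposed to the driver-level capture done through cuda-python (the `cuda_python_utils.py` path in the traceback). The snippet below is not NeMo code, just a tiny standalone illustration of that capture/replay pattern with a stand-in matmul in place of the decoder step.

```python
import torch

assert torch.cuda.is_available()

static_in = torch.randn(4, 8, device="cuda")
weight = torch.randn(8, 8, device="cuda")

# Warm up on a side stream before capture, as recommended by the PyTorch docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = static_in @ weight
torch.cuda.current_stream().wait_stream(s)

# Capture the "decoder step" into a CUDA graph (no real kernels run here).
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = static_in @ weight

# Replay with new data by copying into the captured input tensor.
static_in.copy_(torch.randn(4, 8, device="cuda"))
graph.replay()
print(static_out.shape)
```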
Thanks @artbataev, that looks great. I see that it has been merged to main, so I did a git pull (from this repo) and commented out my workaround:
```python
# Commented out - so I guess CUDA graphs are enabled..?
# asr_model.decoding.decoding.decoding_computer.disable_cuda_graphs()
```
Looks like it still failed to compile in this context (doing a fine-tune), though. Cf. this comment:
```
[NeMo I 2025-12-13 02:59:52 asr_model:209] CUDA graphs disabled for EncDecMultiTalkerRNNTBPEModel::RNNTBPEDecoding::GreedyBatchedRNNTInfer
Epoch 0: 1/? [00:04<00:00, 0.22it/s, v_num=fchv, train_step_timing in s=3.770]
[NeMo I 2025-12-13 03:01:44 asr_model:224] CUDA graphs enabled for EncDecMultiTalkerRNNTBPEModel::RNNTBPEDecoding::GreedyBatchedRNNTInfer
[NeMo W 2025-12-13 03:01:45 rnnt_label_looping:688] Full CUDA graph compilation failed: CUDA failure! 35. Falling back to native PyTorch CUDA graphs. Decoding will be slower.
[NeMo I 2025-12-13 03:01:48 metric:549]
```
So it looks like training is going ahead, just slower.
Appreciate the work on this, thought I'd provide the feedback. Thanks!