[NeMo 2.6] RNNT ASR inference fails on A100 (CUDA 12.8, PyTorch 2.9) with CUDA Graphs error `CUDA failure! 35`
Describe the bug
Running RNNT ASR inference with NeMo 2.6.0 on an NVIDIA A100 (CUDA 12.8, PyTorch 2.9.1+cu128) fails during decoding due to a CUDA Graphs initialization error:
```
Transcribing: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/scratch/amlt_code/empty_template.py", line 3, in <module>
    output = asr_model.transcribe(['2086-149220-0033.wav'])
  File "/home/aiscuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/models/rnnt_models.py", line 306, in transcribe
    return super().transcribe(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 270, in transcribe
    for processed_outputs in generator:
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 370, in transcribe_generator
    processed_outputs = self._transcribe_output_processing(model_outputs, transcribe_cfg)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/models/rnnt_models.py", line 944, in _transcribe_output_processing
    hyp = self.decoding.rnnt_decoder_predictions_tensor(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 717, in rnnt_decoder_predictions_tensor
    hypotheses_list = self.decoding(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 201, in __call__
    return self.forward(*args, **kwargs)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/core/classes/common.py", line 1204, in wrapped_call
    outputs = wrapped(*args, **kwargs)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 760, in forward
    hypotheses = self._greedy_decode(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 791, in _greedy_decode_blank_as_pad_loop_labels
    batched_hyps, alignments, batched_state = self.decoding_computer(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/label_looping_base.py", line 217, in __call__
    return self.cuda_graphs_impl(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 686, in cuda_graphs_impl
    self._graph_reinitialize(encoder_output, encoder_output_length)
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 863, in _graph_reinitialize
    self._full_graph_compile()
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 946, in _full_graph_compile
    capture_status, _, graph, _, _, _ = cu_call(
  File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/core/utils/cuda_python_utils.py", line 101, in cu_call
    raise Exception(f"CUDA failure! {error}")
Exception: CUDA failure! 35
```
Steps/Code to reproduce bug
Reproduction Code
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_conformer_transducer_large")
output = asr_model.transcribe(["2086-149220-0033.wav"])
print(output)
```
Environment overview
| Component | Version |
|---|---|
| GPU | NVIDIA A100 80GB PCIe |
| Driver | 570.133.20 |
| CUDA Runtime | 12.8 |
| PyTorch | 2.9.1+cu128 |
| Torchaudio | 2.3.1 |
| NeMo | 2.6.0 |
| Python | 3.10.14 |
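For reference, most of the table above was collected with a quick snippet like the one below. This is only a rough sketch: the driver and CUDA runtime versions came from `nvidia-smi`, and I am assuming `nemo.__version__` is exposed as in recent releases.

```python
import platform

import torch
import torchaudio
import nemo

# Print the Python-visible parts of the environment table.
# (Driver and CUDA runtime versions were taken from `nvidia-smi` instead.)
print("Python    :", platform.python_version())
print("PyTorch   :", torch.__version__)
print("Torch CUDA:", torch.version.cuda)
print("Torchaudio:", torchaudio.__version__)
print("NeMo      :", nemo.__version__)
print("GPU       :", torch.cuda.get_device_name(0))
```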
FYI, I am also getting a very similar error with a similar (but not quite identical) setup.
Environment overview
| Component | Version |
|---|---|
| GPU | NVIDIA A100-SXM4-80GB |
| Driver | 570.148.08 |
| CUDA Runtime | 12.8 |
| torch | 2.8.0+cu128 |
| torchaudio | 2.8.0+cu128 |
| NeMo | v2.6.0 tag |
| Python | 3.10.12 |
Reproducing code
```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
WAV_FILE = 'notebooks/test_data/test_file.wav'
asr_output = asr_model.transcribe([WAV_FILE])
```
Stack trace
```
Traceback (most recent call last):
  File "./gputest.py", line 6, in <module>
    asr_output = asr_model.transcribe([WAV_FILE])
  File "/opt/conda/envs/nemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/models/rnnt_models.py", line 306, in transcribe
    return super().transcribe(
  File "/opt/conda/envs/nemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/mixins/transcription.py", line 270, in transcribe
    for processed_outputs in generator:
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/mixins/transcription.py", line 370, in transcribe_generator
    processed_outputs = self._transcribe_output_processing(model_outputs, transcribe_cfg)
  File "/home/jovyan/NeMo/nemo/collections/asr/models/rnnt_models.py", line 944, in _transcribe_output_processing
    hyp = self.decoding.rnnt_decoder_predictions_tensor(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 717, in rnnt_decoder_predictions_tensor
    hypotheses_list = self.decoding(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 201, in __call__
    return self.forward(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/core/classes/common.py", line 1204, in wrapped_call
    outputs = wrapped(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 760, in forward
    hypotheses = self._greedy_decode(
  File "/opt/conda/envs/nemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 791, in _greedy_decode_blank_as_pad_loop_labels
    batched_hyps, alignments, batched_state = self.decoding_computer(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/label_looping_base.py", line 217, in __call__
    return self.cuda_graphs_impl(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 686, in cuda_graphs_impl
    self._graph_reinitialize(encoder_output, encoder_output_length)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 863, in _graph_reinitialize
    self._full_graph_compile()
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 946, in _full_graph_compile
    capture_status, _, graph, _, _, _ = cu_call(
  File "/home/jovyan/NeMo/nemo/core/utils/cuda_python_utils.py", line 101, in cu_call
    raise Exception(f"CUDA failure! {error}")
Exception: CUDA failure! 35
```
Workarounds

- Yes, downgrading NeMo to v2.5.3 fixes the problem. Unfortunately, since I want to work with the latest multitalker models, which were added in v2.6.0, that is not an option for me.
- Adding

  ```python
  asr_model.decoding.decoding.decoding_computer.disable_cuda_graphs()
  ```

  right before `transcribe` also avoids the problem (see the full sketch below), presumably at the expense of not getting the speedups from https://arxiv.org/abs/2406.06220 😢
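For completeness, the second workaround as a self-contained script. This is only a minimal sketch of what I ran; the attribute chain `decoding.decoding.decoding_computer` is the one from greedy batched RNNT decoding and may differ for other decoding configurations.

```python
import nemo.collections.asr as nemo_asr

# Load the same pretrained RNNT model used in the reproduction above.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_conformer_transducer_large")

# Workaround 2: turn off CUDA Graphs in the greedy label-looping decoder before
# calling transcribe(). This avoids the failing graph capture at the cost of
# slower decoding.
asr_model.decoding.decoding.decoding_computer.disable_cuda_graphs()

output = asr_model.transcribe(["2086-149220-0033.wav"])
print(output)
```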
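A config-level variant of the same workaround that avoids reaching into private attributes is sketched below. Note this is only an assumption on my part: `use_cuda_graph_decoder` is what I believe the greedy decoding config field is called, and it may differ between NeMo versions.

```python
import copy

import nemo.collections.asr as nemo_asr
from omegaconf import open_dict

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_conformer_transducer_large")

# Rebuild the decoding strategy with the CUDA-graph decoder switched off.
# NOTE: `use_cuda_graph_decoder` is an assumed field name; check the greedy
# decoding config of your NeMo version before relying on it.
decoding_cfg = copy.deepcopy(asr_model.cfg.decoding)
with open_dict(decoding_cfg):
    decoding_cfg.greedy.use_cuda_graph_decoder = False
asr_model.change_decoding_strategy(decoding_cfg)

output = asr_model.transcribe(["2086-149220-0033.wav"])
print(output)
```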
Thank you so much! Downgrading NeMo to v2.5.3 solved the issue for me.
@dorispei, @utunga Please check https://github.com/NVIDIA-NeMo/NeMo/pull/15173. That PR adds a fallback option to use native PyTorch CUDA graphs if full graph compilation fails. It should be a bit slower than the default full CUDA graphs, but still preserves most of the speed (unlike the workaround of disabling CUDA graphs entirely).
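For anyone unfamiliar with the terms: "native PyTorch CUDA graphs" refers to capture and replay through `torch.cuda.CUDAGraph`, as opposed to the driver-level capture done through cuda-python (the `cuda_python_utils.py` path in the traceback). The snippet below is not NeMo code, just a tiny standalone illustration of that capture/replay pattern with a stand-in matmul in place of the decoder step.

```python
import torch

assert torch.cuda.is_available()

static_in = torch.randn(4, 8, device="cuda")
weight = torch.randn(8, 8, device="cuda")

# Warm up on a side stream before capture, as recommended by the PyTorch docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = static_in @ weight
torch.cuda.current_stream().wait_stream(s)

# Capture the "decoder step" into a CUDA graph (no real kernels run here).
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = static_in @ weight

# Replay with new data by copying into the captured input tensor.
static_in.copy_(torch.randn(4, 8, device="cuda"))
graph.replay()
print(static_out.shape)
```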
Thanks @artbataev, that looks great. I see that it has been merged to main, so I did a git pull (from this repo) and commented out my workaround:
```python
# Commented out - so I guess CUDA graphs are enabled..?
# asr_model.decoding.decoding.decoding_computer.disable_cuda_graphs()
```
Looks like it still failed to compile in this context (doing a fine-tune), though. Cf. this comment:
```
[NeMo I 2025-12-13 02:59:52 asr_model:209] CUDA graphs disabled for EncDecMultiTalkerRNNTBPEModel::RNNTBPEDecoding::GreedyBatchedRNNTInfer
Epoch 0: 1/? [00:04<00:00, 0.22it/s, v_num=fchv, train_step_timing in s=3.770]
[NeMo I 2025-12-13 03:01:44 asr_model:224] CUDA graphs enabled for EncDecMultiTalkerRNNTBPEModel::RNNTBPEDecoding::GreedyBatchedRNNTInfer
[NeMo W 2025-12-13 03:01:45 rnnt_label_looping:688] Full CUDA graph compilation failed: CUDA failure! 35. Falling back to native PyTorch CUDA graphs. Decoding will be slower.
[NeMo I 2025-12-13 03:01:48 metric:549]
```
So it looks like training is going ahead, just slower.
Appreciate the work on this, thought I'd provide the feedback. Thanks!