[NeMo 2.6] RNNT ASR inference fails on A100 (CUDA 12.8, PyTorch 2.9) with CUDA Graphs error CUDA failure! 35

Open · dorispei opened this issue 1 month ago · 2 comments

Describe the bug

Running RNNT ASR inference with NeMo 2.6.0 on an NVIDIA A100 (CUDA 12.8, PyTorch 2.9.1+cu128) fails during decoding due to a CUDA Graphs initialization error:

Transcribing: 0it [00:00, ?it/s]
Traceback (most recent call last):
File "/scratch/amlt_code/empty_template.py", line 3, in
output = asr_model.transcribe(['2086-149220-0033.wav'])
File "/home/aiscuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/models/rnnt_models.py", line 306, in transcribe
return super().transcribe(
File "/home/aiscuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 270, in transcribe
for processed_outputs in generator:
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 370, in transcribe_generator
processed_outputs = self._transcribe_output_processing(model_outputs, transcribe_cfg)
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/models/rnnt_models.py", line 944, in _transcribe_output_processing
hyp = self.decoding.rnnt_decoder_predictions_tensor(
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 717, in rnnt_decoder_predictions_tensor
hypotheses_list = self.decoding(
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 201, in call
return self.forward(*args, **kwargs)
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/core/classes/common.py", line 1204, in wrapped_call
outputs = wrapped(*args, **kwargs)
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 760, in forward
hypotheses = self._greedy_decode(
File "/home/aiscuser/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 791, in _greedy_decode_blank_as_pad_loop_labels
batched_hyps, alignments, batched_state = self.decoding_computer(
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/label_looping_base.py", line 217, in call
return self.cuda_graphs_impl(
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 686, in cuda_graphs_impl
self._graph_reinitialize(encoder_output, encoder_output_length)
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 863, in _graph_reinitialize
self._full_graph_compile()
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 946, in _full_graph_compile
capture_status, _, graph, _, _, _ = cu_call(
File "/home/aiscuser/.local/lib/python3.10/site-packages/nemo/core/utils/cuda_python_utils.py", line 101, in cu_call
raise Exception(f"CUDA failure! {error}")
Exception: CUDA failure! 35

Steps/Code to reproduce bug

Reproduction Code

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_conformer_transducer_large")
output = asr_model.transcribe(["2086-149220-0033.wav"])
print(output)

Environment overview

Component       Version
GPU             NVIDIA A100 80GB PCIe
Driver          570.133.20
CUDA Runtime    12.8
PyTorch         2.9.1+cu128
Torchaudio      2.3.1
NeMo            2.6.0
Python          3.10.14

dorispei commented on Dec 04 '25

FYI, I'm also getting a very similar error with a similar (but not quite identical) setup.

Environment overview

Component       Version
GPU             NVIDIA A100-SXM4-80GB
Driver          570.148.08
CUDA Runtime    12.8
torch           2.8.0+cu128
torchaudio      2.8.0+cu128
NeMo            v2.6.0 tag
Python          3.10.12

Reproducing code

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/multitalker-parakeet-streaming-0.6b-v1")
WAV_FILE = 'notebooks/test_data/test_file.wav'
asr_output = asr_model.transcribe([WAV_FILE])

Stack trace

Traceback (most recent call last):
  File "./gputest.py", line 6, in <module>
    asr_output = asr_model.transcribe([WAV_FILE])
  File "/opt/conda/envs/nemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/models/rnnt_models.py", line 306, in transcribe
    return super().transcribe(
  File "/opt/conda/envs/nemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/mixins/transcription.py", line 270, in transcribe
    for processed_outputs in generator:
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/mixins/transcription.py", line 370, in transcribe_generator
    processed_outputs = self._transcribe_output_processing(model_outputs, transcribe_cfg)
  File "/home/jovyan/NeMo/nemo/collections/asr/models/rnnt_models.py", line 944, in _transcribe_output_processing
    hyp = self.decoding.rnnt_decoder_predictions_tensor(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 717, in rnnt_decoder_predictions_tensor
    hypotheses_list = self.decoding(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 201, in __call__
    return self.forward(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/core/classes/common.py", line 1204, in wrapped_call
    outputs = wrapped(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 760, in forward
    hypotheses = self._greedy_decode(
  File "/opt/conda/envs/nemo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/rnnt_greedy_decoding.py", line 791, in _greedy_decode_blank_as_pad_loop_labels
    batched_hyps, alignments, batched_state = self.decoding_computer(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/label_looping_base.py", line 217, in __call__
    return self.cuda_graphs_impl(
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 686, in cuda_graphs_impl
    self._graph_reinitialize(encoder_output, encoder_output_length)
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 863, in _graph_reinitialize
    self._full_graph_compile()
  File "/home/jovyan/NeMo/nemo/collections/asr/parts/submodules/transducer_decoding/rnnt_label_looping.py", line 946, in _full_graph_compile
    capture_status, _, graph, _, _, _ = cu_call(
  File "/home/jovyan/NeMo/nemo/core/utils/cuda_python_utils.py", line 101, in cu_call
    raise Exception(f"CUDA failure! {error}")
Exception: CUDA failure! 35

utunga commented on Dec 09 '25

Workarounds

  1. Yes, downgrading NeMo to v2.5.3 fixes the problem. Unfortunately, since I want to work with the latest multitalker models, which were added in v2.6.0, that is not an option for me.

  2. Adding

asr_model.decoding.decoding.decoding_computer.disable_cuda_graphs()

right before calling transcribe also avoids the problem, presumably at the expense of the CUDA graph decoding speedup from https://arxiv.org/abs/2406.06220 😢 (see the sketch below).
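
For reference, a minimal end-to-end sketch of workaround 2, using the model and wav file from the original report (adjust the path to your own audio):

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_conformer_transducer_large")

# Disable the CUDA-graph decoding path before transcribing; the decoder then
# falls back to the regular (non-graph) label-looping loop, which avoids the
# failing graph capture at the cost of slower decoding.
asr_model.decoding.decoding.decoding_computer.disable_cuda_graphs()

output = asr_model.transcribe(["2086-149220-0033.wav"])
print(output)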

utunga commented on Dec 09 '25

Thank you so much! Downgrading NeMo to v2.5.3 solved the issue for me.

dorispei commented on Dec 10 '25

@dorispei, @utunga Please check https://github.com/NVIDIA-NeMo/NeMo/pull/15173. That PR adds a fallback option to use native PyTorch CUDA graphs if full graph compilation fails. It should be a bit slower than the default full CUDA graphs, but it still preserves most of the speed (unlike the workaround of disabling CUDA graphs entirely).
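
For readers unfamiliar with the two code paths (my rough understanding, not a description of the PR's exact implementation): the "full" path captures the whole decoding loop, control flow included, through cuda-python, and that capture is what fails here with error 35; "native PyTorch CUDA graphs" refers to capturing fixed-shape work with torch.cuda.CUDAGraph and replaying it, while the loop itself stays in Python. A minimal, generic sketch of the latter, using a toy computation rather than anything from NeMo:

import torch

# Toy fixed-shape computation standing in for one decoder step.
layer = torch.nn.Linear(256, 256).cuda().eval()
static_input = torch.randn(8, 256, device="cuda")

# Warm up on a side stream before capture, as recommended in the PyTorch docs.
side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream), torch.no_grad():
    for _ in range(3):
        layer(static_input)
torch.cuda.current_stream().wait_stream(side_stream)

# Capture one step into a graph; replay it on new data by copying into the
# static input tensor (shapes and addresses must stay fixed between replays).
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = layer(static_input)

static_input.copy_(torch.randn(8, 256, device="cuda"))
graph.replay()
print(static_output.shape)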

artbataev commented on Dec 10 '25

Thanks @artbataev, that looks great. I see that it has been merged to main, so I did a git pull (from this repo) and commented out my workaround:

# commented out - so I guess CUDA graphs are enabled now?
# asr_model.decoding.decoding.decoding_computer.disable_cuda_graphs()

Looks like it still failed to compile in this context (doing a finetune), though. Cf. the warning in the log below:


[NeMo I 2025-12-13 02:59:52 asr_model:209] CUDA graphs disabled for EncDecMultiTalkerRNNTBPEModel::RNNTBPEDecoding::GreedyBatchedRNNTInfer
Epoch 0: | 1/? [00:04<00:00, 0.22it/s, v_num=fchv, train_step_timing in s=3.770]
[NeMo I 2025-12-13 03:01:44 asr_model:224] CUDA graphs enabled for EncDecMultiTalkerRNNTBPEModel::RNNTBPEDecoding::GreedyBatchedRNNTInfer
[NeMo W 2025-12-13 03:01:45 rnnt_label_looping:688] Full CUDA graph compilation failed: CUDA failure! 35. Falling back to native PyTorch CUDA graphs. Decoding will be slower.
[NeMo I 2025-12-13 03:01:48 metric:549]

So it looks like training is going ahead, just with slower decoding.

Appreciate the work on this, thought I'd provide the feedback. Thanks!

utunga commented on Dec 12 '25