compute_rnnt_timestamps fails with empty char offsets

Open silvianacmp opened this issue 4 months ago • 9 comments

Hi, I am trying to use parakeet-tdt-0.6b-v2 to transcribe audio and obtain word timestamps with confidence scores. Following the Colab notebook at https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/ASR_Confidence_Estimation.ipynb, I have the following config:

Changed decoding strategy to 
    model_type: rnnt
    strategy: greedy_batch
    compute_hypothesis_token_set: false
    preserve_alignments: true
    tdt_include_token_duration: null
    confidence_cfg:
      preserve_frame_confidence: true
      preserve_token_confidence: true
      preserve_word_confidence: true
      exclude_blank: false
      aggregation: prod
      tdt_include_duration: false
      method_cfg:
        name: max_prob
        entropy_type: gibbs
        alpha: 0.5
        entropy_norm: lin
        temperature: DEPRECATED
    fused_batch_size: -1
    compute_timestamps: null
    compute_langs: false
    word_seperator: ' '
    segment_seperators:
    - .
    - '!'
    - '?'
    segment_gap_threshold: null
    rnnt_timestamp_type: all
    greedy:
      max_symbols_per_step: 10
      preserve_alignments: false
      preserve_frame_confidence: false
      tdt_include_token_duration: false
      tdt_include_duration_confidence: false
      confidence_method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
      loop_labels: true
      use_cuda_graph_decoder: true
      ngram_lm_model: null
      ngram_lm_alpha: 0.0
    beam:
      beam_size: 4
      search_type: default
      score_norm: true
      return_best_hypothesis: true
      tsd_max_sym_exp_per_step: 50
      alsd_max_target_len: 1.0
      nsc_max_timesteps_expansion: 1
      nsc_prefix_alpha: 1
      maes_num_steps: 2
      maes_prefix_alpha: 1
      maes_expansion_gamma: 2.3
      maes_expansion_beta: 2
      language_model: null
      softmax_temperature: 1.0
      preserve_alignments: false
      ngram_lm_model: null
      ngram_lm_alpha: 0.0
      hat_subtract_ilm: false
      hat_ilm_weight: 0.0
    temperature: 1.0
    durations:
    - 0
    - 1
    - 2
    - 3
    - 4
    big_blank_durations: []

For some audio files, however, I get errors like this:

  File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/models/rnnt_models.py", line 307, in transcribe
    return super().transcribe(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 269, in transcribe
    for processed_outputs in generator:
                             ^^^^^^^^^
  File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 380, in transcribe_generator
    processed_outputs = self._transcribe_output_processing(model_outputs, transcribe_cfg)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/models/rnnt_models.py", line 945, in _transcribe_output_processing
    hyp = self.decoding.rnnt_decoder_predictions_tensor(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 561, in rnnt_decoder_predictions_tensor
    hypotheses[hyp_idx] = self.compute_rnnt_timestamps(hypotheses[hyp_idx], timestamp_type)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 836, in compute_rnnt_timestamps
    raise ValueError(
ValueError: `char_offsets`: [] and `processed_tokens`: [490, 841] have to be of the same length, but are: `len(offsets)`: 0 and `len(processed_tokens)`: 2

The example I have is an audio clip shorter than 1 s (0.63 s, waveform shape (10177,)). When I transcribe it without timestamps, I get "Yeah." as the output.

Is there anything missing in my setup to get this working properly, or is this a bug related to audio shorter than 1 s?

silvianacmp avatar Aug 07 '25 14:08 silvianacmp

Looking into this in more depth, I got to the point where results are added to batchedHyp:

Adding results masked no checks
Active mask: tensor([True], device='cuda:0')
Labels: tensor([490], device='cuda:0')
Time indices: tensor([0], device='cuda:0')
Scores: tensor([35.5236], device='cuda:0')
Token durations: tensor([0], device='cuda:0')
Adding results masked no checks
Active mask: tensor([True], device='cuda:0')
Labels: tensor([841], device='cuda:0')
Time indices: tensor([0], device='cuda:0')
Scores: tensor([42.1788], device='cuda:0')
Token durations: tensor([0], device='cuda:0')
Adding results masked no checks
Active mask: tensor([False], device='cuda:0')
Labels: tensor([1024], device='cuda:0')
Time indices: tensor([7], device='cuda:0')
Scores: tensor([82.3289], device='cuda:0')
Token durations: tensor([1], device='cuda:0')

And here, since all the durations of the emitted tokens are 0, an empty torch tensor is returned.

Could it be that the tokens are so short that their duration ends up being 0?

silvianacmp avatar Aug 07 '25 16:08 silvianacmp

@nithinraok could you help take a look?

ZhiyuLi-Nvidia avatar Aug 12 '25 17:08 ZhiyuLi-Nvidia

@nithinraok Any updates on this? I'm facing a similar error.

aniket-convinai avatar Sep 01 '25 14:09 aniket-convinai

Thanks for reporting this. Is it possible to share the audio sample and the steps to reproduce the bug?

nithinraok avatar Sep 02 '25 13:09 nithinraok

AI-generated solution, please verify

The error you're seeing occurs because the parakeet-tdt-0.6b-v2 model has difficulty generating character offsets for very short audio files (less than 1 second).

When processing your 0.63s audio file, the model successfully recognizes the speech content ("Yeah.") but fails to generate the timestamp information. The compute_rnnt_timestamps function expects each token to have corresponding character offsets, but for very short audio files, these offsets aren't being generated properly.

This is a known limitation with TDT (Token-and-Duration Transducer) models when processing extremely short audio segments. The model needs sufficient audio context to reliably generate timestamp information.

To work around this issue, you can either:

  1. Process the audio without requesting timestamps (disable compute_timestamps in your config)
  2. Pad your short audio files to be at least 1-2 seconds long
  3. Use a different model that doesn't rely on the TDT architecture for very short audio clips
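
As a rough illustration of option 2, the sketch below right-pads a mono waveform with silence up to a minimum duration before it is passed to the model. The helper name pad_to_min_duration, the 16 kHz sample rate, and the 1-second threshold are assumptions made for this example, and nobody in this thread has confirmed that padding avoids the error.

```python
import numpy as np


def pad_to_min_duration(
    waveform: np.ndarray, sample_rate: int = 16000, min_seconds: float = 1.0
) -> np.ndarray:
    """Right-pad a 1-D mono waveform with zeros (silence) up to min_seconds.

    Hypothetical helper: the name, default sample rate, and threshold are
    assumptions for illustration, not part of NeMo.
    """
    min_samples = int(round(min_seconds * sample_rate))
    missing = min_samples - waveform.shape[0]
    if missing <= 0:
        # Already long enough, return unchanged.
        return waveform
    return np.pad(waveform, (0, missing), mode="constant")


# A clip with the same shape as the one in the original report.
short_clip = np.zeros(10177, dtype=np.float32)
padded = pad_to_min_duration(short_clip)
print(padded.shape)  # (16000,)
```

Padding on the right keeps the original timestamps valid, since no audio is shifted in time.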

zhenyih avatar Sep 03 '25 23:09 zhenyih

@nithinraok

Traceback (most recent call last):
  File "/mnt/nvme0/aniket.tiwari/parakeet_ft/speech_to_text_hybrid_rnnt_ctc_bpe.py", line 84, in main
    trainer.fit(asr_model)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 183, in run
    closure()
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 317, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 389, in training_step
    return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 640, in __call__
    wrapper_output = wrapper_module(*args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 633, in wrapped_forward
    out = method(*_args, **_kwargs)
  File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/utils/model_utils.py", line 477, in wrap_training_step
    output_dict = wrapped(*args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/collections/asr/models/hybrid_rnnt_ctc_models.py", line 434, in training_step
    self.wer.update(
  File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torchmetrics/metric.py", line 549, in wrapped_func
    update(*args, **kwargs)
  File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/collections/asr/metrics/wer.py", line 330, in update
    self.decode(predictions, predictions_lengths, predictions_mask, input_ids)
  File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/collections/asr/metrics/wer.py", line 270, in <lambda>
    self.decode = lambda predictions, predictions_lengths, predictions_mask, input_ids: self.decoding.rnnt_decoder_predictions_tensor(
  File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 639, in rnnt_decoder_predictions_tensor
    hypotheses[hyp_idx] = self.compute_rnnt_timestamps(hypotheses[hyp_idx], timestamp_type)
  File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 927, in compute_rnnt_timestamps
    raise ValueError(
ValueError: `char_offsets`: [] and `processed_tokens`: [969] have to be of the same length, but are: `len(offsets)`: 0 and `len(processed_tokens)`: 1

I'm having trouble identifying the offending audio samples while training the model (5.7M files).
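
One way to narrow down candidates in a large dataset is to scan the training manifest for very short utterances. The sketch below assumes a NeMo-style JSON-lines manifest in which each line has a duration field (in seconds). The function name find_short_utterances and the 1-second threshold are hypothetical, and short duration is only one suspected trigger in this thread, so treat the result as a list of candidates rather than a diagnosis.

```python
import json
from typing import Iterable, Iterator


def find_short_utterances(
    manifest_lines: Iterable[str], min_duration: float = 1.0
) -> Iterator[dict]:
    """Yield manifest entries whose duration is below min_duration seconds.

    Hypothetical helper: assumes one JSON object per line with a
    "duration" key. Entries without that key are skipped.
    """
    for line_no, line in enumerate(manifest_lines, start=1):
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        duration = entry.get("duration")
        if duration is not None and float(duration) < min_duration:
            # Keep the line number so the entry is easy to locate later.
            yield {"line": line_no, **entry}


# Tiny in-memory example; the file names are made up.
example = [
    '{"audio_filepath": "a.wav", "duration": 0.63, "text": "yeah"}',
    '{"audio_filepath": "b.wav", "duration": 4.2, "text": "hello there"}',
]
short = list(find_short_utterances(example))
print(short)
```

For a real manifest, pass the open file object directly, for example `with open(path, encoding="utf-8") as f: hits = list(find_short_utterances(f))`, so the file is streamed line by line instead of being loaded into memory.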

aniket-convinai avatar Sep 12 '25 04:09 aniket-convinai

Any updates on that?

diarray-hub avatar Oct 29 '25 03:10 diarray-hub

From my observations so far, it looks like changing the tokenizer before each run fixes it. I encountered this error while trying to fine-tune parakeet-tdt-ctc-v2, and I only run into it when I do not change the vocabulary. Even if I change it to the exact same tokenizer, it works.

Changing the vocabulary prior to each run somehow avoids the error, but that also means reinitializing the decoder at every run, which makes continued training from checkpoints less effective.

The above result shows that the problem isn't caused by characters that the tokenizer might not support. I also tried normalizing my labels, and the error occurs on examples with arbitrary sequence lengths (thus it shouldn't be related to the length of the transcription).

If anyone has a better fix than changing the tokenizer altogether, please share it; it may help somebody.

diarray-hub avatar Oct 30 '25 11:10 diarray-hub

In my case, setting preserve_alignments and compute_timestamps to False solved the error.
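
A minimal sketch of that change, written against a plain dictionary so it runs without NeMo installed. The helper name without_timestamps is hypothetical. How the patched config is applied to the model (for example through the model's change_decoding_strategy method) depends on your setup and is not shown here.

```python
import copy


def without_timestamps(decoding_cfg: dict) -> dict:
    """Return a copy of the decoding config with timestamps disabled.

    Hypothetical helper: only flips the two top-level keys named in this
    thread and leaves every other setting untouched.
    """
    cfg = copy.deepcopy(decoding_cfg)
    cfg["preserve_alignments"] = False
    cfg["compute_timestamps"] = False
    return cfg


# A small subset of the decoding config posted at the top of the thread.
original = {
    "strategy": "greedy_batch",
    "preserve_alignments": True,
    "compute_timestamps": None,
}
patched = without_timestamps(original)
print(patched)
```

Note the trade-off: with these two options off, the error path in compute_rnnt_timestamps is not reached, but word timestamps are no longer produced either.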

aniket-convinai avatar Dec 10 '25 04:12 aniket-convinai