compute_rnnt_timestamps fails with empty char offsets
Hi,
I am trying to use parakeet-tdt-0.6b-v2 to transcribe audio and obtain word timestamps with confidence scores.
Following the notebook at https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/ASR_Confidence_Estimation.ipynb, I changed the decoding strategy to the following config:
model_type: rnnt
strategy: greedy_batch
compute_hypothesis_token_set: false
preserve_alignments: true
tdt_include_token_duration: null
confidence_cfg:
  preserve_frame_confidence: true
  preserve_token_confidence: true
  preserve_word_confidence: true
  exclude_blank: false
  aggregation: prod
  tdt_include_duration: false
  method_cfg:
    name: max_prob
    entropy_type: gibbs
    alpha: 0.5
    entropy_norm: lin
    temperature: DEPRECATED
fused_batch_size: -1
compute_timestamps: null
compute_langs: false
word_seperator: ' '
segment_seperators:
- .
- '!'
- '?'
segment_gap_threshold: null
rnnt_timestamp_type: all
greedy:
  max_symbols_per_step: 10
  preserve_alignments: false
  preserve_frame_confidence: false
  tdt_include_token_duration: false
  tdt_include_duration_confidence: false
  confidence_method_cfg:
    name: entropy
    entropy_type: tsallis
    alpha: 0.33
    entropy_norm: exp
    temperature: DEPRECATED
  loop_labels: true
  use_cuda_graph_decoder: true
  ngram_lm_model: null
  ngram_lm_alpha: 0.0
beam:
  beam_size: 4
  search_type: default
  score_norm: true
  return_best_hypothesis: true
  tsd_max_sym_exp_per_step: 50
  alsd_max_target_len: 1.0
  nsc_max_timesteps_expansion: 1
  nsc_prefix_alpha: 1
  maes_num_steps: 2
  maes_prefix_alpha: 1
  maes_expansion_gamma: 2.3
  maes_expansion_beta: 2
  language_model: null
  softmax_temperature: 1.0
  preserve_alignments: false
  ngram_lm_model: null
  ngram_lm_alpha: 0.0
  hat_subtract_ilm: false
  hat_ilm_weight: 0.0
temperature: 1.0
durations:
- 0
- 1
- 2
- 3
- 4
big_blank_durations: []
For some audio files, however, I get errors like this:
File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/models/rnnt_models.py", line 307, in transcribe
return super().transcribe(
^^^^^^^^^^^^^^^^^^^
File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 269, in transcribe
for processed_outputs in generator:
^^^^^^^^^
File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 380, in transcribe_generator
processed_outputs = self._transcribe_output_processing(model_outputs, transcribe_cfg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/models/rnnt_models.py", line 945, in _transcribe_output_processing
hyp = self.decoding.rnnt_decoder_predictions_tensor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 561, in rnnt_decoder_predictions_tensor
hypotheses[hyp_idx] = self.compute_rnnt_timestamps(hypotheses[hyp_idx], timestamp_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/silvianacimpian/rnd-transcription/.venv/lib/python3.12/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 836, in compute_rnnt_timestamps
raise ValueError(
ValueError: `char_offsets`: [] and `processed_tokens`: [490, 841] have to be of the same length, but are: `len(offsets)`: 0 and `len(processed_tokens)`: 2
The example I have is an audio clip shorter than 1 s (0.63 s, waveform shape (10177,)). When I transcribe it without timestamps, I get "Yeah." as the output.
Is there anything missing in my setup to get this working properly, or is this a bug related to audio shorter than 1 s?
Looking more in depth, I got to the point where results are added to batchedHyp:
Adding results masked no checks
Active mask: tensor([True], device='cuda:0')
Labels: tensor([490], device='cuda:0')
Time indices: tensor([0], device='cuda:0')
Scores: tensor([35.5236], device='cuda:0')
Token durations: tensor([0], device='cuda:0')
Adding results masked no checks
Active mask: tensor([True], device='cuda:0')
Labels: tensor([841], device='cuda:0')
Time indices: tensor([0], device='cuda:0')
Scores: tensor([42.1788], device='cuda:0')
Token durations: tensor([0], device='cuda:0')
Adding results masked no checks
Active mask: tensor([False], device='cuda:0')
Labels: tensor([1024], device='cuda:0')
Time indices: tensor([7], device='cuda:0')
Scores: tensor([82.3289], device='cuda:0')
Token durations: tensor([1], device='cuda:0')
And here, since the durations of both emitted tokens are 0, an empty torch array is returned.
Could it be that the tokens are so short that the duration ends up being 0?
@nithinraok could you help take a look?
@nithinraok Any updates on this? I'm facing a similar error.
Thanks for reporting this. Is it possible to share the audio sample and the steps to reproduce the bug?
AI-generated solution, please verify
The error you're seeing occurs because the parakeet-tdt-0.6b-v2 model has difficulty generating character offsets for very short audio files (less than 1 second).
When processing your 0.63s audio file, the model successfully recognizes the speech content ("Yeah.") but fails to generate the timestamp information. The compute_rnnt_timestamps function expects each token to have corresponding character offsets, but for very short audio files, these offsets aren't being generated properly.
This is a known limitation with TDT (Token-and-Duration Transducer) models when processing extremely short audio segments. The model needs sufficient audio context to reliably generate timestamp information.
To work around this issue, you can either:
- Process the audio without requesting timestamps (disable compute_timestamps in your config)
- Pad your short audio files to be at least 1-2 seconds long
- Use a different model that doesn't rely on the TDT architecture for very short audio clips
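The padding workaround from the list above could look roughly like the sketch below. It is untested against the model: `pad_to_min_duration` is a hypothetical helper, the 16 kHz sample rate is an assumption (it is consistent with 10177 samples for a 0.63 s clip), and nothing in this thread confirms that padding actually avoids the `char_offsets` error.

```python
def pad_to_min_duration(samples, sample_rate=16000, min_seconds=1.0):
    """Return a copy of `samples` with trailing silence appended so that
    the clip lasts at least `min_seconds`.

    `sample_rate` defaults to 16 kHz, which is an assumption here.
    """
    target_len = int(round(min_seconds * sample_rate))
    missing = max(0, target_len - len(samples))  # 0 if already long enough
    return list(samples) + [0.0] * missing


# Dummy clip with the same number of samples as the 0.63 s example above.
short_clip = [0.1] * 10177
padded = pad_to_min_duration(short_clip)
print(len(short_clip), "->", len(padded))
```

The sketch uses plain Python lists to stay dependency-free. With a NumPy waveform, `np.pad(wave, (0, missing))` would append the same zero padding.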
@nithinraok
Traceback (most recent call last):
File "/mnt/nvme0/aniket.tiwari/parakeet_ft/speech_to_text_hybrid_rnnt_ctc_bpe.py", line 84, in main
trainer.fit(asr_model)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
call._call_and_handle_interrupt(
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
results = self._run_stage()
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
self.fit_loop.run()
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
self.advance()
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
self.advance(data_fetcher)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 183, in run
closure()
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
self._result = self.closure(*args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
step_output = self._step_fn()
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 317, in _training_step
training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 389, in training_step
return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 640, in __call__
wrapper_output = wrapper_module(*args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 633, in wrapped_forward
out = method(*_args, **_kwargs)
File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/utils/model_utils.py", line 477, in wrap_training_step
output_dict = wrapped(*args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/collections/asr/models/hybrid_rnnt_ctc_models.py", line 434, in training_step
self.wer.update(
File "/mnt/nvme0/aniket.tiwari/miniconda3/envs/nemo_new/lib/python3.10/site-packages/torchmetrics/metric.py", line 549, in wrapped_func
update(*args, **kwargs)
File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/collections/asr/metrics/wer.py", line 330, in update
self.decode(predictions, predictions_lengths, predictions_mask, input_ids)
File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/collections/asr/metrics/wer.py", line 270, in <lambda>
self.decode = lambda predictions, predictions_lengths, predictions_mask, input_ids: self.decoding.rnnt_decoder_predictions_tensor(
File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 639, in rnnt_decoder_predictions_tensor
hypotheses[hyp_idx] = self.compute_rnnt_timestamps(hypotheses[hyp_idx], timestamp_type)
File "/mnt/nvme0/aniket.tiwari/parakeet_ft/NeMo/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 927, in compute_rnnt_timestamps
raise ValueError(
ValueError: `char_offsets`: [] and `processed_tokens`: [969] have to be of the same length, but are: `len(offsets)`: 0 and `len(processed_tokens)`: 1
I'm having trouble identifying the offending audio samples while training the model (5.7 million files).
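One possible way to narrow down candidates in a large training set is to scan the manifest for very short clips. This is only a sketch under assumptions: it expects a JSON-lines manifest with `audio_filepath`, `duration`, and `text` fields, `find_short_entries` and the file names are hypothetical, and it assumes short audio is the trigger as in the original report, which has not been confirmed for the training case.

```python
import json


def find_short_entries(manifest_lines, max_seconds=1.0):
    """Yield (line_number, entry) for manifest entries shorter than `max_seconds`.

    Entries without a `duration` field are treated as 0.0 s and are
    therefore reported too.
    """
    for line_no, line in enumerate(manifest_lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if float(entry.get("duration", 0.0)) < max_seconds:
            yield line_no, entry


# Hypothetical two-line manifest for illustration.
example_manifest = [
    '{"audio_filepath": "a.wav", "duration": 0.63, "text": "yeah"}',
    '{"audio_filepath": "b.wav", "duration": 4.20, "text": "a longer utterance"}',
]
short = list(find_short_entries(example_manifest))
for line_no, entry in short:
    print(line_no, entry["audio_filepath"], entry["duration"])
```

For a real manifest, an open file handle can be passed in place of the list, since the function only iterates over lines.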
Any updates on that?
From my observations so far, changing the tokenizer at each run seems to fix it. I encountered this error while fine-tuning parakeet-tdt-ctc-v2, and it only occurs when I do not change the vocabulary. Even changing it to the exact same tokenizer works.
Changing the vocabulary prior to each run somehow avoids the error, but it also means reinitializing the decoder at every run, which makes continued training from checkpoints less effective.
The above result suggests that the problem isn't with characters that the tokenizer might not support. I also tried normalizing my labels, and the error occurs on examples with arbitrary sequence length, so it shouldn't be related to the length of the transcription.
If anyone has a better fix than changing the tokenizer altogether, it may help somebody.
In my case, setting preserve_alignments and compute_timestamps to False solved the error.
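A minimal sketch of that change, using a plain dict as a stand-in for the decoding config. `without_timestamps` is a hypothetical helper, and the dict only mirrors a few keys from the config posted at the top of the thread. In NeMo, the updated config would presumably be applied through the model's `change_decoding_strategy` method, as in the confidence tutorial, but that step is not shown or tested here.

```python
import copy


def without_timestamps(decoding_cfg):
    """Return a copy of the decoding config with alignment and
    timestamp output switched off; the input is left unchanged."""
    cfg = copy.deepcopy(decoding_cfg)
    cfg["preserve_alignments"] = False
    cfg["compute_timestamps"] = False
    return cfg


# Stand-in for the relevant part of the decoding config (assumed subset).
original = {
    "strategy": "greedy_batch",
    "preserve_alignments": True,
    "compute_timestamps": None,
}
patched = without_timestamps(original)
print(patched)
```

Note that this disables timestamps entirely, so it avoids the error rather than fixing the empty `char_offsets` case.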