
Discrepancy in custom transcribe pipeline vs. `model.transcribe()` for QuartzNet model

Open diarray-hub opened this issue 9 months ago • 10 comments

Describe the bug

I’m attempting to replicate the model.transcribe() method with a custom inference pipeline (for eventual mobile deployment in a Flutter app). However, I’m observing a large discrepancy between the outputs of my custom pipeline and the official model.transcribe() method on the exact same audio file. Specifically:

  1. The transcription text differs significantly.
  2. The Hypothesis.y_sequence shapes and values are drastically different. In the official transcribe pipeline, y_sequence is a continuous tensor of shape [T, D], while in my pipeline it ends up as discrete token indices of shape [T].

I’ve tried calling the same model.decoding.ctc_decoder_predictions_tensor() function and ensuring all arguments match. I also verified that no additional data augmentation or special channel selection is being applied. Yet the results remain inconsistent.

We suspect there’s some hidden post-processing step (beyond decode_hypothesis()) or a difference in how transcribe() manages decoding configuration that we’re not replicating, but we can’t pinpoint where it’s happening.
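
To try to pinpoint it, I dumped the installed implementations directly (a quick sketch using only the standard library; the method names are the ones I found while reading the NeMo source):

import inspect

# Print the installed source of the suspect methods so the exact
# post-processing steps can be read rather than guessed.
print(inspect.getsource(type(model).transcribe))
print(inspect.getsource(type(model)._transcribe_output_processing))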


Steps/Code to reproduce bug

Below is a minimal snippet of how I’m trying to replicate transcribe():

from typing import Tuple

import librosa
import torch

def load_audio(filepath: str, sample_rate: int = 16000) -> Tuple[torch.Tensor, torch.Tensor]:
    # Load mono audio at the model's sample rate and shape it as [B, T]
    audio_np, sr = librosa.load(filepath, sr=sample_rate)
    audio_tensor = torch.tensor(audio_np, dtype=torch.float32).unsqueeze(0)
    length_tensor = torch.tensor([audio_tensor.size(1)], dtype=torch.long)
    return audio_tensor, length_tensor

def transcribe_inference(model, filepath: str, return_hypotheses: bool = False):
    # 1) Load the raw waveform; model.forward() runs the mel-spectrogram
    #    preprocessor internally, so no featurization happens here
    audio_tensor, length_tensor = load_audio(filepath)

    # 2) Forward pass in inference mode
    model.eval()
    with torch.no_grad():
        log_probs, encoded_len, predictions = model.forward(
            input_signal=audio_tensor, input_signal_length=length_tensor
        )

    # 3) Use the same decoding function as NeMo's transcribe()
    hypotheses, _ = model.decoding.ctc_decoder_predictions_tensor(
        decoder_outputs=log_probs,
        decoder_lengths=encoded_len,
        return_hypotheses=return_hypotheses,
    )
    return hypotheses, predictions

# Example usage:
transcriptions, predictions = transcribe_inference(model, "some_audio.wav", return_hypotheses=True)
print("Custom pipeline transcription:", transcriptions[0].text)

And here’s how I call the official method:

result = model.transcribe(["some_audio.wav"], return_hypotheses=True)
print("Official model.transcribe() result:", result[0].text)

Discrepancy:

  • The official model.transcribe() returns a Hypothesis whose y_sequence is shaped [429, 46] (continuous values), along with an accurate transcription.
  • My pipeline yields a y_sequence shaped [429] (discrete indices) and a noticeably less accurate transcription.
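
For reference, the numbers above come from a direct side-by-side comparison along these lines:

# Side-by-side comparison of the two pipelines on the same file
custom_hyps, _ = transcribe_inference(model, "some_audio.wav", return_hypotheses=True)
official_hyps = model.transcribe(["some_audio.wav"], return_hypotheses=True)
print(custom_hyps[0].y_sequence.shape)    # torch.Size([429])
print(official_hyps[0].y_sequence.shape)  # torch.Size([429, 46])
print(custom_hyps[0].text == official_hyps[0].text)  # False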

I tried enabling timestamps, calling change_decoding_strategy(), and checking whether dither or augmentation is disabled (both attempts are sketched below), with no luck so far. It seems my pipeline is missing an internal step that transcribe() performs after forward() but before returning the final Hypothesis.
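
Concretely, these are the attempts referenced above. The attribute paths and the CTCDecodingConfig import are my best guesses from reading the NeMo 2.1 source, so treat them as assumptions rather than the documented API:

# Attempt 1: disable dither/padding on the preprocessor, which transcribe()
# is believed to do internally (attribute path is an assumption)
model.preprocessor.featurizer.dither = 0.0
model.preprocessor.featurizer.pad_to = 0

# Attempt 2: rebuild the decoding strategy with alignments/timestamps enabled
from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig

decoding_cfg = CTCDecodingConfig(preserve_alignments=True, compute_timestamps=True)
model.change_decoding_strategy(decoding_cfg)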


Expected behavior

I expect that by calling the same decoding function (model.decoding.ctc_decoder_predictions_tensor) on the same input, I would get identical or near‑identical text/hypotheses as model.transcribe(). Instead, I’m seeing major differences in both text and the shape/values of Hypothesis.y_sequence.


Environment overview

  • Environment location: Bare-metal
  • Method of NeMo install: pip install nemo_toolkit['asr'] (version 2.1.0)
  • PyTorch version: 2.5.1
  • Python version: 3.10
  • OS: Ubuntu 22.04
  • GPU model: None (using CPU)

Additional context

  • My ultimate goal is to deploy the fine‑tuned QuartzNet model on mobile (Flutter). TorchScript attempts failed with dithering errors, so I pivoted to ONNX (export sketch after this list). I understand we'll have a ton of code to write in Dart to faithfully replicate the model.transcribe() method (preprocessing, post-processing), but before doing that we wanted to replicate transcribe() in Python at a level closer to NeMo internals, to know exactly which pre/post steps to port to Dart.
  • The main confusion: Why does transcribe() produce a [T, D] y_sequence with continuous values, while my pipeline produces a [T] discrete sequence even though I call the same decoding function?
  • Possibly some hidden step in _transcribe_output_processing() or decode_hypothesis() is not triggered in my pipeline, but I've tried calling them manually without success.
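
For the ONNX pivot mentioned in the first bullet, I'm using NeMo's Exportable interface (a minimal sketch; as far as I can tell, the exported graph covers only the encoder/decoder, so the mel‑spectrogram preprocessing and CTC decoding would still have to be reimplemented in Dart):

# Minimal export sketch via NeMo's Exportable mixin; the preprocessor is
# not part of the exported graph (it was the source of the dither errors)
model.export("quartznet.onnx")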

Any guidance on which part of the pipeline I'm missing, or how to replicate model.transcribe() exactly, would be very helpful. If there is a simpler way to integrate NeMo in Flutter, please let me know as well.

Thank you!

diarray-hub · Mar 27 '25 19:03