WhisperS2T
Added medium and medium.en models for TensorRT-LLM backend
Seems to work for the "medium" and "medium.en" models now with the TensorRT-LLM backend. Fixes #30
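Quick smoke test, following the `load_model` / `transcribe_with_vad` usage pattern from the README (keyword names are from memory, so double-check against the README):

```python
import whisper_s2t

# "medium" / "medium.en" on the TensorRT-LLM backend is what this PR enables.
model = whisper_s2t.load_model(model_identifier="medium", backend='TensorRT-LLM')

files = ['data/KINCAID46/audio/9.mp3']  # any local audio file works here
out = model.transcribe_with_vad(files,
                                lang_codes=['en'],
                                tasks=['transcribe'],
                                initial_prompts=[None],
                                batch_size=16)
print(out[0])  # transcription segments for the first file
```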
Hi @colinator, can you run some WER checks on the medium and medium.en models for the TensorRT-LLM backend? According to the TensorRT-LLM repo, they only support the large model.
You can use these to run the tests:
1. Prepare the env using this script: https://github.com/shashikg/WhisperS2T/blob/main/prepare_benchmark_env.sh
2. Then run this: https://github.com/shashikg/WhisperS2T/blob/main/scripts/benchmark_whisper_s2t.py
Ahoy there. I couldn't find where TensorRT-LLM is only compatible with large-v2. Maybe in the 'builder'? But that's just the builder. Your code seemed to work. The benchmark doesn't calculate WER, right? The transcriptions seem plausible though:
data/KINCAID46/audio/9.mp3 For over a decade, on streets, and in bars, and in living rooms, from Rio to Reykjavik, and everywhere in between, the debate has raged. Lionel Messi, or Christiano Ronaldo. Who is the greatest player in the world? The truth is, we've never seen a rivalry like this. Not in football, not in any of our major sports. But that debate might soon be over, and it might end without Messi or Ronaldo getting the trophy they so glaringly lack, the World Cup. To understand why it matters that two individuals haven't won a team trophy that's only up for grabs every 4 years, you have to know just how dominate Messi and Ronaldo have been. The winner is, Christiano Ronaldo. From 2008 through 2017, the Ballon d'Or, basically the sports most valuable player award, has gone to either Ronaldo or Messi. Five for Ronaldo, five for Messi, none for anyone else. It's not just Ballon d'Or's for Messi and Ronaldo. Messi is the all time leading scorer in La Liga, in the Spanish super ...
You want me to attach the transcription csv files?
Do you have a script that performs WER calculation from the csv outputs? I see your WER function, but I'm not totally clear on any pre-processing (lowercasing, etc.) you do when you calculate it...
Well, here are the outputs. I'm on an RTX 3080, so slower than your results, and batch size is 16 because of memory constraints.
> Do you have a script that performs WER calculation from the csv outputs? I see your WER function, but I'm not totally clear on any pre-processing (lowercasing, etc.) you do when you calculate it...
Hey, yes, I normalize the text and then perform lowercasing as well. Here: https://github.com/shashikg/WhisperS2T/blob/main/tools/text_normalizer.py#L75
Then run this evaluate function on the normalized texts: https://github.com/shashikg/WhisperS2T/blob/main/tools/metrics.py#L68
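In short, the scoring step is basically this (untested sketch using those two modules):

```python
from tools.text_normalizer import TextNormalizer
from tools.metrics import evaluate

normalizer = TextNormalizer()  # handles normalization, including lowercasing

references = ["Lionel Messi, or Cristiano Ronaldo?"]  # ground truth
hypotheses = ["Lionel Messi or Christiano Ronaldo"]   # model output

# Normalize both sides identically before scoring.
references = [normalizer(t) for t in references]
hypotheses = [normalizer(t) for t in hypotheses]

print(evaluate(references, hypotheses))  # WER/IER/DER/SER breakdown
```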
BTW, I quickly checked the output txt files, and the output looks good to me.
Hi @colinator any update?
I got this for medium and medium.en. Card is an RTX 3080, if that matters...
Why is medium.en so much worse? (On the multilingual set, at least, I'd guess it's because medium.en is English-only, hence the huge SER and 5-gram insertion counts there.)
```
results/WhisperS2T-TensorRT-LLM-bs_16_medium

                Dataset        Time
0         KINCAID46 WAV   66.376619
1         KINCAID46 MP3   66.754254
2  MultiLingualLongform  158.262608

KINCAID46_WAV.tsv         {'WER': 9.11, 'IER': 1.56, 'DER': 4.49, 'SER': 3.05, '5-GramInsertions': 35}
KINCAID46_MP3.tsv         {'WER': 9.52, 'IER': 1.58, 'DER': 4.86, 'SER': 3.08, '5-GramInsertions': 31}
MultiLingualLongform.tsv  {'WER': 9.4, 'IER': 3.03, 'DER': 3.19, 'SER': 3.18, '5-GramInsertions': 105}
```
```
results/WhisperS2T-TensorRT-LLM-bs_16_medium.en

                Dataset        Time
0         KINCAID46 WAV   64.459512
1         KINCAID46 MP3   69.381940
2  MultiLingualLongform  212.189986

KINCAID46_WAV.tsv         {'WER': 13.34, 'IER': 1.36, 'DER': 9.09, 'SER': 2.89, '5-GramInsertions': 20}
KINCAID46_MP3.tsv         {'WER': 12.14, 'IER': 1.33, 'DER': 7.93, 'SER': 2.88, '5-GramInsertions': 20}
MultiLingualLongform.tsv  {'WER': 57.83, 'IER': 6.35, 'DER': 7.9, 'SER': 43.58, '5-GramInsertions': 7140}
```
Oh, this is the script that prints it out - might be useful for some bigger pipeline. I'll just paste it here, as I'm not sure if I should add it to this PR yet:
```python
# Run like this, from the repo root:
#   python -m scripts.print_wer_results --results_path results

import os
import argparse
from typing import Optional

import pandas as pd

from tools.metrics import evaluate
from tools.text_normalizer import TextNormalizer


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--results_path', default="results", type=str)
    return parser.parse_args()


def results_from_tsv(path_to_tsv: str, normalize: Optional[TextNormalizer]):
    # Each benchmark TSV pairs the reference text with the model's prediction.
    df = pd.read_csv(path_to_tsv, sep="\t")
    references = df['raw_text'].to_list()
    hypotheses = df['pred_text'].to_list()
    # Normalize both sides identically before scoring.
    if normalize:
        references = [normalize(t) for t in references]
        hypotheses = [normalize(t) for t in hypotheses]
    return evaluate(references, hypotheses)


def print_results_in_dir(path_to_dir: str, filenames: list[str], normalize: Optional[TextNormalizer]):
    print()
    print(path_to_dir)
    # infer_time.tsv records the inference time per dataset.
    print(pd.read_csv(os.path.join(path_to_dir, "infer_time.tsv"), sep="\t"))
    for tsv in filenames:
        print(tsv, results_from_tsv(os.path.join(path_to_dir, tsv), normalize))


if __name__ == "__main__":
    args = parse_arguments()
    rd = args.results_path
    # One sub-directory per benchmark run, e.g. WhisperS2T-TensorRT-LLM-bs_16_medium.
    results_directories = [os.path.join(rd, d) for d in os.listdir(rd) if os.path.isdir(os.path.join(rd, d))]
    filenames = ["KINCAID46_WAV.tsv", "KINCAID46_MP3.tsv", "MultiLingualLongform.tsv"]
    normalizer = TextNormalizer()
    for rd in results_directories:
        print_results_in_dir(rd, filenames, normalize=normalizer)
```
@shashikg ^^^