WhisperS2T
Added medium and medium.en models for TensorRT-LLM backend
Seems to work for the "medium" and "medium.en" models now with the TensorRT-LLM backend. Fixes #30
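Quick smoke test, following the `load_model` / `transcribe_with_vad` usage pattern from the README (keyword names are from memory, so double-check against the README):

```python
import whisper_s2t

# "medium" / "medium.en" on the TensorRT-LLM backend is what this PR enables.
model = whisper_s2t.load_model(model_identifier="medium", backend='TensorRT-LLM')

files = ['data/KINCAID46/audio/9.mp3']  # any local audio file works here
out = model.transcribe_with_vad(files,
                                lang_codes=['en'],
                                tasks=['transcribe'],
                                initial_prompts=[None],
                                batch_size=16)
print(out[0])  # transcription segments for the first file
```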
Hi @colinator, can you run some WER checks on the medium and medium.en models for the TensorRT-LLM backend? According to the TensorRT-LLM repo, they only support the large model.
You can use these to run the tests:
1. Prepare the env using this script: https://github.com/shashikg/WhisperS2T/blob/main/prepare_benchmark_env.sh
2. Then run this: https://github.com/shashikg/WhisperS2T/blob/main/scripts/benchmark_whisper_s2t.py
Ahoy there. I couldn't find where TensorRT-LLM is only compatible with large-v2. Maybe in the 'builder'? But that's just the builder. Your code seemed to work. The benchmark doesn't calculate WER, right? The transcriptions seem plausible though:
data/KINCAID46/audio/9.mp3 For over a decade, on streets, and in bars, and in living rooms, from Rio to Reykjavik, and everywhere in between, the debate has raged. Lionel Messi, or Christiano Ronaldo. Who is the greatest player in the world? The truth is, we've never seen a rivalry like this. Not in football, not in any of our major sports. But that debate might soon be over, and it might end without Messi or Ronaldo getting the trophy they so glaringly lack, the World Cup. To understand why it matters that two individuals haven't won a team trophy that's only up for grabs every 4 years, you have to know just how dominate Messi and Ronaldo have been. The winner is, Christiano Ronaldo. From 2008 through 2017, the Ballon d'Or, basically the sports most valuable player award, has gone to either Ronaldo or Messi. Five for Ronaldo, five for Messi, none for anyone else. It's not just Ballon d'Or's for Messi and Ronaldo. Messi is the all time leading scorer in La Liga, in the Spanish super ...
You want me to attach the transcription csv files?
Do you have a script that performs WER calculation from the csv outputs? I see your WER function, but I'm not totally clear on any pre-processing (lowercasing, etc.) you do when you calculate it...
Well, here are the outputs. I'm on an RTX 3080, so slower than your results, and batch size is 16 because of memory constraints.
> Do you have a script that performs WER calculation from the csv outputs? I see your WER function, but I'm not totally clear on any pre-processing (lowercasing, etc.) you do when you calculate it...
Hey, yes, I normalize the text and then perform lowercasing as well. Here: https://github.com/shashikg/WhisperS2T/blob/main/tools/text_normalizer.py#L75
Then run this evaluate function on the normalized texts: https://github.com/shashikg/WhisperS2T/blob/main/tools/metrics.py#L68
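In short, the scoring step is basically this (untested sketch using those two modules):

```python
from tools.text_normalizer import TextNormalizer
from tools.metrics import evaluate

normalizer = TextNormalizer()  # handles normalization, including lowercasing

references = ["Lionel Messi, or Cristiano Ronaldo?"]  # ground truth
hypotheses = ["Lionel Messi or Christiano Ronaldo"]   # model output

# Normalize both sides identically before scoring.
references = [normalizer(t) for t in references]
hypotheses = [normalizer(t) for t in hypotheses]

print(evaluate(references, hypotheses))  # WER/IER/DER/SER breakdown
```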
BTW, I quickly checked the output txt files, and the output looks good to me.
Hi @colinator any update?
I got this for medium and medium.en. Card is an RTX 3080, if that matters...
Why is medium.en so much worse? (On the multilingual set, at least, I'd guess it's because medium.en is English-only, hence the huge SER and 5-gram insertion counts there.)
```
results/WhisperS2T-TensorRT-LLM-bs_16_medium

                Dataset        Time
0         KINCAID46 WAV   66.376619
1         KINCAID46 MP3   66.754254
2  MultiLingualLongform  158.262608

KINCAID46_WAV.tsv         {'WER': 9.11, 'IER': 1.56, 'DER': 4.49, 'SER': 3.05, '5-GramInsertions': 35}
KINCAID46_MP3.tsv         {'WER': 9.52, 'IER': 1.58, 'DER': 4.86, 'SER': 3.08, '5-GramInsertions': 31}
MultiLingualLongform.tsv  {'WER': 9.4, 'IER': 3.03, 'DER': 3.19, 'SER': 3.18, '5-GramInsertions': 105}
```
```
results/WhisperS2T-TensorRT-LLM-bs_16_medium.en

                Dataset        Time
0         KINCAID46 WAV   64.459512
1         KINCAID46 MP3   69.381940
2  MultiLingualLongform  212.189986

KINCAID46_WAV.tsv         {'WER': 13.34, 'IER': 1.36, 'DER': 9.09, 'SER': 2.89, '5-GramInsertions': 20}
KINCAID46_MP3.tsv         {'WER': 12.14, 'IER': 1.33, 'DER': 7.93, 'SER': 2.88, '5-GramInsertions': 20}
MultiLingualLongform.tsv  {'WER': 57.83, 'IER': 6.35, 'DER': 7.9, 'SER': 43.58, '5-GramInsertions': 7140}
```
Oh, this is the script that prints it out - might be useful for some bigger pipeline. I'll just paste it here, as I'm not sure if I should add it to this PR yet:
```python
# Run like this, from the repo root:
#   python -m scripts.print_wer_results --results_path results

import os
import argparse
from typing import Optional

import pandas as pd

from tools.metrics import evaluate
from tools.text_normalizer import TextNormalizer


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--results_path', default="results", type=str)
    return parser.parse_args()


def results_from_tsv(path_to_tsv: str, normalize: Optional[TextNormalizer]):
    # Each benchmark TSV pairs the reference text with the model's prediction.
    df = pd.read_csv(path_to_tsv, sep="\t")
    references = df['raw_text'].to_list()
    hypotheses = df['pred_text'].to_list()
    # Normalize both sides identically before scoring.
    if normalize:
        references = [normalize(t) for t in references]
        hypotheses = [normalize(t) for t in hypotheses]
    return evaluate(references, hypotheses)


def print_results_in_dir(path_to_dir: str, filenames: list[str], normalize: Optional[TextNormalizer]):
    print()
    print(path_to_dir)
    # infer_time.tsv records the inference time per dataset.
    print(pd.read_csv(os.path.join(path_to_dir, "infer_time.tsv"), sep="\t"))
    for tsv in filenames:
        print(tsv, results_from_tsv(os.path.join(path_to_dir, tsv), normalize))


if __name__ == "__main__":
    args = parse_arguments()
    rd = args.results_path
    # One sub-directory per benchmark run, e.g. WhisperS2T-TensorRT-LLM-bs_16_medium.
    results_directories = [os.path.join(rd, d) for d in os.listdir(rd) if os.path.isdir(os.path.join(rd, d))]
    filenames = ["KINCAID46_WAV.tsv", "KINCAID46_MP3.tsv", "MultiLingualLongform.tsv"]
    normalizer = TextNormalizer()
    for rd in results_directories:
        print_results_in_dir(rd, filenames, normalize=normalizer)
```
@shashikg ^^^