faster-whisper
Using distil-whisper-large-v3 German Model from HF with faster-whisper?
I want to use the distil-whisper-large-v3-de-kd model from Hugging Face with faster-whisper. The distil-whisper-large-v2 model supports only English, but I need German language support for my project.
Is it possible to directly use the German model with faster-whisper, or does it need to be converted (e.g., with CTranslate2) for compatibility?
@Arche151, no. You need to convert it with CTranslate2. Example:
ct2-transformers-converter --model sanchit-gandhi/distil-whisper-large-v3-de-kd --output_dir distil-whisper-large-v3-de-kd-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
Then, when initializing the Whisper model, you can pass the model path pointing to your converted model:
model = WhisperModel('/tmp/distil-whisper-large-v3-de-kd-ct2', device='cuda')
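For a fuller picture, here is a minimal end-to-end sketch (the audio file name german_sample.mp3 is just a placeholder, not something from this thread):
from faster_whisper import WhisperModel
# load the CTranslate2-converted model from its local directory
model = WhisperModel('/tmp/distil-whisper-large-v3-de-kd-ct2', device='cuda', compute_type='float16')
# transcribe returns a lazy generator of segments plus metadata about the detected language
segments, info = model.transcribe('german_sample.mp3', language='de')
for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')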
Hope that's helpful for you.
@trungkienbkhn okay, thanks a lot for the information and required commands :)
I will give it a go!
Do you maybe also know whether I need a lot of compute or a GPU for the conversion and how long it should take?
@trungkienbkhn So, I converted the model with float16 quantization, and the quality of the transcription compared to the original large-v3 is really bad :(
A lot of words are transcribed incorrectly, some words are just not transcribed at all, and some words are transcribed twice, so there are duplicates.
In my test script I wrote this: model = WhisperModel("/path/distil-whisper-large-v3-de-kd-ct2", device="cpu", compute_type="int8")
Since your model was converted from a Distil model, you should add the option condition_on_previous_text=False when transcribing. For more info, see this comment: https://github.com/SYSTRAN/faster-whisper/pull/557#issuecomment-1837394755
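For illustration, a transcribe call with that option might look roughly like this (the model and the file name audio_de.mp3 are placeholders):
segments, info = model.transcribe(
    'audio_de.mp3',
    language='de',
    beam_size=5,
    condition_on_previous_text=False,  # distil models tend to repeat or drop text when conditioned on the previous window
)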
@trungkienbkhn I did that already, so I don't think that's the issue, unfortunately.
@Arche151 same for me, very poor (and strange) results for the distilled de model, converted with the recommended CTranslate2 command from @trungkienbkhn:
ct2-transformers-converter --model sanchit-gandhi/distil-whisper-large-v3-de-kd --output_dir distil-whisper-large-v3-de-kd-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
faster-distil-de transcript generated with the following parameters:
model = WhisperModel('path_to_distil_de_ct_model', device="cpu", compute_type="int8")
segments, info = model.transcribe("hawry.mp3", beam_size=5, vad_filter=True, language="de", condition_on_previous_text=False)
Here are the first parts of the transcripts from a short interview file.
Timestamps are all over the place.
*Transcript with faster-distilled-de model: Detected language 'de' with probability 1.000000 [3.06s -> 3.18s] , was ist du gerne zum Frühstück? Ich esseier, ich mag Spiegel-Eier. [34.43s -> 34.55s] , kein Kaffee, kein Kaffee, ist mir lieber als ein Kaffee, Eier, Fascheterfleisch? [64.43s -> 94.43s] Zum Frühstück, also ein warmes Frühstück auch, und ein Leber und Lammspießfleisch. [94.43s -> 124.43s] Gepfee, die werden gebraten, die werden gebraten, und es bei uns auch getrocknete Zitronentee.
Here, for comparison, is the beginning of the faster-v3 transcript:
*Transcript with faster-v3 [3.06s -> 5.70s] Hauri, was isst du gerne zum Frühstück? [6.48s -> 16.99s] Ich esse gerne Eier, Käse, Joghurt zum Frühstück. [17.71s -> 20.41s] Und welche Eier? Gekochte Eier? [21.99s -> 25.07s] Ich mag Spiegeleier gerne. [25.25s -> 29.13s] Ah, okay. Gut. Und was trinkst du gerne in der Früh? [29.71s -> 31.79s] Ich trinke schwarzen Tee. [31.97s -> 32.65s] Schwarzen Tee? [32.65s -> 33.79s] Ja, ist mir lieber. [33.79s -> 35.49s] Okay. Keinen Kaffee? [35.99s -> 38.91s] Nein, keinen Kaffee. Ist mir lieber als einen Kaffee. [39.13s -> 45.49s] Ja. Und Hauri, du kommst aus dem Irak. Was isst man dort traditionell zum Frühstück? [45.85s -> 59.03s] Dort isst man auch Käse, Marmelade, Honig, Butter, Eier, verschirrtes Fleisch.
@JuergenFleiss did you check how the original (non-converted) models behave, respectively? Is this really a problem of the conversion with CTranslate2?
OK, I tried it without quantization, and it works like a charm. No idea if that's an option for you, but for me it's great, since I frequently had problems with large-v2 just because of the model size. I used the same line as @JuergenFleiss, just without the --quantization parameter.
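For reference, that would just be the earlier command with the quantization flag dropped, e.g.:
ct2-transformers-converter --model sanchit-gandhi/distil-whisper-large-v3-de-kd --output_dir distil-whisper-large-v3-de-kd-ct2 --copy_files tokenizer.json preprocessor_config.json
(CTranslate2 then keeps the weights in their original precision; a compute_type can still be chosen when loading the model in faster-whisper.)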