No matching voice found
echogarden align a_km-KH.mp3 a_km-KH.txt a_km-KH.srt --language=km --subtitles.mode=line --subtitles.maxAddedDuration=0.18 --engine=dtw-ra --recognition.engine=whisper --recognition.whisper.model=base
I keep getting the error: "Synthesize ground-truth transcript with eSpeak.. Error: No matching voice found". How can I fix this?
dtw-ra uses eSpeak as part of the alignment process, to synthesize reference speech audio.
Looking at the eSpeak voice list, I don't see the language code km (Khmer), nor khm (the other possible language code for Khmer), so it can't generate the reference speech needed to complete the alignment.
How can I solve this problem? Using another tool? Or a different engine? Currently, all I want is to align and generate the srt file. Thanks for your reply.
The whisper alignment engine doesn't rely on eSpeak and the Whisper multilingual models do support Khmer, so it should work.
It's kind of slow on Echogarden v2.x.x, though, since it uses the ONNX runtime-based implementation.
In Echogarden v3.0.0 (not released yet, but very close), there are substantial improvements to its speed since I switched its underlying engine to use the whisper.cpp C++ library internally (not using ONNX runtime anymore for Whisper). It will have the same speed as whisper.cpp.
I've used the whisper alignment engine before, but it's inaccurate and slow. Is my command incorrect?
echogarden align a_km-KH.mp3 a_km-KH.txt a_km-KH.srt --language=km --subtitles.mode=line --subtitles.maxAddedDuration=0.18 --engine=whisper --whisper.model=large-v3-turbo
From my experience, large-v3-turbo doesn't perform very well for many tasks.
tiny, base or small usually produce the best results.
Actually, in my experience, for general transcription, small usually produces better results than large-v3-turbo.
For the whisper alignment engine, you simply don't need a large model. There's no speech recognition being done. It uses a form of "forced" decoding of the transcript you give it, and then extracts timestamps.
In version v3.0.0 the speed will be as fast as, or faster than, whisper.cpp for the same model and input. The change is very significant, both for CPU and GPU, up to 10x or more at times. I'll try to accelerate the release to be sooner, though the new version may introduce new bugs I didn't catch while testing.
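As a rough illustration of the point about model size, here is a small Python sketch that assembles the same align command with a smaller model. The helper name build_align_command and its defaults are made up for this example; the flags are only the ones already used in the command quoted above, and "small" is just one of the suggested sizes, not a verified best choice for Khmer.

```python
import shutil
import subprocess


def build_align_command(audio, transcript, output, language="km", model="small"):
    """Assemble an `echogarden align` argument list (hypothetical helper).

    Only flags that already appear in this thread are used.
    """
    return [
        "echogarden", "align", audio, transcript, output,
        f"--language={language}",
        "--subtitles.mode=line",
        "--engine=whisper",
        f"--whisper.model={model}",
    ]


if __name__ == "__main__":
    cmd = build_align_command("a_km-KH.mp3", "a_km-KH.txt", "a_km-KH.srt")
    print(" ".join(cmd))
    # Only run the command if an echogarden executable is found on PATH.
    exe = shutil.which("echogarden")
    if exe:
        subprocess.run([exe] + cmd[1:], check=True)
```

Swapping model="small" for "tiny" or "base" gives the other suggested variants without retyping the whole command.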
I've used Base, but the alignment results are also very poor. Currently, I don't need to transcribe, just align. (I have a project that requires alignment, so I hope there is a solution. Thank you.)
I don't know, maybe Whisper models simply aren't that good at producing accurate Khmer timestamps.
If you could possibly send an example pair of an audio file and transcript, I can try to test it locally and see what kind of results it gets with different parameters.
https://drive.google.com/drive/folders/137DlCcGU7A3leZfL72E7EH-9nFhigJAk?usp=sharing
thx
Yes, Whisper is producing nonsense transcriptions for this audio, even when the language is specified correctly. So, if it can't produce a decent transcription, it can't align well either.
I tested with whisper-cli (the whisper.cpp command line tool) to confirm that. It produced:
main: processing 'Khmer1.wav' (5346342 samples, 334.1 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = km, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:30.000] ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ ឍ
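Output like the segment above can also be flagged automatically before attempting alignment. The following is only a sketch: the function names, the 0.8 threshold and the 10-token minimum are arbitrary assumptions for illustration, not something whisper.cpp or Echogarden provides.

```python
from collections import Counter


def dominant_token_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens equal to the most frequent token."""
    tokens = text.split()
    if not tokens:
        return 0.0
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return most_common_count / len(tokens)


def looks_degenerate(text: str, threshold: float = 0.8, min_tokens: int = 10) -> bool:
    """Heuristic: long output dominated by a single repeated token.

    threshold and min_tokens are assumed values, chosen only for illustration.
    """
    tokens = text.split()
    return len(tokens) >= min_tokens and dominant_token_ratio(text) >= threshold


if __name__ == "__main__":
    segment = "ឍ " * 50  # same shape as the repeated output shown above
    print(looks_degenerate(segment))
```

A segment that trips this check is a hint that the model isn't really decoding the audio, so timestamps derived from it are unlikely to be usable.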
The nonsense output is likely because Khmer is not one of Whisper's fully supported languages (official documentation).
These languages are fully supported:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
For other languages, it says:
While the underlying model was trained on 98 languages, we only list the languages that exceeded <50% word error rate (WER)
So Khmer most likely didn't meet that threshold, i.e. its word error rate is probably 50% or higher (meaning results are poor in practice).
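For reference, word error rate is usually computed as the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. A minimal sketch, assuming whitespace-separated words; Khmer text normally isn't written with spaces between words, so a real measurement would need a word segmenter or a character-level error rate instead:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a 4-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

A WER of 0.5 means roughly one in two reference words is wrong, missing, or offset by an extra word, which is why results at or beyond that level are of little use for alignment.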