whisperX
Can I get phonemes or syllables with whisperX?
Hello everyone, I am looking for a way to get the syllables or phonemes of English words. Does whisperX provide this feature? If you have any experience with this, please point me to a guide. Thank you so much.
I need it too.
I made a Hugging Face Space that does something similar, using a phoneme wav2vec2 model (although I didn't include whisper, just manual transcription): https://huggingface.co/spaces/gilkeyio/PhonemeForcedAlignment. I'm not sure how straightforward it would be, or how open @m-bain would be, to integrate something like this into whisperX, but the process is basically the same.
Actually, whisperX itself can produce these with only two small changes!
- Use an alignment model that has been fine-tuned to output phonemes, such as facebook/wav2vec2-xlsr-53-espeak-cv-ft from Hugging Face.
- Before doing the alignment step on the whisper output, use phonemizer to get a phoneme version of the transcript. For example, "I know now what brings me to the pyre" becomes "aɪ noʊ naʊ wʌt bɹɪŋz miː tə ðə paɪɚ".
I think both of these should work reasonably well for non-English languages without any modification, but I haven't verified that.
import whisperx
from phonemizer import phonemize
device = "cpu"
audio_file = "pyre.wav"
batch_size = 16 # reduce if low on GPU mem
compute_type = "int8"
# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("small", device, compute_type=compute_type, language="en")
audio = whisperx.load_audio(audio_file)
transcript = model.transcribe(audio, batch_size=batch_size)["segments"]
print(transcript) # 'I know now what brings me to the pyre'
# Use phonemize to get the transcript in terms of phonemes
phone_transcript = [
    {"text": phonemize(segment["text"]), "start": segment["start"], "end": segment["end"]}
    for segment in transcript
]
print(phone_transcript) # 'aɪ noʊ naʊ wʌt bɹɪŋz miː tə ðə paɪɚ'
# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(model_name="facebook/wav2vec2-xlsr-53-espeak-cv-ft", language_code="en", device=device)
print(metadata)
result = whisperx.align(phone_transcript, model_a, metadata, audio, device, return_char_alignments=True)
print(result["segments"]) # after alignment
#[{'start': 0.469, 'end': 2.472, 'text': 'aɪ noʊ naʊ wʌt bɹɪŋz miː tə ðə paɪɚ', 'words': [{'word': 'aɪ', 'start': 0.469, 'end': 0.609, 'score': 0.708}, {'word': 'noʊ', 'start': 0.609, 'end': 0.89, 'score': 0.694}, {'word': 'naʊ', 'start': 0.89, 'end': 1.21, 'score': 0.525}, {'word': 'wʌt', 'start': 1.21, 'end': 1.411, 'score': 0.644}, {'word': 'bɹɪŋz', 'start': 1.411, 'end': 1.731, 'score': 0.929}, {'word': 'miː', 'start': 1.731, 'end': 1.891, 'score': 0.894}, {'word': 'tə', 'start': 1.891, 'end': 2.012, 'score': 0.864}, {'word': 'ðə', 'start': 2.012, 'end': 2.152, 'score': 0.977}, {'word': 'paɪɚ', 'start': 2.152, 'end': 2.472, 'score': 0.738}], 'chars': [{'char': 'a', 'start': 0.469, 'end': 0.529, 'score': 0.667}, {'char': 'ɪ', 'start': 0.529, 'end': 0.609, 'score': 0.749}, {'char': ' '}, {'char': 'n', 'start': 0.609, 'end': 0.669, 'score': 0.449}, {'char': 'o', 'start': 0.669, 'end': 0.789, 'score': 0.834}, {'char': 'ʊ', 'start': 0.789, 'end': 0.89, 'score': 0.798}, {'char': ' '}, {'char': 'n', 'start': 0.89, 'end': 0.95, 'score': 0.658}, {'char': 'a', 'start': 0.95, 'end': 1.19, 'score': 0.917}, {'char': 'ʊ', 'start': 1.19, 'end': 1.21, 'score': 0.0}, {'char': ' '}, {'char': 'w', 'start': 1.21, 'end': 1.23, 'score': 0.308}, {'char': 'ʌ', 'start': 1.23, 'end': 1.29, 'score': 0.901}, {'char': 't', 'start': 1.29, 'end': 1.411, 'score': 0.722}, {'char': ' '}, {'char': 'b', 'start': 1.411, 'end': 1.451, 'score': 0.986}, {'char': 'ɹ', 'start': 1.451, 'end': 1.491, 'score': 0.939}, {'char': 'ɪ', 'start': 1.491, 'end': 1.591, 'score': 0.775}, {'char': 'ŋ', 'start': 1.591, 'end': 1.651, 'score': 0.981}, {'char': 'z', 'start': 1.651, 'end': 1.731, 'score': 0.964}, {'char': ' '}, {'char': 'm', 'start': 1.731, 'end': 1.771, 'score': 0.977}, {'char': 'i', 'start': 1.771, 'end': 1.891, 'score': 0.81}, {'char': 'ː'}, {'char': ' '}, {'char': 't', 'start': 1.891, 'end': 1.931, 'score': 0.978}, {'char': 'ə', 'start': 1.931, 'end': 2.012, 'score': 0.75}, {'char': ' '}, {'char': 
# 'ð', 'start': 2.012, 'end': 2.052, 'score': 0.974}, {'char': 'ə', 'start': 2.052, 'end': 2.152, 'score': 0.98}, {'char': ' '}, {'char': 'p', 'start': 2.152, 'end': 2.192, 'score': 0.99}, {'char': 'a', 'start': 2.192, 'end': 2.252, 'score': 0.664}, {'char': 'ɪ', 'start': 2.252, 'end': 2.452, 'score': 0.884}, {'char': 'ɚ', 'start': 2.452, 'end': 2.472, 'score': 0.416}, {'char': ' '}]}]
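If you want a flat list of timed phonemes rather than raw characters, the 'chars' entries can be post-processed. Here is a minimal sketch, not part of whisperX: `chars_to_phonemes` is a hypothetical helper, and it assumes that spaces are only word separators and that a character without timestamps (like the length mark 'ː' in the output above) belongs to the symbol before it. It treats every IPA character as its own unit, so a diphthong such as "aɪ" stays as two entries.

```python
def chars_to_phonemes(chars):
    """Collapse whisperX-style char alignments into timed phoneme entries.

    Assumptions (hypothetical helper, not a whisperX API):
    - whitespace chars mark word boundaries and are skipped
    - a char with no 'start' key modifies the previous symbol and is merged into it
    """
    phonemes = []
    for entry in chars:
        symbol = entry["char"]
        if symbol.isspace():
            continue  # word boundary, not a phoneme
        if "start" not in entry:
            # Untimed modifier (e.g. 'ː'): attach it to the preceding phoneme, if any
            if phonemes:
                phonemes[-1]["phoneme"] += symbol
            continue
        phonemes.append(
            {
                "phoneme": symbol,
                "start": entry["start"],
                "end": entry["end"],
                "score": entry["score"],
            }
        )
    return phonemes


# Excerpt of the 'chars' output above, covering "miː tə"
sample_chars = [
    {"char": "m", "start": 1.731, "end": 1.771, "score": 0.977},
    {"char": "i", "start": 1.771, "end": 1.891, "score": 0.81},
    {"char": "ː"},
    {"char": " "},
    {"char": "t", "start": 1.891, "end": 1.931, "score": 0.978},
    {"char": "ə", "start": 1.931, "end": 2.012, "score": 0.75},
]

phonemes = chars_to_phonemes(sample_chars)
for p in phonemes:
    print(p["phoneme"], p["start"], p["end"])
```

On the full result you would call it as `chars_to_phonemes(result["segments"][0]["chars"])`.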
Hope this helps! 😄
@gilkeyio Thanks bro, I will try your code.
@gilkeyio It seems it's not working anymore. Any idea why?
This is perfect! Thank you very much for both the question and the solution. It worked for me on French audio.
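For anyone else trying another language, the only step that should need a language setting is the phonemization. A rough sketch of how that step could be factored out so the phonemizer is swappable: `phonemize_segments` and `fake_phonemizer` are hypothetical names used for illustration, and the stand-in phonemizer is only there so the snippet runs without espeak installed.

```python
def phonemize_segments(segments, to_phonemes):
    """Replace each segment's text with its phoneme string, keeping the timestamps.

    `to_phonemes` is any callable mapping a text string to a phoneme string.
    """
    return [
        {"text": to_phonemes(seg["text"]), "start": seg["start"], "end": seg["end"]}
        for seg in segments
    ]


def fake_phonemizer(text):
    # Stand-in only: a real run would call phonemizer here instead
    return text.strip().lower()


segments = [{"text": " Bonjour", "start": 0.0, "end": 0.8}]
phone_segments = phonemize_segments(segments, fake_phonemizer)
print(phone_segments)
```

With the real library you would pass something like `functools.partial(phonemize, language="fr-fr")` as `to_phonemes`. The "fr-fr" code is my assumption for the espeak French voice, so check the languages your espeak install supports. The resulting list can then go to `whisperx.align` in place of `phone_transcript`.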