
Can I get phonemes or syllables with whisperX?

Open TNBT12g opened this issue 1 year ago • 6 comments

Hello everyone, I am looking for a way to get the syllables or phonemes of English words. Does whisperX provide this feature? If you have any experience, please give me some guidance. Thank you so much.

TNBT12g avatar Aug 14 '23 11:08 TNBT12g

I need it too.

chaoqingshuai avatar Sep 05 '23 01:09 chaoqingshuai

I made a Hugging Face space that does a similar thing, using a phoneme wav2vec2 model (although I didn't include whisper, just manual transcription): https://huggingface.co/spaces/gilkeyio/PhonemeForcedAlignment. Not sure how straightforward it would be, or how open @m-bain would be to integrating something like this into whisperX, but the process is basically the same.
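
For reference, here's a minimal sketch of how a phoneme wav2vec2 model can be queried directly via transformers. It assumes the facebook/wav2vec2-xlsr-53-espeak-cv-ft checkpoint (the same one used for alignment further down in this thread) and 16 kHz mono audio; forced alignment then works on top of CTC output like this:

import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Phoneme-level CTC model (assumed checkpoint; swap in your own if needed)
model_name = "facebook/wav2vec2-xlsr-53-espeak-cv-ft"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# The model expects 16 kHz mono audio
speech, _ = librosa.load("pyre.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decode gives a space-separated IPA phoneme string
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids))  # e.g. ['aɪ n oʊ n aʊ ...']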

gilkeyio avatar Sep 22 '23 15:09 gilkeyio

Actually, whisperX can also be used to get these, with only a couple of small changes!

  • Use an alignment model that has been fine-tuned to output phonemes, like this one from Hugging Face.
  • Before doing the alignment step on the whisper output, use phonemizer to get a phoneme version of the transcript, e.g. "I know now what brings me to the pyre" becomes "aɪ noʊ naʊ wʌt bɹɪŋz miː tə ðə paɪɚ" (see the short sketch below).
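
For example, phonemizer can be called directly like this (a quick sketch; it assumes the espeak backend, which needs espeak-ng installed on the system):

from phonemizer import phonemize

# espeak backend, US English IPA; strip=True drops the trailing separator
ipa = phonemize("I know now what brings me to the pyre",
                language="en-us", backend="espeak", strip=True)
print(ipa)  # e.g. 'aɪ noʊ naʊ wʌt bɹɪŋz miː tə ðə paɪɚ'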

I think both of these should work pretty well in non-English languages without any modification, but I'm not sure.

import whisperx
from phonemizer import phonemize


device = "cpu"
audio_file = "pyre.wav"
batch_size = 16 # reduce if low on GPU mem
compute_type = "int8"

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("small", device, compute_type=compute_type, language="en")

audio = whisperx.load_audio(audio_file)

transcript = model.transcribe(audio, batch_size=batch_size)["segments"]
print(transcript) # 'I know now what brings me to the pyre'

# Use phonemize to get the transcript in terms of phonemes
phone_transcript = [
    {"text": phonemize(segment["text"]), "start": segment["start"], "end": segment["end"]}
    for segment in transcript
]

print(phone_transcript) # 'aɪ noʊ naʊ wʌt bɹɪŋz miː tə ðə paɪɚ'

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(model_name="facebook/wav2vec2-xlsr-53-espeak-cv-ft", language_code="en", device=device)
print(metadata)

result = whisperx.align(phone_transcript, model_a, metadata, audio, device, return_char_alignments=True)

print(result["segments"]) # after alignment

#[{'start': 0.469, 'end': 2.472, 'text': 'aɪ noʊ naʊ wʌt bɹɪŋz miː tə ðə paɪɚ', 'words': [{'word': 'aɪ', 'start': 0.469, 'end': 0.609, 'score': 0.708}, {'word': 'noʊ', 'start': 0.609, 'end': 0.89, 'score': 0.694}, {'word': 'naʊ', 'start': 0.89, 'end': 1.21, 'score': 0.525}, {'word': 'wʌt', 'start': 1.21, 'end': 1.411, 'score': 0.644}, {'word': 'bɹɪŋz', 'start': 1.411, 'end': 1.731, 'score': 0.929}, {'word': 'miː', 'start': 1.731, 'end': 1.891, 'score': 0.894}, {'word': 'tə', 'start': 1.891, 'end': 2.012, 'score': 0.864}, {'word': 'ðə', 'start': 2.012, 'end': 2.152, 'score': 0.977}, {'word': 'paɪɚ', 'start': 2.152, 'end': 2.472, 'score': 0.738}], 'chars': [{'char': 'a', 'start': 0.469, 'end': 0.529, 'score': 0.667}, {'char': 'ɪ', 'start': 0.529, 'end': 0.609, 'score': 0.749}, {'char': ' '}, {'char': 'n', 'start': 0.609, 'end': 0.669, 'score': 0.449}, {'char': 'o', 'start': 0.669, 'end': 0.789, 'score': 0.834}, {'char': 'ʊ', 'start': 0.789, 'end': 0.89, 'score': 0.798}, {'char': ' '}, {'char': 'n', 'start': 0.89, 'end': 0.95, 'score': 0.658}, {'char': 'a', 'start': 0.95, 'end': 1.19, 'score': 0.917}, {'char': 'ʊ', 'start': 1.19, 'end': 1.21, 'score': 0.0}, {'char': ' '}, {'char': 'w', 'start': 1.21, 'end': 1.23, 'score': 0.308}, {'char': 'ʌ', 'start': 1.23, 'end': 1.29, 'score': 0.901}, {'char': 't', 'start': 1.29, 'end': 1.411, 'score': 0.722}, {'char': ' '}, {'char': 'b', 'start': 1.411, 'end': 1.451, 'score': 0.986}, {'char': 'ɹ', 'start': 1.451, 'end': 1.491, 'score': 0.939}, {'char': 'ɪ', 'start': 1.491, 'end': 1.591, 'score': 0.775}, {'char': 'ŋ', 'start': 1.591, 'end': 1.651, 'score': 0.981}, {'char': 'z', 'start': 1.651, 'end': 1.731, 'score': 0.964}, {'char': ' '}, {'char': 'm', 'start': 1.731, 'end': 1.771, 'score': 0.977}, {'char': 'i', 'start': 1.771, 'end': 1.891, 'score': 0.81}, {'char': 'ː'}, {'char': ' '}, {'char': 't', 'start': 1.891, 'end': 1.931, 'score': 0.978}, {'char': 'ə', 'start': 1.931, 'end': 2.012, 'score': 0.75}, {'char': ' '}, {'char': 'ð', 'start': 2.012, 'end': 2.052, 'score': 0.974}, {'char': 'ə', 'start': 2.052, 'end': 2.152, 'score': 0.98}, {'char': ' '}, {'char': 'p', 'start': 2.152, 'end': 2.192, 'score': 0.99}, {'char': 'a', 'start': 2.192, 'end': 2.252, 'score': 0.664}, {'char': 'ɪ', 'start': 2.252, 'end': 2.452, 'score': 0.884}, {'char': 'ɚ', 'start': 2.452, 'end': 2.472, 'score': 0.416}, {'char': ' '}]}]
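
If a flat list of per-phoneme timings is more convenient than the nested alignment dict, one way is to post-process the character-level entries, dropping the space separators and any character that didn't receive a timestamp (like the 'ː' above):

# Flatten the character-level alignment into (phoneme, start, end) tuples,
# skipping spaces and characters without timing information
phoneme_timings = [
    (c["char"], c["start"], c["end"])
    for segment in result["segments"]
    for c in segment["chars"]
    if c["char"].strip() and "start" in c
]
print(phoneme_timings)  # [('a', 0.469, 0.529), ('ɪ', 0.529, 0.609), ...]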

Hope this helps! 😄

gilkeyio avatar Sep 26 '23 15:09 gilkeyio

@gilkeyio Thanks bro, I will try your code.

TNBT12g avatar Oct 31 '23 13:10 TNBT12g

@gilkeyio Seems it's not working anymore. Any idea?

sadi304 avatar Mar 28 '24 05:03 sadi304

This is perfect! Thank you very much for both the question and the solution. Worked for me on French audio.

hadware avatar Apr 07 '24 00:04 hadware