
Can I get phonemes or syllables with whisperX?

Open TNBT12g opened this issue 1 year ago • 6 comments

Hello everyone, I am looking for a way to get the syllables or phonemes of English words. Does whisperX provide this feature? If you have any experience, please give me some guidance. Thank you so much.

TNBT12g avatar Aug 14 '23 11:08 TNBT12g

I need it too.

chaoqingshuai avatar Sep 05 '23 01:09 chaoqingshuai

I made a Hugging Face space that does a similar thing, using a phoneme wav2vec2 model (although I didn't include whisper, just manual transcription): https://huggingface.co/spaces/gilkeyio/PhonemeForcedAlignment. Not sure how straightforward it would be, or how open @m-bain would be to integrating something like this into whisperX, but the process is basically the same.
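
For reference, here's a minimal sketch of how a phoneme wav2vec2 model can be queried directly via transformers. It assumes the facebook/wav2vec2-xlsr-53-espeak-cv-ft checkpoint (the same one used for alignment further down in this thread) and 16 kHz mono audio; forced alignment then works on top of CTC output like this:

import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Phoneme-level CTC model (assumed checkpoint; swap in your own if needed)
model_name = "facebook/wav2vec2-xlsr-53-espeak-cv-ft"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# The model expects 16 kHz mono audio
speech, _ = librosa.load("pyre.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decode gives a space-separated IPA phoneme string
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids))  # e.g. ['aɪ n oʊ n aʊ ...']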

gilkeyio avatar Sep 22 '23 15:09 gilkeyio

Actually, whisperX can also be used to get these, with only a couple of small changes!

  • Use an alignment model that has been fine-tuned to output phonemes, like this one from Hugging Face.
  • Before doing the alignment step on the whisper output, use phonemizer to get a phoneme version of the transcript, e.g. "I know now what brings me to the pyre" becomes "aɪ noʊ naʊ wʌt bɹɪŋz miː tə ðə paɪɚ" (see the short sketch below).
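
For example, phonemizer can be called directly like this (a quick sketch; it assumes the espeak backend, which needs espeak-ng installed on the system):

from phonemizer import phonemize

# espeak backend, US English IPA; strip=True drops the trailing separator
ipa = phonemize("I know now what brings me to the pyre",
                language="en-us", backend="espeak", strip=True)
print(ipa)  # e.g. 'aɪ noʊ naʊ wʌt bɹɪŋz miː tə ðə paɪɚ'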

I think both of these should work pretty well in non-English languages without any modification, but I'm not sure.

import whisperx
from phonemizer import phonemize


device = "cpu"
audio_file = "pyre.wav"
batch_size = 16 # reduce if low on GPU mem
compute_type = "int8"

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("small", device, compute_type=compute_type, language="en")

audio = whisperx.load_audio(audio_file)

transcript = model.transcribe(audio, batch_size=batch_size)["segments"]
print(transcript) # 'I know now what brings me to the pyre'

# Use phonemize to get the transcript in terms of phonemes
phone_transcript = [
    {"text": phonemize(segment["text"]), "start": segment["start"], "end": segment["end"]}
    for segment in transcript
]

print(phone_transcript) # 'aɪ noʊ naʊ wʌt bɹɪŋz miː tə ðə paɪɚ'

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(model_name="facebook/wav2vec2-xlsr-53-espeak-cv-ft", language_code="en", device=device)
print(metadata)

result = whisperx.align(phone_transcript, model_a, metadata, audio, device, return_char_alignments=True)

print(result["segments"]) # after alignment

#[{'start': 0.469, 'end': 2.472, 'text': 'aɪ noʊ naʊ wʌt bɹɪŋz miː tə ðə paɪɚ', 'words': [{'word': 'aɪ', 'start': 0.469, 'end': 0.609, 'score': 0.708}, {'word': 'noʊ', 'start': 0.609, 'end': 0.89, 'score': 0.694}, {'word': 'naʊ', 'start': 0.89, 'end': 1.21, 'score': 0.525}, {'word': 'wʌt', 'start': 1.21, 'end': 1.411, 'score': 0.644}, {'word': 'bɹɪŋz', 'start': 1.411, 'end': 1.731, 'score': 0.929}, {'word': 'miː', 'start': 1.731, 'end': 1.891, 'score': 0.894}, {'word': 'tə', 'start': 1.891, 'end': 2.012, 'score': 0.864}, {'word': 'ðə', 'start': 2.012, 'end': 2.152, 'score': 0.977}, {'word': 'paɪɚ', 'start': 2.152, 'end': 2.472, 'score': 0.738}], 'chars': [{'char': 'a', 'start': 0.469, 'end': 0.529, 'score': 0.667}, {'char': 'ɪ', 'start': 0.529, 'end': 0.609, 'score': 0.749}, {'char': ' '}, {'char': 'n', 'start': 0.609, 'end': 0.669, 'score': 0.449}, {'char': 'o', 'start': 0.669, 'end': 0.789, 'score': 0.834}, {'char': 'ʊ', 'start': 0.789, 'end': 0.89, 'score': 0.798}, {'char': ' '}, {'char': 'n', 'start': 0.89, 'end': 0.95, 'score': 0.658}, {'char': 'a', 'start': 0.95, 'end': 1.19, 'score': 0.917}, {'char': 'ʊ', 'start': 1.19, 'end': 1.21, 'score': 0.0}, {'char': ' '}, {'char': 'w', 'start': 1.21, 'end': 1.23, 'score': 0.308}, {'char': 'ʌ', 'start': 1.23, 'end': 1.29, 'score': 0.901}, {'char': 't', 'start': 1.29, 'end': 1.411, 'score': 0.722}, {'char': ' '}, {'char': 'b', 'start': 1.411, 'end': 1.451, 'score': 0.986}, {'char': 'ɹ', 'start': 1.451, 'end': 1.491, 'score': 0.939}, {'char': 'ɪ', 'start': 1.491, 'end': 1.591, 'score': 0.775}, {'char': 'ŋ', 'start': 1.591, 'end': 1.651, 'score': 0.981}, {'char': 'z', 'start': 1.651, 'end': 1.731, 'score': 0.964}, {'char': ' '}, {'char': 'm', 'start': 1.731, 'end': 1.771, 'score': 0.977}, {'char': 'i', 'start': 1.771, 'end': 1.891, 'score': 0.81}, {'char': 'ː'}, {'char': ' '}, {'char': 't', 'start': 1.891, 'end': 1.931, 'score': 0.978}, {'char': 'ə', 'start': 1.931, 'end': 2.012, 'score': 0.75}, {'char': ' '}, {'char': 'ð', 'start': 2.012, 'end': 2.052, 'score': 0.974}, {'char': 'ə', 'start': 2.052, 'end': 2.152, 'score': 0.98}, {'char': ' '}, {'char': 'p', 'start': 2.152, 'end': 2.192, 'score': 0.99}, {'char': 'a', 'start': 2.192, 'end': 2.252, 'score': 0.664}, {'char': 'ɪ', 'start': 2.252, 'end': 2.452, 'score': 0.884}, {'char': 'ɚ', 'start': 2.452, 'end': 2.472, 'score': 0.416}, {'char': ' '}]}]
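
If a flat list of per-phoneme timings is more convenient than the nested alignment dict, one way is to post-process the character-level entries, dropping the space separators and any character that didn't receive a timestamp (like the 'ː' above):

# Flatten the character-level alignment into (phoneme, start, end) tuples,
# skipping spaces and characters without timing information
phoneme_timings = [
    (c["char"], c["start"], c["end"])
    for segment in result["segments"]
    for c in segment["chars"]
    if c["char"].strip() and "start" in c
]
print(phoneme_timings)  # [('a', 0.469, 0.529), ('ɪ', 0.529, 0.609), ...]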

Hope this helps! 😄

gilkeyio avatar Sep 26 '23 15:09 gilkeyio

@gilkeyio Thanks bro, I will try your code.

TNBT12g avatar Oct 31 '23 13:10 TNBT12g

@gilkeyio Seems it's not working anymore. Any idea?

sadi304 avatar Mar 28 '24 05:03 sadi304

This is perfect! Thank you very much for both the question and the solution. Worked for me on French audio.

hadware avatar Apr 07 '24 00:04 hadware