whisperX
Explore the usage of multilingual models
I think we need to explore multilingual models such as wav2vec2-xls-r-300m-21-to-en, to see whether the 300M models are better than the 53M models currently used for low-resource languages, and whether we could use a single model for multilingual alignment.
I have the code written but not thoroughly tested, so I might share it after #53 is merged, but I wanted to hear your thoughts about this first.
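To make this concrete, here is a minimal sketch of the interface I mean (not the code mentioned above): one multilingual CTC checkpoint producing frame-level log-probabilities for every language, instead of one checkpoint per language. The checkpoint name below is just a placeholder.

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder checkpoint: any multilingual wav2vec2 model fine-tuned with a CTC head
MODEL_ID = "some-org/multilingual-wav2vec2-ctc"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

def frame_logprobs(audio_array, sampling_rate=16000):
    # Frame-level log-probabilities over the model's character/phoneme vocab,
    # i.e. the same kind of emissions the per-language alignment models produce
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs["input_values"]).logits  # (batch, frames, vocab)
    return torch.log_softmax(logits, dim=-1)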
Good idea, multilingual phoneme ASR is probably more ideal, especially for low-resource languages like you say.
But isn't wav2vec2-xls-r-300m-21-to-en for speech translation? Meaning the output is English words?
Yes, it's for translation indeed, but I thought the multilingual encoder might be of some use. I couldn't verify the idea, though, since the output I got was token IDs from the vocab, whereas in your code you use logits and then softmax, and I don't know whether these two are equivalent. There's an option to output scores from the model, but I still don't know whether those are useful either. Nonetheless, there are models with multilingual input/output, such as wav2vec2-xls-r-2b-22-to-16, but I doubt those can be used on consumer GPUs; they're more suitable for cloud GPUs with large VRAM.
Here is code to run these models with a sample input, since the example given on HF isn't correct:
import torch
from transformers import Wav2Vec2Processor, AutoModelForSpeechSeq2Seq
from datasets import load_dataset

# Load the speech-translation checkpoint and its processor
model = AutoModelForSpeechSeq2Seq.from_pretrained("facebook/wav2vec2-xls-r-300m-21-to-en")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-xls-r-300m-21-to-en")

# Dummy audio sample for a quick test
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")

# Autoregressive generation: the output is decoder token IDs, not frame-level logits
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids)

# Same call, but also returning the per-step generation scores
alternate_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
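For reference, this is roughly how those scores can be inspected; note they are per decoder step, not per audio frame, which is why I'm not sure they can stand in for the alignment logits:

# Sketch: with return_dict_in_generate=True the result has .sequences (token IDs)
# and .scores, a tuple with one logits tensor per generated step,
# each of shape (batch_size, decoder_vocab_size)
step_probs = [torch.softmax(step_scores, dim=-1) for step_scores in alternate_output.scores]
print(len(step_probs), step_probs[0].shape)
# These are per decoding step, not per audio frame, so they are not a drop-in
# replacement for the frame-level CTC logits that get softmaxed for alignment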
I found that logits can be generated when Wav2Vec2ForCTC is used instead of AutoModelForSpeechSeq2Seq, so I'll test it for a supported language and report the results.
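Roughly what that swap looks like (a sketch, untested; if the checkpoint only ships the encoder-decoder translation weights, the CTC head loaded this way would be randomly initialized, so the logits would be produced but wouldn't be meaningful):

import torch
from transformers import Wav2Vec2ForCTC

# Sketch: load the same checkpoint with a CTC head instead of the seq2seq class.
# transformers will warn about the model type mismatch and about newly
# initialized weights if no trained CTC head exists in the checkpoint.
ctc_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-xls-r-300m-21-to-en")
with torch.no_grad():
    # reusing `inputs` from the snippet above
    ctc_logits = ctc_model(inputs["input_values"]).logits  # (batch, frames, vocab)
emissions = torch.log_softmax(ctc_logits, dim=-1)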
Any update on this? :pray:
@JackCloudman, unfortunately I couldn't continue this. I faced bugs down the line and couldn't solve them because I don't really understand the code logic. The potential is there, but someone with more knowledge of the code needs to explore it.