
Explore the usage of multilingual models

Open MahmoudAshraf97 opened this issue 2 years ago • 4 comments

I think we need to explore multilingual models such as wav2vec2-xls-r-300m-21-to-en to see whether the 300M models are better than the 53-language models currently used for low-resource languages, and whether we could use a single model for multilingual alignment.

I have the code written but not thoroughly tested, so I might share it after #53 is merged, but I wanted to hear your thoughts about this first.

MahmoudAshraf97 avatar Feb 01 '23 01:02 MahmoudAshraf97

Good idea, multilingual phoneme ASR is probably more ideal, esp. for low-resource like you say.

But isn't wav2vec2-xls-r-300m-21-to-en for speech translation? Meaning the output is English words?

m-bain avatar Feb 01 '23 10:02 m-bain

Yes, it's for translation indeed, but I thought the multilingual encoder might be of some use. I couldn't verify the idea, since the output I got was the token IDs for the vocab, whereas in your code you use logits and then softmax, so I don't know whether these two are equivalent. There's an option to output scores from the model, but I still don't know if those are useful or not. Nonetheless, there are models with multilingual input/output such as wav2vec2-xls-r-2b-22-to-16, but I doubt these can be used on consumer GPUs; they're more suitable for cloud GPUs with large VRAM.

MahmoudAshraf97 avatar Feb 01 '23 10:02 MahmoudAshraf97

Here is some code to run these models on sample input, since the example given on HF isn't correct:

import torch
from transformers import Wav2Vec2Processor, AutoModelForSpeechSeq2Seq
from datasets import load_dataset

# Load the speech-translation checkpoint and its processor
model = AutoModelForSpeechSeq2Seq.from_pretrained("facebook/wav2vec2-xls-r-300m-21-to-en")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-xls-r-300m-21-to-en")

# Dummy LibriSpeech split used on the model card
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# Preprocess the raw waveform, then generate the translated token IDs and decode them
inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids)

# Same generation, but returning the per-step scores instead of only the token IDs
alternate_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
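
For reference, a possible way to inspect those scores (assuming the generate call above succeeds): each element of alternate_output.scores holds the decoder logits for one generation step, so softmaxing them gives per-step probabilities, roughly analogous to what the alignment code does with the CTC logits. A minimal sketch:

import torch

# Each element of `alternate_output.scores` is a (batch, vocab_size) tensor of
# decoder logits for one generation step; softmax turns them into per-step
# probabilities over the decoder vocabulary.
step_probs = [torch.softmax(step_logits, dim=-1) for step_logits in alternate_output.scores]
print(len(step_probs), step_probs[0].shape)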

MahmoudAshraf97 avatar Feb 01 '23 10:02 MahmoudAshraf97

I found that logits can be generated when Wav2Vec2ForCTC is used instead of AutoModelForSpeechSeq2Seq, so I'll test it for a supported language and report the results.
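
For context, the alignment path in whisperX only needs frame-level logits that can be softmaxed into emission probabilities. A minimal sketch of that kind of check, using an existing CTC-finetuned checkpoint as a stand-in (the model name here is just an example, not the checkpoint discussed above):

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Example CTC-finetuned checkpoint; any model with a trained CTC head and a
# matching processor should behave the same way.
MODEL_NAME = "jonatasgrosman/wav2vec2-large-xlsr-53-english"

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME).eval()

def frame_emissions(waveform):
    # `waveform` is a 1-D float array sampled at 16 kHz
    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (1, n_frames, vocab_size)
    # per-frame log-probabilities, the input expected by CTC forced alignment
    return torch.log_softmax(logits, dim=-1)

If the multilingual checkpoint really does expose usable CTC logits, the same softmax step should apply unchanged.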

MahmoudAshraf97 avatar Feb 01 '23 10:02 MahmoudAshraf97

Any updates on this? :pray:

JackCloudman avatar Mar 12 '23 11:03 JackCloudman

@JackCloudman, unfortunately I couldn't continue this. I ran into bugs down the line and couldn't solve them because I don't really understand the code logic. The potential is there, but someone with more knowledge of the codebase needs to explore it.

MahmoudAshraf97 avatar Mar 12 '23 11:03 MahmoudAshraf97