[WHISPER] Add language to whisper output
Feature request
Add the detected language to the Whisper output, in addition to the currently returned text and chunks.
Whisper emits a language token, and this feature is important for auto-detection, since some use cases don't know the language of the audio being transcribed or translated.
Motivation
One example usage: for both transcription and translation, if the detected language is en, we don't need to run an additional translation step.
Your contribution
Tried, couldn't find where
We'll be adding a tokenizer_kwargs argument to allow skip_special_tokens to be overridden. This should allow you to do something like:
>>> out = pipeline(..., tokenizer_kwargs={"skip_special_tokens": False}, return_timestamps=True, max_length=2)
"<|startoftranscript|><|en|>"
Then you can either regex the output or re-encode it with the tokenizer, and that should do the trick. cc @Narsil as we talked about this offline.
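For the regex route, a minimal sketch (illustration only; it assumes the raw decoded text, with special tokens kept, ends up in out["text"] with the "<|startoftranscript|><|xx|>" prefix shown above):
import re

decoded = out["text"]  # assumption: raw decoded text with special tokens kept
match = re.search(r"<\|startoftranscript\|><\|([a-z]{2,3})\|>", decoded)
detected_language = match.group(1) if match else None  # e.g. "en"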
Would that work for you?
I... am not sure?
I can only come at this as a fairly clueless dev who barely understands tokenization. In that case, compared to how whisper is built, the above seems very complex to do.
@ArthurZucker I think, as we chatted, you have many constraints around keeping pipeline features generic.
Could there be an easier way to get the detected language?
Maybe exposing the detect_language feature of whisper via pipe.model.detect_language(audio_file) somehow?
As a workaround I'm loading and running whisper base just to detect the language; I would love to at least be able to use the already-loaded transformers Whisper model.
So far, no luck.
detect_language is not exposed, and running pipe.model.generate() on the source file gives me:
AttributeError: 'str' object has no attribute 'shape'
Which I assume is because generate needs a numpy array of the audio? 🤔
But def. complex for the average user
For anyone getting here, I found a better workaround, thanks to @ArthurZucker's notebook with examples: https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor?usp=sharing#scrollTo=i5sKbZpsY9-J
It does still require the whisper dependency, but it doesn't load the openai whisper model into memory at all; it just uses its utils and its dependency on ffmpeg-python.
import logging
import whisper  # openai-whisper: only its audio utils are used, the model itself is never loaded
from whisper.tokenizer import LANGUAGES  # maps language codes to names, e.g. "en" -> "english"
log = logging.getLogger(__name__)

# Load the audio and keep only the first 30 seconds for language detection.
audio = whisper.load_audio(source_file)
short_audio_for_lang_detection = whisper.pad_or_trim(audio)
inputs = pipe.feature_extractor(
    short_audio_for_lang_detection, return_tensors="pt", sampling_rate=16_000
).input_features.to(pipe.device)
# The second generated token (index 1) is the language token, e.g. "<|en|>".
lang_token = pipe.model.generate(inputs, max_new_tokens=1)[0, 1]
detected_language_token = pipe.tokenizer.decode(lang_token)
detected_language = detected_language_token[2:-2]  # strip "<|" and "|>"
language_title = LANGUAGES[detected_language]
log.info(f"Detected language: {language_title}")
pipe.feature_extractor(short_audio_for_lang_detection) should by default keep only the first 30 seconds, so short_audio_for_lang_detection = whisper.pad_or_trim(audio) is probably unnecessary.
@Narsil how about we make
if isinstance(inputs, str):
    if inputs.startswith("http://") or inputs.startswith("https://"):
        # We need to actually check for a real protocol, otherwise it's impossible to use a local file
        # like http_huggingface_co.png
        inputs = requests.get(inputs).content
    else:
        with open(inputs, "rb") as f:
            inputs = f.read()

if isinstance(inputs, bytes):
    inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate)
into a simple function that you call in the preprocess? This could remove all whisper dependencies in this example. WDYT?
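A rough sketch of what such a helper could look like (the name load_audio_inputs is hypothetical; ffmpeg_read is the same utility the pipeline's preprocess already uses, from transformers.pipelines.audio_utils):
import requests
import numpy as np
from transformers.pipelines.audio_utils import ffmpeg_read  # requires ffmpeg installed on the system

def load_audio_inputs(inputs, sampling_rate: int) -> np.ndarray:
    # Same logic as the preprocess snippet above, factored out into a reusable function.
    if isinstance(inputs, str):
        if inputs.startswith("http://") or inputs.startswith("https://"):
            # Only treat real protocols as URLs, so a local file like
            # http_huggingface_co.png still works.
            inputs = requests.get(inputs).content
        else:
            with open(inputs, "rb") as f:
                inputs = f.read()
    if isinstance(inputs, bytes):
        inputs = ffmpeg_read(inputs, sampling_rate)
    return inputs

With something like that, the workaround above could call load_audio_inputs(source_file, pipe.feature_extractor.sampling_rate) instead of whisper.load_audio, dropping the whisper dependency entirely.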
could be awesome to have model.detect_language instead of all the mess above and dependencies on whisper!
into a simple function that you call in the preprocess?
Sure, I'm not sure I understand how that cleans up the audio trimming, but we can definitely abstract it away.
could be awesome to have model.detect_language instead of all the mess above and dependencies on whisper!
If you have good ideas, please suggest them instead of waving them away.
Unfortunately, we can't just add detect_language wherever it might go. Whisper is not the final model for audio: when, three months down the line, another model that works entirely differently comes into play and we have special-cased detect_language for Whisper, we're going to be in bad shape to support that new shiny model seamlessly. Writing model-specific code is trivial; that's what the snippet provided above is for. Building abstractions over many models that work very differently is much harder, and that's what we're trying to do, so that users can switch to the new shiny model later without rewriting their entire code.
The model doesn't own, and will never own, the feature_extractor, which is required for the mel extraction of the audio, for instance, so model.detect_language doesn't work.
The pipeline also works on long audio by chunking it into segments of some length in seconds, so a single file could potentially have a different language detected in each chunk, and we have to account for that.
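For illustration only, a naive sketch of per-chunk detection, reusing audio and pipe from the workaround above (a real implementation would also have to deal with the stride/overlap the chunked pipeline uses):
# Detect one language per 30-second chunk of 16 kHz mono audio.
chunk_samples = 30 * 16_000
detected = []
for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]
    features = pipe.feature_extractor(
        chunk, return_tensors="pt", sampling_rate=16_000
    ).input_features.to(pipe.device)
    lang_token = pipe.model.generate(features, max_new_tokens=1)[0, 1]
    detected.append(pipe.tokenizer.decode(lang_token)[2:-2])  # e.g. "en"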
This was fixed by #21427, closing it!