whisperX
Transcription/translation language change -- '50358' / '50359' is not a valid task (accepted tasks: transcribe, translate)
Bug scenario
In the following scenario:
- model = whisperx.load_model(..., language='es') (model e.g. large-v2)
- model.transcribe(..., language='es')
- model.align(..., language='es')
- model.transcribe(..., language='en')
- model.align(..., language='en')
We got the following error:
ValueError: '50359' is not a valid task (accepted tasks: transcribe, translate)
Cause
This is a sub-case of a more general problem with default parameters, located in the following lines of code: https://github.com/m-bain/whisperX/blob/f2da2f858e99e4211fe4f64b5f2938b007827e17/whisperx/asr.py#L201-L205
With the default method parameter (task=None), transcribe takes the task from tokenizer.task. By that point, however, faster_whisper.tokenizer.Tokenizer has already mapped the task to an int token ID. As a consequence, the task becomes the integer 50359 rather than a task name (for large-v2, transcribe=50359 and translate=50358), and it is then rejected as an invalid task.
The problem occurs when the tokenizer language is changed on an existing model wrapper.
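The mechanism described above can be reproduced without loading any model. The following is a minimal sketch, not the actual whisperX or faster_whisper code: FakeTokenizer and FakeModel are hypothetical stand-ins that only mimic the task handling described in this report, and the token IDs are assumed values for a multilingual model such as large-v2.

```python
_TASKS = ("transcribe", "translate")
# Assumed token IDs for a multilingual vocabulary (e.g. large-v2).
_TASK_IDS = {"transcribe": 50359, "translate": 50358}


class FakeTokenizer:
    """Hypothetical stand-in: validates the task name, then stores it as an int."""

    def __init__(self, task, language):
        if task not in _TASKS:
            raise ValueError(
                "'%s' is not a valid task (accepted tasks: %s)"
                % (task, ", ".join(_TASKS))
            )
        self.task = _TASK_IDS[task]  # the task name is replaced by its token ID
        self.language_code = language


class FakeModel:
    """Hypothetical stand-in for the model wrapper's tokenizer handling."""

    def __init__(self):
        self.tokenizer = None

    def transcribe(self, language, task=None):
        if self.tokenizer is None:
            task = task or "transcribe"
            self.tokenizer = FakeTokenizer(task, language)
        else:
            # With task=None this picks up the int ID, not the task name.
            task = task or self.tokenizer.task
            if task != self.tokenizer.task or language != self.tokenizer.language_code:
                # A language change rebuilds the tokenizer with the int task.
                self.tokenizer = FakeTokenizer(task, language)
        return task


if __name__ == "__main__":
    model = FakeModel()
    model.transcribe("es")       # first call: tokenizer is created, no error
    try:
        model.transcribe("en")   # language change with default task=None
    except ValueError as err:
        print(err)
```

In this sketch the second call fails only because both conditions hold at once: the task is left at its default and the language differs from the one the tokenizer was built with.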
Temporary fix: pass the task explicitly:
model.transcribe(..., task="transcribe")
# or
model.transcribe(..., task="translate")
Desired fix:
Reverse-map the task id stored in the tokenizer back to its task name.
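One possible shape for such a reverse mapping is sketched below. The helper name task_name_from_id is hypothetical and is not part of whisperX or faster_whisper. It assumes a token_to_id callable that maps a token string such as "<|transcribe|>" to its integer ID; the demo uses a plain dict with assumed large-v2 IDs in its place.

```python
_TASKS = ("transcribe", "translate")


def task_name_from_id(task, token_to_id):
    """Return the task name for a task given either as a name or as a token ID.

    `token_to_id` is any callable mapping a token string to its integer ID.
    """
    if task in _TASKS:
        return task  # already a name, nothing to do
    # Build the reverse map from the vocabulary instead of hard-coding IDs.
    id_to_task = {token_to_id("<|%s|>" % name): name for name in _TASKS}
    if task not in id_to_task:
        raise ValueError(
            "%r is not a valid task (accepted tasks: %s)"
            % (task, ", ".join(_TASKS))
        )
    return id_to_task[task]


if __name__ == "__main__":
    # Assumed IDs for a multilingual vocabulary; a real tokenizer supplies its own.
    vocab = {"<|translate|>": 50358, "<|transcribe|>": 50359}
    print(task_name_from_id(50359, vocab.get))
    print(task_name_from_id("translate", vocab.get))
```

Looking the IDs up through the tokenizer rather than hard-coding them keeps the mapping correct for vocabularies where the task tokens sit at different positions.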
Other info about that problem:
https://github.com/huggingface/transformers/issues/22331