[Whisper] Word level and character level timestamps
Feature request
output = pipe(audio_file, chunk_length_s=30, return_timestamps=True)
Get word-level and character-level timestamps from the Whisper ASR pipeline when using return_timestamps=True.
Motivation
The timestamps currently returned are at the stride/segment level. For our use case, we want accurate timestamps for each word, or possibly each character.
Your contribution
With guidance, happy to submit the PR.
cc @ArthurZucker and @Narsil
Hi @Rishabh-Choudhry .
This is impossible to do with Whisper directly. Whisper simply doesn't work that way: it outputs "timestamp" tokens roughly when it feels like it, and that's all we can do with them.
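To make this concrete, here is an illustrative sketch of what those timestamp tokens look like. Whisper's decoder interleaves special tokens such as `<|0.00|>` with the text, so segment boundaries can be parsed from the decoded string, but only at the points where the model chose to emit them (this is a toy parser over the decoded string; the actual pipeline works on token IDs internally):

```python
import re

def parse_segments(decoded):
    """Extract (start, end, text) segments from Whisper output that was
    decoded with its special timestamp tokens kept in the string."""
    # Each segment is bracketed by a start and an end timestamp token.
    pattern = r"<\|(\d+\.\d+)\|>([^<]*)<\|(\d+\.\d+)\|>"
    return [(float(s), float(e), text.strip())
            for s, text, e in re.findall(pattern, decoded)]
```

Note that the granularity is whatever the model emitted, typically a phrase or sentence, which is why word-level timestamps need a different mechanism.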
I've seen hybrid approaches where you use wav2vec2 (or similar) to get those accurate timestamps and resolve the potential conflicts. This is, however, outside the scope of the pipelines in my opinion (too complex, it requires running 2 different models, and alignment is impossible in the general case).
https://github.com/m-bain/whisperX
Would that work for you?
This approach with DTW is more memory efficient and scalable: https://github.com/linto-ai/whisper-timestamped
Just going to bump this. There are several solutions out there, and this is a pretty key missing feature of the Transformers implementation of Whisper. E.g. https://github.com/jianfch/stable-ts/blob/main/stable_whisper/whisper_word_level.py
There's a PR opened for it: https://github.com/huggingface/transformers/pull/21427
If you look at it, it actually uncovered some issues with Whisper itself (in non-timestamp mode, which is the default in transformers but not the default in openai/whisper).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
NB: word level timestamps were added to openai/whisper last week. Tried it out, it seems to work. https://github.com/openai/whisper/commit/500d0fe9668fae5fe2af2b6a3c4950f8a29aa145
I've investigated adding word-level timestamps to Transformers using the OpenAI approach of using the cross-attention weights. Preliminary results can be found in this Colab: https://colab.research.google.com/drive/1VWbAgzKWQsStdAA1hcumBU2uyFQX7zAB?usp=sharing
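The heart of that approach is a monotonic dynamic-time-warping (DTW) alignment between decoded token positions and audio frames, with the cross-attention weights supplying the cost matrix. A minimal pure-NumPy sketch of just the DTW step (illustrative only, not the actual Transformers implementation, which also averages and normalizes the attention weights of selected heads):

```python
import numpy as np

def dtw_path(cost):
    """Return the minimum-cost monotonic path through cost[i, j]
    from (0, 0) to (N-1, M-1) as a list of (i, j) index pairs."""
    n, m = cost.shape
    # Accumulated-cost matrix with an inf border for the boundary conditions.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    # Backtrace from the bottom-right corner, always taking the cheapest
    # predecessor (down, right, or diagonal in the forward direction).
    i, j = n, m
    path = []
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        moves = {
            (i - 1, j): acc[i - 1, j],
            (i, j - 1): acc[i, j - 1],
            (i - 1, j - 1): acc[i - 1, j - 1],
        }
        i, j = min(moves, key=moves.get)
    return path[::-1]
```

Each `(token_index, frame_index)` pair on the path gives a frame, and hence a time, for every decoded token; word timestamps then come from grouping tokens back into words.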
Closed by https://github.com/huggingface/transformers/pull/23205
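With that PR merged, the ASR pipeline accepts `return_timestamps="word"`. A minimal sketch (the model name and audio path are placeholders, and the pipeline call is left commented out since it downloads a model), with a small helper to pretty-print the returned per-word chunks:

```python
# from transformers import pipeline

def format_words(chunks):
    """Render the pipeline's per-word chunks as 'start-end<TAB>word' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        lines.append(f"{start:.2f}-{end:.2f}\t{chunk['text'].strip()}")
    return "\n".join(lines)

# pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
# output = pipe("audio.wav", return_timestamps="word")
# print(format_words(output["chunks"]))
```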
@hollance Thanks for adding this nice feature. I understand that the cross-attention weights are used to get token-level timestamps. So I think getting token-level timestamps shouldn't depend on any additional fine-tuning. What do you think? If I want to get token-level timestamps from my fine-tuned model, is there anything I need to be careful about? The model was fine-tuned with sentence-level timestamp tokens attached.
@upskyy You may need to use a different set of alignment heads for the fine-tuned model. See also: https://gist.github.com/hollance/42e32852f24243b748ae6bc1f985b13a