[Whisper] Word level and character level timestamps
Feature request
output = pipe(audio_file, chunk_length_s=30, return_timestamps=True)
Get word-level and character-level timestamps from the Whisper ASR pipeline when using return_timestamps=True.
Motivation
The timestamps currently returned are at the stride/segment level. For our use case, we want accurate timestamps for each word, or possibly each character.
Your contribution
With guidance, happy to submit the PR.
cc @ArthurZucker and @Narsil
Hi @Rishabh-Choudhry .
This is impossible to do with Whisper directly. Whisper simply doesn't work that way: it outputs "timestamp" tokens roughly when it feels like it, and that's all we can do with them.
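To make this concrete, here is an illustrative sketch of what those timestamp tokens look like. Whisper's decoder interleaves special tokens such as `<|0.00|>` with the text, so segment boundaries can be parsed from the decoded string, but only at the points where the model chose to emit them (this is a toy parser over the decoded string; the actual pipeline works on token IDs internally):

```python
import re

def parse_segments(decoded):
    """Extract (start, end, text) segments from Whisper output that was
    decoded with its special timestamp tokens kept in the string."""
    # Each segment is bracketed by a start and an end timestamp token.
    pattern = r"<\|(\d+\.\d+)\|>([^<]*)<\|(\d+\.\d+)\|>"
    return [(float(s), float(e), text.strip())
            for s, text, e in re.findall(pattern, decoded)]
```

Note that the granularity is whatever the model emitted, typically a phrase or sentence, which is why word-level timestamps need a different mechanism.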
I've seen hybrid approaches where you use wav2vec2 (or similar) to get those accurate timestamps and resolve the potential conflicts. This is, however, outside the scope of the pipelines in my opinion (too complex, it requires running 2 different models, and alignment is impossible in the general case).
https://github.com/m-bain/whisperX
Would that work for you?
This approach with DTW is more memory efficient and scalable: https://github.com/linto-ai/whisper-timestamped
Just going to bump this. There are several solutions out there, and this is a pretty key missing feature of the Transformers implementation of Whisper. E.g. https://github.com/jianfch/stable-ts/blob/main/stable_whisper/whisper_word_level.py
There's a PR opened for it: https://github.com/huggingface/transformers/pull/21427
If you look at it, it actually uncovered some issues with Whisper itself (in non-timestamp mode, which is the default in transformers but not the default in openai/whisper).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
NB: word level timestamps were added to openai/whisper last week. Tried it out, it seems to work. https://github.com/openai/whisper/commit/500d0fe9668fae5fe2af2b6a3c4950f8a29aa145
I've investigated adding word-level timestamps to Transformers using the OpenAI approach of using the cross-attention weights. Preliminary results can be found in this Colab: https://colab.research.google.com/drive/1VWbAgzKWQsStdAA1hcumBU2uyFQX7zAB?usp=sharing
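The heart of that approach is a monotonic dynamic-time-warping (DTW) alignment between decoded token positions and audio frames, with the cross-attention weights supplying the cost matrix. A minimal pure-NumPy sketch of just the DTW step (illustrative only, not the actual Transformers implementation, which also averages and normalizes the attention weights of selected heads):

```python
import numpy as np

def dtw_path(cost):
    """Return the minimum-cost monotonic path through cost[i, j]
    from (0, 0) to (N-1, M-1) as a list of (i, j) index pairs."""
    n, m = cost.shape
    # Accumulated-cost matrix with an inf border for the boundary conditions.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    # Backtrace from the bottom-right corner, always taking the cheapest
    # predecessor (down, right, or diagonal in the forward direction).
    i, j = n, m
    path = []
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        moves = {
            (i - 1, j): acc[i - 1, j],
            (i, j - 1): acc[i, j - 1],
            (i - 1, j - 1): acc[i - 1, j - 1],
        }
        i, j = min(moves, key=moves.get)
    return path[::-1]
```

Each `(token_index, frame_index)` pair on the path gives a frame, and hence a time, for every decoded token; word timestamps then come from grouping tokens back into words.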
Closed by https://github.com/huggingface/transformers/pull/23205
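With that PR merged, the ASR pipeline accepts `return_timestamps="word"`. A minimal sketch (the model name and audio path are placeholders, and the pipeline call is left commented out since it downloads a model), with a small helper to pretty-print the returned per-word chunks:

```python
# from transformers import pipeline

def format_words(chunks):
    """Render the pipeline's per-word chunks as 'start-end<TAB>word' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        lines.append(f"{start:.2f}-{end:.2f}\t{chunk['text'].strip()}")
    return "\n".join(lines)

# pipe = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
# output = pipe("audio.wav", return_timestamps="word")
# print(format_words(output["chunks"]))
```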
@hollance Thanks for adding this nice feature. I understand that the cross-attention weights are used to get token-level timestamps. So I think getting token-level timestamps shouldn't depend on any additional fine-tuning. What do you think? If I want to get token-level timestamps from my fine-tuned model, is there anything I need to be careful about? The model was fine-tuned with sentence-level timestamp tokens attached.
@upskyy You may need to use a different set of alignment heads for the fine-tuned model. See also: https://gist.github.com/hollance/42e32852f24243b748ae6bc1f985b13a