word-level timestamps?
Hi - thanks for making this. I was trying to get word-level timestamps, but haven't been able to figure out how to. Any tips? Thanks again!
Hi @antiboredom, you are welcome :) Glad you found it useful.
To achieve word-level timestamps, you will need to enable token_timestamps and set max_len to 1, like the following:
```python
from pywhispercpp.model import Model

model = Model('base.en', n_threads=6)
words = model.transcribe('file.mp3', token_timestamps=True, max_len=1)
for word in words:
    print(word.text)
```
Thank you! Not sure why I was having trouble sorting that out myself!
One more thing, and I'm not sure if this is just a whisper thing or related to your project, but I'm seeing one longer word being broken up. In my test case, "Enormous" is becoming "En", "orm", "ous". Any ideas why that might be happening?
It's a bit tricky: it is not an exact word-level timestamp per se. In fact, you can set max_len to whatever number of characters you want, so when you set max_len to 1, every token ends up on its own line, which gives results similar to word-level timestamps.
And I think this is the problem with your test case: it seems "Enormous" is tokenized into 3 tokens, and you get each token on its own. Although I've never encountered such a case!
Can you try changing max_len to 8, for example?
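To illustrate the idea, here is a hypothetical, greatly simplified sketch (not the actual whisper.cpp logic, whose exact split boundaries also depend on token leading spaces) of greedily packing tokens into segments of at most `max_len` characters. With `max_len=1` every token lands on its own line:

```python
# Simplified sketch of max_len-style token grouping (NOT the real
# whisper.cpp implementation): greedily pack decoded tokens into
# segments of at most max_len characters.
def group_tokens(tokens, max_len):
    segments, current = [], ""
    for tok in tokens:
        # Start a new segment if adding this token would exceed max_len
        # (a non-empty segment is always emitted, even for long tokens).
        if current and len(current) + len(tok) > max_len:
            segments.append(current)
            current = ""
        current += tok
    if current:
        segments.append(current)
    return segments

# "Enormous" tokenized into three sub-word pieces, as in the test case above
tokens = ["En", "orm", "ous"]
print(group_tokens(tokens, 1))  # every token on its own line
print(group_tokens(tokens, 5))  # pieces merge until the limit is hit
```

With `max_len=5` this sketch produces `['Enorm', 'ous']`, similar to the split reported above; the real implementation's boundaries can differ slightly.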
Interesting! With max_len set to 8, I get "Enorm" and "ous", and occasionally multiple words like "and if" appear on the same line... I have also tried faster-whisper, which does work as expected for word-level timestamps, but is significantly slower than your implementation...
You still get two separate pieces from "Enormous" even after setting max_len to 8; interesting test case!
Could you please share the audio file with me? I would like to test it on my end.
Yes, faster-whisper is great and should give you good results, and it should be just as fast, at least when I tested it a while ago! But I haven't compared the performance of the two implementations, to be honest.
@antiboredom @abdeladim-s I think you might want to try the -sow (aka split-on-word) option from whisper.cpp. I'm not really sure, but I think it concatenates tokens that don't start with a whitespace, thus keeping the tokens that form a single word together. So you may want the -ml 1 -sow options together.
Shortened output from ./main -h in the whisper.cpp repo:
```
-sow, --split-on-word [false ] split on word rather than on token
```
@dkakaie, I couldn't reproduce the issue with my test files at that time, but yes, you're probably right.
The split_on_word can be used as a parameter in the transcribe function as well.
Thanks @dkakaie for pointing that out!
Thanks for your great project. @absadiki, we ran into the same issue, followed the conversation above, and got this.
In the pic we see, e.g., t0=0, t1=27, text="my,". Does that mean 270 ms?
Am I getting this right?
@yuanshanxiaoni, Glad you found the project useful.
Yes, you’re correct, timestamps are in centiseconds, 27 corresponds to 270 ms.
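As a quick sanity check, converting those centisecond timestamps (the t0/t1 values shown above) to milliseconds or seconds is a simple multiplication/division:

```python
# whisper.cpp segment timestamps are in centiseconds (1/100 of a second).

def cs_to_ms(t):
    """Convert a centisecond timestamp to milliseconds."""
    return t * 10

def cs_to_seconds(t):
    """Convert a centisecond timestamp to seconds."""
    return t / 100.0

# The example from the screenshot: t0=0, t1=27 for the word "my,"
print(cs_to_ms(27))       # 270 ms
print(cs_to_seconds(27))  # 0.27 s
```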
We tested the max-len parameter of whisper-cli: it can fully split down to character-level timestamps in Chinese, but it does not seem to work in pywhispercpp, even though it works for English. Please see what the reason might be.
- Whether max-len is 1 or 3, it has no effect
@efwfe, why are you using split_on_word=True? You are not using it with whisper-cli!
> @efwfe, why are you using split_on_word=True? You are not using it with whisper-cli!

Thanks for your reply. It works after removing split_on_word=True.
> @efwfe, why are you using split_on_word=True? You are not using it with whisper-cli!

Hello, I used the same mp3 with whispercpp and whisper-cli, but the timestamps are not the same [right is whispercpp]; the parameters are the same.
@efwfe, are you using the same model? Please provide the entire code and command.
Thanks for your previous reply!
As you suggested, here's a code snippet that illustrates the issue I'm seeing with the timestamp mismatch between your tool and whisper-cli:
- this is the whisper-cli command and the output:
- this is the pywhispercpp and the output:
I'm trying to understand what might be causing this. Perhaps your tool includes a different pre-processing pipeline or segment-merging logic? I'd appreciate any guidance on how your timestamps are computed and how I might align them more closely with whisper-cli.
Thanks again for your support!
@efwfe, thanks for the details. Yes, the pre-processing was a bit different; that's why you were getting different results. I tried to make it as close as possible. Please pull the latest commit and give it a try.
@absadiki it would be great if you updated the documentation. I found this important note in the source code:
```python
'max_len': {
    'type': int,
    'description': "max segment length in characters, note: token_timestamps needs to be set to True for this to work",
    'options': None,
    'default': 0
},
```
P.S. I couldn't find the docs source to open a PR for the update.
@SHi-ON, I wasn't aware that the docs were not updated. They are generated from the source code and deployed using CI. This should be fixed now.