
Whisper.cpp parseResultObject failure/edge case

Open smoores-dev opened this issue 1 year ago • 8 comments

When recognizing some borderline-pathological audio content, Whisper.cpp apparently sometimes outputs tokens without an offsets property, resulting in the following error:

await recognize(`./00000-00009.mp4`, {engine: "whisper.cpp", language: "en", whisper: {model: "tiny.en", build: "cpu"}})
Uncaught TypeError: Cannot read properties of undefined (reading 'from')
    at parseResultObject (file:///home/smoores/code/storyteller/node_modules/echogarden/dist/recognition/WhisperCppSTT.js:187:49)

Here's the audio asset in question:

https://github.com/user-attachments/assets/68a6bac2-6461-4787-8943-821f6c5d0311

It's a TTS narration of a passage that includes, at a few points, the phrase "I love you" several dozen times in a row.

smoores-dev avatar Jul 31 '24 02:07 smoores-dev

The possibly related lines are:

if (tokenIndex === 0 && tokenObject.text === '[_BEG_]' && tokenObject.offsets.from === 0) {
	currentCorrectionTimeOffset = segmentObject.offsets.from / 1000
}

and

startTime = tokenObject.offsets.from / 1000
endTime = tokenObject.offsets.to / 1000

The code assumes that offsets.from and offsets.to are always available.

Anyway, the whisper.cpp build used by default has become slightly outdated now (early April 2024). Can you try with a newer whisper.cpp build (v1.6.0 seems to be the latest published with actual binaries) to see if the problem was maybe fixed since then?

You can set a custom main executable with whisperCpp.executablePath.

If that doesn't help, I'll see how I can work around the issue to prevent the error.

rotemdan avatar Jul 31 '24 02:07 rotemdan

Yeah, this is unfortunately happening even when building directly from HEAD on the master branch of the whisper.cpp repo! I just ran the whisper.cpp command with the same flags as echogarden and found the problem token; at the end of the first very long string of "I love you"s, the last "you" token looks like this:

{
	"text": " you",
	"id": 291,
	"p": 0.960787,
	"t_dtw": -1
}

It has neither timestamps nor offsets!
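For anyone who wants to check their own output for the same problem, here is a rough sketch of a scan for tokens that lack offsets. It assumes the whisper.cpp JSON has already been parsed into a list of segments, each with its own offsets and a tokens array, as the snippets in this thread suggest. The type names and findTokensWithoutOffsets are hypothetical helpers, not part of echogarden or whisper.cpp.

```typescript
// Hypothetical types, modeled on the token JSON quoted in this thread.
// Other token fields (id, p, t_dtw) are omitted because the scan doesn't need them.
interface Offsets { from: number; to: number }
interface Token { text: string; offsets?: Offsets }
interface Segment { offsets: Offsets; tokens: Token[] }

interface TokenLocation {
	segmentIndex: number
	tokenIndex: number
	text: string
}

// Returns the position of every token that has no usable offsets.
function findTokensWithoutOffsets(segments: Segment[]): TokenLocation[] {
	const missing: TokenLocation[] = []

	segments.forEach((segment, segmentIndex) => {
		segment.tokens.forEach((token, tokenIndex) => {
			const offsets = token.offsets

			if (!offsets || typeof offsets.from !== 'number' || typeof offsets.to !== 'number') {
				missing.push({ segmentIndex, tokenIndex, text: token.text })
			}
		})
	})

	return missing
}
```

Running it on the parsed output before handing the tokens to any timing code should list the problem tokens by segment and token index.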

smoores-dev avatar Jul 31 '24 03:07 smoores-dev

Thanks a lot for the investigation.

I guess the issue can be reported on the whisper.cpp repository, if it hasn't already.

For now, I can work around the issue by filling in missing timestamps based on neighboring timestamps.
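To illustrate the idea (this is only a sketch, not the code that ended up in echogarden), missing offsets could be filled in per segment roughly like this. It assumes each run of tokens without offsets should evenly share the gap between the previous token's end and the next token's start, falling back to the segment's own offsets at either edge. The types and fillMissingOffsets are hypothetical names.

```typescript
// Hypothetical types, modeled on the token JSON quoted in this thread.
interface Offsets { from: number; to: number }
interface Token { text: string; offsets?: Offsets }
interface Segment { offsets: Offsets; tokens: Token[] }

function hasOffsets(token: Token): boolean {
	return token.offsets !== undefined &&
		typeof token.offsets.from === 'number' &&
		typeof token.offsets.to === 'number'
}

// Returns a copy of the segment's tokens in which every token has offsets.
// Assumption: a run of tokens without offsets evenly shares the gap between
// its neighbors (or the segment boundaries when there is no neighbor).
function fillMissingOffsets(segment: Segment): Token[] {
	const tokens: Token[] = segment.tokens.map(token => ({ ...token }))

	let index = 0

	while (index < tokens.length) {
		if (hasOffsets(tokens[index])) {
			index++
			continue
		}

		// Find the end (exclusive) of this run of tokens without offsets
		let runEnd = index

		while (runEnd < tokens.length && !hasOffsets(tokens[runEnd])) {
			runEnd++
		}

		const previous = index > 0 ? tokens[index - 1] : undefined
		const next = runEnd < tokens.length ? tokens[runEnd] : undefined

		const gapStart = previous?.offsets?.to ?? segment.offsets.from
		// Clamp so a token never ends before it starts
		const gapEnd = Math.max(gapStart, next?.offsets?.from ?? segment.offsets.to)
		const step = (gapEnd - gapStart) / (runEnd - index)

		for (let i = index; i < runEnd; i++) {
			const position = i - index

			tokens[i].offsets = {
				from: gapStart + position * step,
				to: gapStart + (position + 1) * step,
			}
		}

		index = runEnd
	}

	return tokens
}
```

The filled-in times are only estimates, so anything that needs accurate word timing should treat those tokens as approximate.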

I'm not actively developing this package at the moment (busy with other things), so I can't really predict when the workaround will be published (maybe a few weeks, I don't know).

rotemdan avatar Jul 31 '24 03:07 rotemdan

Yeah I'll open an issue against whisper.cpp as well; hopefully they'll fix it on their end! Thanks for taking a look

smoores-dev avatar Jul 31 '24 03:07 smoores-dev

Would it be easier if I were to open a PR that attempted to work around this as you described, by looking at the timestamps/offsets of the surrounding tokens? I know that PR review can also be quite a bit of work, so no worries if you'd rather handle it yourself! I was just reminded of the monstrous number of open issues against the whisper.cpp repo haha

smoores-dev avatar Jul 31 '24 03:07 smoores-dev

I don't think I need or want pull requests (so far I've closed the two that I got). This has been a personal project of mine. Maybe I'd prefer to keep the code 100% my own for now.

Even if I get the code, I can't guarantee when it is going to be published since I have other partially committed code destined for the next release.

Also, testing that it works correctly may take more time than actually writing the code.

So, no need for a pull request. I can try to quickly write and test a workaround locally, but it's not likely to be published during the next week (or possibly a bit longer).

rotemdan avatar Jul 31 '24 03:07 rotemdan

Understood, sounds good!

smoores-dev avatar Jul 31 '24 03:07 smoores-dev

Although I couldn't reproduce this, I added a workaround for this issue in the new 1.6.0 release. The source diff is here.

The new version also supports the new large-v3-turbo model (based on the latest whisper.cpp repository state).

Let me know if you encounter any other issues.

rotemdan avatar Oct 04 '24 07:10 rotemdan

This is awesome, thanks so much! Storyteller folks are very excited about the large-v3-turbo model haha

smoores-dev avatar Nov 14 '24 04:11 smoores-dev