pywhispercpp

Malformed multi-byte UTF8 characters

Open raivisdejus opened this issue 7 months ago • 0 comments

As noted in https://github.com/ggerganov/whisper.cpp/issues/1798, a multi-byte UTF-8 character is sometimes split across multiple tokens, with some of its bytes in the first token and the rest in the second.
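To make the failure concrete, here is a minimal sketch that does not use pywhispercpp at all. It assumes the split falls inside the Latvian letter "ā", whose UTF-8 encoding is the two bytes 0xC4 0x81; the helper name try_decode is made up for illustration.

```python
# "ā" (U+0101) is encoded in UTF-8 as two bytes: 0xC4 0x81.
raw = "ā".encode("utf-8")

# Simulate the character being split across two tokens.
first, second = raw[:1], raw[1:]


def try_decode(chunk: bytes) -> str:
    """Decode a chunk on its own, reporting the error reason on failure."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


first_result = try_decode(first)    # fails: the sequence is cut short
second_result = try_decode(second)  # fails: a continuation byte cannot start a character
glued = (first + second).decode("utf-8")  # decodes once the bytes are rejoined

print(first_result)
print(second_result)
print(glued)
```

Neither half is valid UTF-8 by itself, but the concatenation is, which is why access to the raw bytes is enough for a workaround.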

A sample audio file where this happens is here: https://github.com/chidiwilliams/buzz/blob/main/testdata/whisper-latvian.wav

If we could get at the raw bytes of the segment "text", we could work around this by gluing two adjacent tokens together whenever one of them does not decode on its own. The current version fails on the whisper_full_get_segment_text call with: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: unexpected end of data

A solution could be to add some function like whisper_full_get_segment_bytes that would return raw bytes of the segment text for manual processing.
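As a rough sketch of the manual processing I have in mind, assuming the proposed whisper_full_get_segment_bytes existed (it does not exist today) and returned one bytes object per segment, the gluing could be done with Python's standard incremental UTF-8 decoder. The function name decode_chunks and the sample bytes are made up for illustration.

```python
import codecs
from typing import Iterable, List


def decode_chunks(chunks: Iterable[bytes]) -> List[str]:
    """Decode byte chunks in order, carrying incomplete UTF-8 sequences forward.

    A character whose bytes straddle two chunks ends up in the text of the
    chunk that completes it.
    """
    # errors="replace" only affects bytes that are truly invalid; an incomplete
    # trailing sequence is buffered until the next chunk arrives.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    texts = [decoder.decode(chunk) for chunk in chunks]

    # Flush the decoder: anything still incomplete at the very end is replaced
    # with U+FFFD rather than raising.
    tail = decoder.decode(b"", final=True)
    if tail and texts:
        texts[-1] += tail
    return texts


# Hypothetical segment bytes: "kā iet" with the split falling inside "ā".
raw = "kā iet".encode("utf-8")          # b'k\xc4\x81 iet'
segments = [raw[:2], raw[2:]]           # b'k\xc4' and b'\x81 iet'

texts = decode_chunks(segments)
print(texts)
```

The trade-off is that the repaired character is attributed to the later segment, so segment boundaries shift by one character in the affected places.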

raivisdejus, Jul 16 '24 16:07