whisper.cpp
Generated vtt/srt/txt contains invalid UTF-8 character
I noticed that, on rare occasions, whisper.cpp generates .txt, .vtt, and .srt files that contain invalid UTF-8 bytes. Is this expected behavior from the Whisper model that users need to handle themselves?
>>> open('454912_614694.wav.vtt','rb').read().decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 31479: invalid start byte
>>> open('454912_614694.wav.vtt','rb').read()[31400:31500]
b'garam\n\n00:39:42.720 --> 00:39:44.880\n 10g garam\n\n00:39:44.880 --> 00:39:56.120\n\x98\n\n00:39:56.120 --> 0'
Note for others using Python to read whisper.cpp output: you can work around this by decoding with the 'ignore' error handler, which silently drops the invalid bytes:
buffer = open('454912_614694.wav.vtt','rb').read().decode('utf-8', 'ignore')
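If you would rather see where the bad bytes are than drop them silently, something along these lines may help. It is only a sketch: the helper names `decode_lossy` and `find_invalid_utf8` are made up for illustration and are not part of whisper.cpp, and the sample bytes are a shortened version of the excerpt above. It uses Python's built-in `errors='replace'` handler, which substitutes U+FFFD for each invalid byte so the damaged cue stays visible in the text.

```python
def decode_lossy(data: bytes, errors: str = "replace") -> str:
    """Decode bytes as UTF-8 without raising on invalid input.

    errors="replace" turns each invalid byte into U+FFFD;
    errors="ignore" drops it, as in the one-liner above.
    """
    return data.decode("utf-8", errors=errors)


def find_invalid_utf8(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for each position where strict decoding fails."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Strictly decode the remainder; success means nothing else is invalid.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            offset = pos + exc.start  # exc.start is relative to the slice
            bad.append((offset, data[offset]))
            pos = offset + 1  # resume just past the offending byte
    return bad


# Shortened sample based on the bytes shown in this issue.
sample = b"10g garam\n\x98\n"
for offset, value in find_invalid_utf8(sample):
    print(f"invalid byte 0x{value:02x} at offset {offset}")
text = decode_lossy(sample)
```

For a real file, read it in binary mode first (`open(path, 'rb').read()`) and pass the result to either helper. The scan re-slices the buffer after every bad byte, which is fine for subtitle-sized files but would be slow on very large inputs with many errors.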