pywhispercpp Potential UTF-8 / Latin-1 regression

[2024-09-20 09:12:13,861] {model.py:132} INFO - Transcribing ...

Traceback (most recent call last):
  File "/Users/user/Desktop/whisper-metal/__main__.py", line 4, in <module>
    segments = model.transcribe('file.mp3')
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Desktop/whisper-metal/.venv/lib/python3.12/site-packages/pywhispercpp/model.py", line 133, in transcribe
    res = self._transcribe(audio, n_processors=n_processors)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Desktop/whisper-metal/.venv/lib/python3.12/site-packages/pywhispercpp/model.py", line 249, in _transcribe
    res = Model._get_segments(self._ctx, 0, n)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Desktop/whisper-metal/.venv/lib/python3.12/site-packages/pywhispercpp/model.py", line 154, in _get_segments
    text = pw.whisper_full_get_segment_text(ctx, i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 57-58: invalid continuation byte

I think we have a regression!

Originally posted in https://github.com/abdeladim-s/pywhispercpp/issues/59#issuecomment-2363723366

Just generating a separate issue so we don't disrupt that thread. @abdeladim-s I'm assuming this has something to do with dropping pydub. Are we not normalizing values anymore?

Sep 20 '24 13:09 UsernamesLame

I uninstalled the pywhispercpp build I had installed from git and reinstalled it from pip. The regression is gone, but so is CoreML support.

Also CoreML is a lot slower than CPU inference on M1 Pro in macOS Sequoia.

Sep 20 '24 13:09 UsernamesLame

@abdeladim-s Wanna follow up on this? Or should I consider it a one-off?

Oct 07 '24 16:10 UsernamesLame

@UsernamesLame, I thought you were following along in #59. The issue was that the dylib files were not included in the wheel. I think the new build resolved the issue!

Oct 08 '24 00:10 absadiki

I had the same issue before when running the whispercpp command directly from Python. I think it would be better for pywhispercpp to return raw bytes and do the bytes-to-string conversion in Python, where invalid Unicode can be replaced. I created an issue and a potential PR for this here: https://github.com/absadiki/pywhispercpp/issues/92
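A minimal sketch of the decode-in-Python idea, assuming the binding handed back the segment text as `bytes`. The traceback above suggests it currently decodes to `str` inside the call, so that is an assumption. The helper name `decode_segment_text` and the sample payloads are hypothetical and only illustrate the standard `bytes.decode` error handlers:

```python
def decode_segment_text(raw: bytes, errors: str = "replace") -> str:
    """Decode raw segment bytes as UTF-8.

    With errors="replace", each invalid byte sequence becomes U+FFFD
    instead of raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors=errors)


# Hypothetical payload: "é" is the two bytes 0xC3 0xA9 in UTF-8.
# Dropping the last byte leaves an incomplete sequence, as could happen
# if a multi-byte character were split at a segment boundary.
truncated = "café".encode("utf-8")[:-1]  # b'caf\xc3'

try:
    truncated.decode("utf-8")  # strict decoding fails on the stray lead byte
except UnicodeDecodeError as exc:
    print(f"strict decode failed: {exc}")

print(decode_segment_text(truncated))  # prints 'caf' followed by U+FFFD
```

Passing `errors="ignore"` would drop the bad bytes silently instead. `"replace"` keeps a visible marker, which makes it easier to spot where the input was malformed.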

Dec 28 '24 22:12 andrewchen5678