Potential UTF-8 / Latin-1 regression

Open UsernamesLame opened this issue 1 year ago • 4 comments

[2024-09-20 09:12:13,861] {model.py:132} INFO - Transcribing ...

Traceback (most recent call last):
  File "/Users/user/Desktop/whisper-metal/__main__.py", line 4, in <module>
    segments = model.transcribe('file.mp3')
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Desktop/whisper-metal/.venv/lib/python3.12/site-packages/pywhispercpp/model.py", line 133, in transcribe
    res = self._transcribe(audio, n_processors=n_processors)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Desktop/whisper-metal/.venv/lib/python3.12/site-packages/pywhispercpp/model.py", line 249, in _transcribe
    res = Model._get_segments(self._ctx, 0, n)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Desktop/whisper-metal/.venv/lib/python3.12/site-packages/pywhispercpp/model.py", line 154, in _get_segments
    text = pw.whisper_full_get_segment_text(ctx, i)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 57-58: invalid continuation byte

I think we have a regression!
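For readers unfamiliar with this error, here is a minimal sketch of one way Python can raise "invalid continuation byte". The bytes are made up for illustration and are not taken from the failing transcription. A multi-byte UTF-8 character is cut short and then followed by an ordinary ASCII byte. Whether that is what happened here is an assumption, not something the traceback confirms.

```python
# Hypothetical input: two 3-byte UTF-8 characters, the second one cut short,
# followed by plain ASCII. This is NOT the data from the traceback above.
complete = "日本".encode("utf-8")      # 6 bytes: two 3-byte sequences
truncated = complete[:5] + b" ok"      # drop the last byte of the second character

error = None
try:
    truncated.decode("utf-8")          # strict decoding, the Python default
except UnicodeDecodeError as exc:
    error = exc                        # keep the exception for inspection

print(error.reason)                    # the decoder's description of the failure
print(error.start)                     # index of the first offending byte
```

Here the decoder expects one more continuation byte, finds a space, and gives up on the whole partial sequence.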

Originally posted in https://github.com/abdeladim-s/pywhispercpp/issues/59#issuecomment-2363723366

Just generating a separate issue so we don't disrupt that thread. @abdeladim-s I'm assuming this has something to do with dropping pydub. Are we not normalizing values anymore?

UsernamesLame avatar Sep 20 '24 13:09 UsernamesLame

I uninstalled the pywhispercpp build I had installed from git and reinstalled it from pip. The regression is gone, but so is CoreML support.

Also CoreML is a lot slower than CPU inference on M1 Pro in macOS Sequoia.

UsernamesLame avatar Sep 20 '24 13:09 UsernamesLame

@abdeladim-s Want to follow up on this, or should I consider it a one-off?

UsernamesLame avatar Oct 07 '24 16:10 UsernamesLame

@UsernamesLame, I thought you were following along in #59. The issue was that the dylib files were not included in the wheel. I think the new build resolved the issue!

absadiki avatar Oct 08 '24 00:10 absadiki

I had the same issue before when running the whispercpp command directly from Python. I think it would be better for pywhispercpp to return raw bytes and do the bytes-to-string conversion in Python, where invalid Unicode can be replaced. I created an issue and a potential PR for this here: https://github.com/absadiki/pywhispercpp/issues/92
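A minimal sketch of that idea, assuming the binding handed back `bytes` instead of `str`. The helper name `decode_segment_text` and the sample bytes are hypothetical and are not part of pywhispercpp's API. Only the standard `bytes.decode` call with its `errors` argument is used.

```python
def decode_segment_text(raw: bytes, errors: str = "replace") -> str:
    """Decode raw segment bytes as UTF-8.

    With errors="replace", invalid sequences become U+FFFD instead of
    raising UnicodeDecodeError. Hypothetical helper, not a pywhispercpp function.
    """
    return raw.decode("utf-8", errors=errors)


# Made-up bytes: valid "café ", then a truncated 3-byte sequence, then ASCII.
raw_segment = b"caf\xc3\xa9 \xe6\x9c tail"

text = decode_segment_text(raw_segment)
print(text)  # the invalid part shows up as the U+FFFD replacement character
```

Passing `errors="ignore"` would drop the bad bytes silently, while `errors="strict"` keeps the current raising behavior. Which default suits the library is a design choice for the maintainers.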

andrewchen5678 avatar Dec 28 '24 22:12 andrewchen5678