whisper-openvino
Switch to tiktoken-based tokenizer
Thanks for creating this fork of whisper!
The latest code is failing for me as follows:
pip install git+https://github.com/zhuzilin/whisper-openvino.git
whisper --language en --model tiny test_data/at_the_time.wav
Traceback (most recent call last):
  File "/home/azureuser/whisper-openvino-venv/bin/whisper", line 8, in <module>
    sys.exit(cli())
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/transcribe.py", line 286, in cli
    result = transcribe(model, audio_path, temperature=temperature, **args)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/transcribe.py", line 171, in transcribe
    result = decode_with_fallback(segment)[0]
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/transcribe.py", line 99, in decode_with_fallback
    results = model.decode(segment, options)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/decoding.py", line 695, in decode
    result = DecodingTask(model, options).run(mel)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/decoding.py", line 463, in __init__
    self.sot_index: int = self.initial_tokens.index(tokenizer.sot)
ValueError: tuple.index(x): x not in tuple
The get_tokenizer function and _get_single_token_id("<|startoftranscript|>") in Tokenizer disagree on the value of the sot token: 50258 in the former, 50335 in the latter.
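To show why that disagreement ends in the ValueError above, here is a minimal sketch that does not import whisper. The two IDs are the ones reported in this description; the helper find_sot_index, the constant names, and the one-element initial_tokens tuple are hypothetical stand-ins, and it is an assumption that the initial token sequence carries the get_tokenizer value while tokenizer.sot returns the other.

```python
# IDs reported in this PR description; which object holds which is an assumption.
SOT_FROM_GET_TOKENIZER = 50258
SOT_FROM_TOKENIZER_PROPERTY = 50335


def find_sot_index(initial_tokens, sot):
    """Return the position of sot in initial_tokens, or None if it is absent."""
    try:
        return initial_tokens.index(sot)
    except ValueError:
        # Same failure mode as the traceback: tuple.index(x): x not in tuple
        return None


# Hypothetical initial token sequence built from the get_tokenizer value.
initial_tokens = (SOT_FROM_GET_TOKENIZER,)

consistent = find_sot_index(initial_tokens, SOT_FROM_GET_TOKENIZER)
mismatched = find_sot_index(initial_tokens, SOT_FROM_TOKENIZER_PROPERTY)
print(consistent, mismatched)
```

When both code paths agree on the ID the lookup succeeds; when they differ, tuple.index has nothing to find, which matches the failure in decoding.py.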
I've been able to fix this by bringing in the latest tokenizer.py from upstream, along with the associated tiktoken dependency and token files. This PR contains those changes. It's not a full catch-up merge with upstream.