Whisper support
Initial support for Whisper: the model loads and runs inference, but the outputs are garbage. Example script: https://github.com/mesolitica/vllm-whisper/blob/main/examples/whisper_example.py. There might be bugs related to the weights or attention. A few remaining hiccups:
- Still trying to figure out a KV cache for the encoder hidden states; otherwise each decoding step recomputes the encoder hidden states.
- There is no non-causal attention for the encoder, or for cross-attention in the decoder. All attention implementations in vLLM seem to be causal, so for now I just use xops.memory_efficient_attention_forward like the T5 branch (see the sketch below). This is not ideal because vLLM has its own attention backends.
- Reuse the cross-attention KV cache from the first step for the subsequent steps.
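For context, here is a minimal sketch of that non-causal path with xFormers; the helper name and tensor shapes are illustrative assumptions, not code from this branch.

```python
# Illustrative sketch only: the helper name and tensor shapes are assumptions,
# not code from the vllm-whisper branch. It shows the non-causal path that the
# encoder and the decoder's cross-attention need, which vLLM's causal
# paged-attention backends did not cover at the time.
import torch
import xformers.ops as xops

def non_causal_attention(
    q: torch.Tensor,  # [batch, q_len, num_heads, head_dim]
    k: torch.Tensor,  # [batch, kv_len, num_heads, head_dim]
    v: torch.Tensor,  # [batch, kv_len, num_heads, head_dim]
) -> torch.Tensor:
    # attn_bias=None means no causal mask: every query attends to every key.
    # (Causal decoder self-attention would pass xops.LowerTriangularMask().)
    return xops.memory_efficient_attention_forward(q, k, v, attn_bias=None)

# Encoder self-attention: q, k and v all come from the encoder hidden states.
# Decoder cross-attention: q comes from the decoder, k/v from the encoder
# output, which is why caching the encoder K/V after the first step matters.
```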
Thank you for the PR! @huseinzol05
Currently our infrastructure support for encoder-decoder models is still WIP (@robertgshaw2-neuralmagic should be able to provide more context here), so I think it's probably a good idea to hold off on Whisper until the underlying infra is ready.
PRs for infrastructure are about to land
- [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) #4888: https://github.com/vllm-project/vllm/pull/4888
- [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) #4942: https://github.com/vllm-project/vllm/pull/4942
We are starting to build models on top of this (starting with BART to keep it simple)
Could you build this PR on top of #4942?
cc @afeldman-nm
Seconding what @robertgshaw2-neuralmagic said - #4942 provides the support for encoder attention & cross-attention KV cache which Whisper will need.
I am planning to have BART working by EOD today or thereabouts, which can serve as an example of implementing an encoder/decoder model. Hoping to have all tests passing soon.
Hopefully you can try building your Whisper implementation on top of #4942; it would be great to know if you run into any issues.
At the level of kernel invocation, Attention.forward() now has an attn_type argument which consumes one of three possible AttentionType enum values: ENCODER (encoder attention), DECODER (decoder self-attention), ENCODER_DECODER (encoder/decoder cross-attention):
https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/tests/kernels/test_encoder_decoder_attn.py#L697-L702
The following new attn_metadata fields enable the attn_type=ENCODER and attn_type=ENCODER_DECODER scenarios:
https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/tests/kernels/utils.py#L862-L869
Specifically, cross_block_tables and cross_slot_mapping hold the block tables and slot mappings for the cross-attention KV cache.
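To make the call shape concrete, here is a rough sketch of how a decoder layer might route its two attention flavours through the same Attention.forward() call. The signature is inferred from the test linked above, QKV projections and reshapes are omitted, and none of this is copied from #4942.

```python
# Rough sketch inferred from the description above; not code from #4942.
# QKV projections, reshapes and layer norms are omitted for brevity.
from vllm.attention import Attention, AttentionMetadata, AttentionType

def decoder_layer_attention(
    self_attn: Attention,       # decoder self-attention module
    cross_attn: Attention,      # encoder/decoder cross-attention module
    decoder_hidden,             # [num_decoder_tokens, hidden_size]
    encoder_hidden,             # [num_encoder_tokens, hidden_size]
    self_kv_cache,              # addressed via slot_mapping / block_tables
    cross_kv_cache,             # addressed via cross_slot_mapping / cross_block_tables
    attn_metadata: AttentionMetadata,
):
    # Causal self-attention over the decoder tokens.
    hidden = self_attn(decoder_hidden, decoder_hidden, decoder_hidden,
                       self_kv_cache, attn_metadata,
                       attn_type=AttentionType.DECODER)

    # Non-causal cross-attention: queries from the decoder, keys/values from
    # the encoder output. On later steps the keys/values are reused from the
    # cross-attention KV cache instead of being recomputed.
    hidden = cross_attn(hidden, encoder_hidden, encoder_hidden,
                        cross_kv_cache, attn_metadata,
                        attn_type=AttentionType.ENCODER_DECODER)
    return hidden
```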
There are also additional changes to support scheduling & adding requests for encoder/decoder models; you can see an example of invoking BART below:
[WIP] BART model invocation example https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/examples/offline_inference_encoder_decoder.py
Some additional example code (links are to files in #4942 ):
[WIP] BART model implementation: https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/vllm/model_executor/models/bart.py
[WIP] BART e2e test: compare output logits against HuggingFace implementation https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/tests/models/test_bart.py
Sure! I will look at that branch.
@afeldman-nm, let me solve the trashed outputs first; after that I will upstream to https://github.com/vllm-project/vllm/pull/4942
Solved the trashed outputs and added CUDA graph support.
Streaming SRT format,
https://github.com/vllm-project/vllm/assets/19810909/05b65a96-6f9f-4919-ada9-64606ce5357b
Streaming JSON format,
https://github.com/vllm-project/vllm/assets/19810909/cc42b79f-9953-4e45-8eef-19a43ec9f02d
@robertgshaw2-neuralmagic we posted a blog about this, https://mesolitica.com/blog/vllm-whisper
Hi @huseinzol05 this is great, I gave your blog a look.
FYI:
#4888 took a little longer than expected, but it has now landed, enabling the xFormers backend to support encoder attention, decoder self-attention, and encoder/decoder cross-attention. #4837 and #4888 (both of which have landed) were prerequisites for #4942, which completes end-to-end support for encoder/decoder models with the xFormers backend and also introduces the BART model into vLLM. #4942 is still WIP, but I am hoping to complete it soon.
Nice! Is there anything you guys need help with that I can pick up?
Thanks for your efforts, Husein. Will this implementation support continuous batching?
Hi @huseinzol05, I used your repo at https://github.com/mesolitica/vllm-whisper/ to host openai/whisper-large-v3 on my own machine with an A100 80GB, but it was not successful.
The error:
python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward return F.conv1d(input, weight, bias, self.stride, RuntimeError: GET was unable to find an engine to execute this computation
Do you have any update to fix this, or can you provide your Python library versions?
Thank you.
Yes, it will support continuous batching.
Below are my steps to run,
pip3.10 install git+https://github.com/mesolitica/vllm-whisper
python3.10 -m vllm.entrypoints.openai.api_server --model openai/whisper-large-v3 --dtype bfloat16 --whisper-input-type input_features --max-model-len 448 --max-size-mb-whisper 100
wget https://github.com/mesolitica/malaya-speech/raw/master/speech/7021-79759-0004.wav
curl -X 'POST' 'http://localhost:8000/audio/transcriptions' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@7021-79759-0004.wav;type=audio/mpeg' \
-F 'model=whisper' \
-F 'response_format=json' \
-F 'stream=true'
output,
data: {"token": "<|en|><|0.0|>"}
data: {"token": " without"}
data: {"token": " going"}
data: {"token": " to"}
data: {"token": " any"}
...
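For anyone who prefers Python over curl, here is a small client sketch for the same streaming endpoint. The URL, form fields, and the data: {...} framing mirror the curl example and output above; the requests library is assumed.

```python
# Client-side sketch for the streaming transcription endpoint shown above.
# The URL and form fields mirror the curl command; the parsing assumes the
# "data: {...}" lines seen in the sample output.
import json
import requests

url = "http://localhost:8000/audio/transcriptions"
with open("7021-79759-0004.wav", "rb") as f:
    resp = requests.post(
        url,
        files={"file": ("7021-79759-0004.wav", f, "audio/mpeg")},
        data={"model": "whisper", "response_format": "json", "stream": "true"},
        stream=True,
    )
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue
        payload = json.loads(line[len("data:"):].strip())
        print(payload["token"], end="", flush=True)
print()
```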
my pip freeze,
accelerate==0.32.1
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==3.7.1
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.2.0
auto-gptq @ file:///home/ubuntu/AutoGPTQ/dist/auto_gptq-0.8.0.dev0%2Bcu1210-cp310-cp310-linux_x86_64.whl
beautifulsoup4==4.12.3
bleach==6.1.0
certifi==2024.7.4
cffi==1.16.0
chardet==3.0.4
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
cmake==3.30.0
comm==0.2.2
datasets==2.20.0
dbus-python==1.2.16
debugpy==1.8.2
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
distro-info==0.23+ubuntu1.1
dnspython==2.6.1
email_validator==2.2.0
exceptiongroup==1.2.1
executing==2.0.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastjsonschema==2.20.0
filelock==3.15.4
frozenlist==1.4.1
fsspec==2024.5.0
gekko==1.2.1
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.4
idna==2.8
interegular==0.3.3
ipykernel==6.29.5
ipython==8.21.0
ipython-genutils==0.2.0
ipywidgets==8.1.3
jedi==0.19.1
Jinja2==3.1.4
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-server==1.18.0
jupyter-server-proxy==3.2.1
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyterlab_pygments==0.3.0
jupyterlab_widgets==3.0.11
lark==1.1.9
llvmlite==0.43.0
lm-format-enforcer==0.10.1
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mdurl==0.1.2
mistune==3.0.2
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
ninja==1.11.1.1
notebook==6.4.12
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.555.43
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.82
nvidia-nvtx-cu12==12.1.105
openai==1.35.13
orjson==3.10.6
outlines==0.0.46
packaging==24.1
pandas==2.2.2
pandocfilters==1.5.1
parso==0.8.4
peft==0.11.1
pexpect==4.9.0
pillow==10.4.0
platformdirs==4.2.2
prometheus-fastapi-instrumentator==7.0.0
prometheus_client==0.20.0
prompt_toolkit==3.0.47
protobuf==5.27.2
psutil==6.0.0
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==16.1.0
pyarrow-hotfix==0.6
pycountry==24.6.1
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
PyGObject==3.36.0
python-apt==2.0.1+ubuntu0.20.4.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
pyzmq==26.0.3
qtconsole==5.5.2
QtPy==2.4.1
ray==2.32.0
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
requests-unixsocket==0.2.0
rich==13.7.1
rouge==1.0.1
rpds-py==0.18.1
safetensors==0.4.3
Send2Trash==1.8.3
sentencepiece==0.2.0
shellingham==1.5.4
simpervisor==1.0.0
six==1.14.0
sniffio==1.3.1
soupsieve==2.5
stack-data==0.6.3
starlette==0.37.2
sympy==1.12.1
terminado==0.18.1
threadpoolctl==3.5.0
tiktoken==0.7.0
tinycss2==1.3.0
tokenizers==0.19.1
torch==2.3.0
torchaudio==2.3.0
torchvision==0.18.0
tornado==6.4.1
tqdm==4.66.4
traitlets==5.9.0
transformers==4.42.3
triton==2.3.0
typer==0.12.3
typing_extensions==4.12.2
tzdata==2024.1
ujson==5.10.0
unattended-upgrades==0.1
urllib3==2.2.2
uvicorn==0.30.1
uvloop==0.19.0
vllm @ git+https://github.com/mesolitica/vllm-whisper@fa81def0aab015cf183b662ea8cb2d89ab1be428
vllm-flash-attn==2.5.9
watchfiles==0.22.0
wcwidth==0.2.13
webencodings==0.5.1
websocket-client==1.8.0
websockets==12.0
widgetsnbextension==4.0.11
xformers==0.0.26.post1
xxhash==3.4.1
yarl==1.9.4
@huseinzol05 I created a new Conda environment and was able to load and infer using
python -m vllm.entrypoints.openai.api_server --model openai/whisper-large-v3 --dtype bfloat16 --whisper-input-type input_features --max-model-len 448 --max-size-mb-whisper 100 --gpu_memory_utilization=0.80. Works smoothly
I tried to use whisper_example.py from the examples folder:
- Trying llm = LLM(model="openai/whisper-large-v3", max_num_seqs=1, max_model_len=448, gpu_memory_utilization=0.4, dtype='bfloat16') succeeds if you add whisper_input_type="input_features".
- Generation with output_lang = llm.generate({"prompt_token_ids": [50258], "multi_modal_data": AudioData(y)}, sampling_params=SamplingParams(max_tokens=1, temperature=0)) fails with the error "Multi-modal inputs are only supported by vision language models."
@dkakaie my bad, this Whisper implementation should not use the multimodal interface; I fixed the example: https://github.com/vllm-project/vllm/pull/5964/commits/ebf1cbfd77a42ef5772b1fbfa78c998620cc7e9e
@huseinzol05 your latest commit brought it to life, working like a charm. I don't know if, taking vLLM into account, it's technically possible to have token/word-level timestamps?
Hi @huseinzol05 Thanks for the whisper support and the example.
Can you make whisper_example.py also support long audio (> 30 seconds)? The example currently works up to the first 30 sec of a long audio.
Long audio works smoothly when running the application.
Long audio has all different kinds of strategies. It's not really an integrated part of the "model" part of Whisper. It's also rather complicated, and I am not sure how well it fits with vLLM's other abstractions. I think long-form Whisper has enough exceptions compared to LLMs that it might be better to implement it as decoupled as possible for the time being; plus, Whisper is somewhat overdue and will probably be replaced in the semi-short term. I don't think many of Whisper's particularities will be very relevant in the future.
@afeldman-nm sorry to tag you, but I think this is something for the vLLM team to consider. I am not sure what is currently implemented, but option 2 below, plus always using "<|notimestamps|>" as a forced token, is probably vLLM's best fit.
I think there are basically 4 options for long form chunking:
- Decoding as described in the Whisper paper: use the last decoder tag to see where the model stopped and feed in the audio again from that point on.
- Sliding window without overlap.
- Sliding window with overlap.
- Use VAD to create chunks.
In my (subjective) experience, although I might have dinged up some implementations/evaluations:
- Quality-wise it's 4 > 3 > 1 > 2, probably because 2 creates windows that start in the middle of a sentence/word. 3 and 4 carry quite some extra complexity: stitching results back together for 3, and an additional model plus options for 4.
- Speed-wise it's 2/4 > 3 >>> 1, mostly because 2, 3 and 4 are parallelizable. Even when using previous context in the decoder (which matters less than you'd think), it's still a lot faster despite some parts having to run sequentially.
- For implementation complexity it's probably 2 > 1 > 4 > 3. For 3 it mostly depends on how the overlap should be stitched back together.
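To make option 3 a little more concrete, here is a rough sketch of overlapped windowing with naive stitching. transcribe_window() is a hypothetical helper standing in for a single Whisper call on one window of at most 30 s, and the sample rate and overlap are arbitrary choices.

```python
# Rough sketch of option 3 (sliding window with overlap). transcribe_window()
# is a hypothetical helper that runs one <=30 s chunk through Whisper; the
# stitching here is intentionally naive.
SAMPLE_RATE = 16_000
WINDOW_S = 30.0
OVERLAP_S = 5.0

def chunk_with_overlap(audio):
    """Yield overlapping windows of samples from a 1-D waveform."""
    window = int(WINDOW_S * SAMPLE_RATE)
    step = int((WINDOW_S - OVERLAP_S) * SAMPLE_RATE)
    for start in range(0, max(len(audio) - 1, 1), step):
        yield audio[start:start + window]
        if start + window >= len(audio):
            break

def transcribe_long(audio, transcribe_window):
    # The windows are independent, so they can be submitted to vLLM in
    # parallel and batched continuously; stitching happens afterwards.
    texts = [transcribe_window(w) for w in chunk_with_overlap(audio)]
    stitched = texts[0] if texts else ""
    for prev, cur in zip(texts, texts[1:]):
        # Naive overlap handling: drop the longest suffix of the previous
        # text that is also a prefix of the current one.
        k = 0
        for n in range(1, min(len(prev), len(cur)) + 1):
            if prev.endswith(cur[:n]):
                k = n
        stitched += cur[k:]
    return stitched
```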
Thanks @MarktHart, I will incorporate this into the encoder/decoder RFC I am working on. "Basic" encoder/decoder model support should land soon; the RFC covers the significant follow-on work involved in maturing encoder/decoder support to a degree that is commensurate with decoder-only support (i.e. adding more encoder/decoder models like Whisper, feature compatibility with encoder/decoder, etc.).
I have not studied the audio-length problem you are discussing in depth just yet; my guess is it will impact three key parts of the vLLM encoder/decoder model inference process:
- The semantics of submitting a request to vLLM (i.e. how does a single vLLM request map onto your "four basic options for long-form chunking")
- The information which a vLLM request must return to the caller in order to know where transcription left off
- The process of injecting control tokens (i.e. <|notimestamps|>, language choice, task, etc.) into Whisper decoder input during the autoregressive decoding process
Thoughts @MarktHart @huseinzol05 ?
CC @robertgshaw2-neuralmagic
@MarktHart if you use the FastAPI entrypoints, it can process long audio: https://mesolitica.com/blog/vllm-whisper#Process-any-length-of-audio-using-Torchaudio
Basically it's naive chunking, i.e. a non-overlapping sliding window: it uses TorchAudio to stream the audio in 1-second segments, and once 30 seconds have accumulated it passes them to Whisper to decode (see the outline below). You can check the implementation at https://github.com/mesolitica/vllm-whisper/blob/main/vllm/entrypoints/openai/serving_whisper.py
- Feel free to add an overlapping sliding window.
- Feel free to parallelize the sliding windows, because the model supports continuous batching.
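A rough outline of that loop, not the actual serving_whisper.py code: torchaudio's StreamReader provides the 1-second streaming, and transcribe() is a hypothetical stand-in for the call into the vLLM Whisper engine.

```python
# Outline of the chunking loop described above; not the actual
# serving_whisper.py code. transcribe() is a hypothetical stand-in for the
# call into the vLLM Whisper engine.
import torch
from torchaudio.io import StreamReader

SAMPLE_RATE = 16_000
CHUNK_S = 1          # stream the audio 1 second at a time
MAX_WINDOW_S = 30    # Whisper decodes at most 30 s of audio per call

def stream_transcribe(path_or_fileobj, transcribe):
    reader = StreamReader(path_or_fileobj)
    reader.add_basic_audio_stream(
        frames_per_chunk=CHUNK_S * SAMPLE_RATE, sample_rate=SAMPLE_RATE
    )
    buffer = []
    for (chunk,) in reader.stream():
        buffer.append(chunk.mean(dim=-1))           # downmix to mono
        if sum(b.numel() for b in buffer) >= MAX_WINDOW_S * SAMPLE_RATE:
            yield transcribe(torch.cat(buffer))     # decode a full 30 s window
            buffer = []
    if buffer:                                      # trailing partial window
        yield transcribe(torch.cat(buffer))
```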
Thanks @huseinzol05. I have tried naive chunking; it has good speed but causes a big increase in WER for long audio. Is it possible to implement VAD-based batching? It requires an additional (VAD) model, but it works best since the model natively supports batching. I am also eagerly waiting for the caching speedup of the encoder outputs and cross-attention layers.
When I get free time, I will try to add an overlapping sliding window like the HuggingFace implementation, or VAD.
Whisper tiny.en/small.en/medium.en aren't transcribing well (quality is low), while Whisper tiny/small/medium are doing well. Can someone please explain this?
Can you please add LoRA support for Whisper?
Regarding the .en models: if you check the source code, the first predicted token is the language token; the .en models probably predicted a different language token, so the subsequent tokens got messed up.
Hello @huseinzol05, thanks for your contribution.
I successfully started your fork on my A100 80GB GPU. Are you sure about continuous batching? I noticed that if I query with several different audios, it mixes up output tokens between them.
@Temirulan messed up in what way? Totally gibberish?
Hey, just checking in! Do you have any updates on the status of this pull request? Curious when it might be ready to merge. 😊
@Temirulan please also share the inference code showing how you do the concurrency.