Whisper support
Initial support for Whisper: the model loads and runs inference, but the outputs are garbage. Example script: https://github.com/mesolitica/vllm-whisper/blob/main/examples/whisper_example.py. There might be bugs related to the weights or attention. A few remaining hiccups:
- Still trying to figure out a KV cache for the encoder hidden states; otherwise each decoding step recomputes the encoder hidden states.
- There is no non-causal attention for the encoder, or for cross-attention in the decoder. All attention implementations in vLLM seem to be causal, so for now I just use xops.memory_efficient_attention_forward like the T5 branch (see the sketch below). This is not ideal because vLLM has its own attention backends.
- Reuse the cross-attention KV cache from the first step for the subsequent steps.
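For context, here is a minimal sketch of that non-causal path with xFormers; the helper name and tensor shapes are illustrative assumptions, not code from this branch.

```python
# Illustrative sketch only: the helper name and tensor shapes are assumptions,
# not code from the vllm-whisper branch. It shows the non-causal path that the
# encoder and the decoder's cross-attention need, which vLLM's causal
# paged-attention backends did not cover at the time.
import torch
import xformers.ops as xops

def non_causal_attention(
    q: torch.Tensor,  # [batch, q_len, num_heads, head_dim]
    k: torch.Tensor,  # [batch, kv_len, num_heads, head_dim]
    v: torch.Tensor,  # [batch, kv_len, num_heads, head_dim]
) -> torch.Tensor:
    # attn_bias=None means no causal mask: every query attends to every key.
    # (Causal decoder self-attention would pass xops.LowerTriangularMask().)
    return xops.memory_efficient_attention_forward(q, k, v, attn_bias=None)

# Encoder self-attention: q, k and v all come from the encoder hidden states.
# Decoder cross-attention: q comes from the decoder, k/v from the encoder
# output, which is why caching the encoder K/V after the first step matters.
```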
Thank you for the PR! @huseinzol05
Currently our infrastructure support for encoder-decoder models is still WIP (@robertgshaw2-neuralmagic should be able to provide more context here), so I think it's probably a good idea to hold off on Whisper until the underlying infra is ready.
PRs for infrastructure are about to land
- [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) #4888: https://github.com/vllm-project/vllm/pull/4888
- [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) #4942: https://github.com/vllm-project/vllm/pull/4942
We are starting to build models on top of this (starting with BART to keep it simple)
Could you build this PR on top of #4942?
cc @afeldman-nm
Seconding what @robertgshaw2-neuralmagic said - #4942 provides the support for encoder attention & cross-attention KV cache which Whisper will need.
I am planning to have BART working by EOD today or thereabouts, which can serve as an example of implementing an encoder/decoder model. Hoping to have all tests passing soon.
Hopefully you can try building your Whisper implementation on top of #4942; it would be great to know if you run into any issues.
At the level of kernel invocation, Attention.forward() now has an attn_type argument which consumes one of three possible AttentionType enum values: ENCODER (encoder attention), DECODER (decoder self-attention), ENCODER_DECODER (encoder/decoder cross-attention):
https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/tests/kernels/test_encoder_decoder_attn.py#L697-L702
The following new attn_metadata fields enable the attn_type=ENCODER and attn_type=ENCODER_DECODER scenarios:
https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/tests/kernels/utils.py#L862-L869
Specifically, cross_block_tables and cross_slot_mapping hold the block tables and slot mappings for the cross-attention KV cache.
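To make the call shape concrete, here is a rough sketch of how a decoder layer might route its two attention flavours through the same Attention.forward() call. The signature is inferred from the test linked above, QKV projections and reshapes are omitted, and none of this is copied from #4942.

```python
# Rough sketch inferred from the description above; not code from #4942.
# QKV projections, reshapes and layer norms are omitted for brevity.
from vllm.attention import Attention, AttentionMetadata, AttentionType

def decoder_layer_attention(
    self_attn: Attention,       # decoder self-attention module
    cross_attn: Attention,      # encoder/decoder cross-attention module
    decoder_hidden,             # [num_decoder_tokens, hidden_size]
    encoder_hidden,             # [num_encoder_tokens, hidden_size]
    self_kv_cache,              # addressed via slot_mapping / block_tables
    cross_kv_cache,             # addressed via cross_slot_mapping / cross_block_tables
    attn_metadata: AttentionMetadata,
):
    # Causal self-attention over the decoder tokens.
    hidden = self_attn(decoder_hidden, decoder_hidden, decoder_hidden,
                       self_kv_cache, attn_metadata,
                       attn_type=AttentionType.DECODER)

    # Non-causal cross-attention: queries from the decoder, keys/values from
    # the encoder output. On later steps the keys/values are reused from the
    # cross-attention KV cache instead of being recomputed.
    hidden = cross_attn(hidden, encoder_hidden, encoder_hidden,
                        cross_kv_cache, attn_metadata,
                        attn_type=AttentionType.ENCODER_DECODER)
    return hidden
```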
There are also additional changes to support scheduling & adding requests for encoder/decoder models; you can see an example of invoking BART below:
[WIP] BART model invocation example https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/examples/offline_inference_encoder_decoder.py
Some additional example code (links are to files in #4942 ):
[WIP] BART model implementation: https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/vllm/model_executor/models/bart.py
[WIP] BART e2e test: compare output logits against HuggingFace implementation https://github.com/neuralmagic/nm-vllm/blob/a5c28fca8f5e21653c6e5874719467e08d3d8503/tests/models/test_bart.py
Sure! I will look at that branch.
@afeldman-nm, let me solve the trashed outputs first; after that I will upstream to https://github.com/vllm-project/vllm/pull/4942
Solved the trashed outputs and added CUDA graph support.
Streaming SRT format,
https://github.com/vllm-project/vllm/assets/19810909/05b65a96-6f9f-4919-ada9-64606ce5357b
Streaming JSON format,
https://github.com/vllm-project/vllm/assets/19810909/cc42b79f-9953-4e45-8eef-19a43ec9f02d
@robertgshaw2-neuralmagic we posted a blog about this, https://mesolitica.com/blog/vllm-whisper
Hi @huseinzol05 this is great, I gave your blog a look.
FYI:
#4888 took a little longer than expected, but it has now landed, enabling the xFormers backend to support encoder attention, decoder self-attention, and encoder/decoder cross-attention. #4837 and #4888 (both of which have landed) were prerequisites for #4942, which completes end-to-end support for encoder/decoder models with the xFormers backend and also introduces the BART model into vLLM. #4942 is still WIP, but I am hoping to complete it soon.
Nice! Is there anything you guys need help with that I can pick up?
Thanks for your efforts, Husein. Will this implementation support continuous batching?
Hi @huseinzol05, I used your repo at https://github.com/mesolitica/vllm-whisper/ to host openai/whisper-large-v3 on my own machine with an A100 80GB, but it was not successful.
The error:
python3.10/site-packages/torch/nn/modules/conv.py", line 306, in _conv_forward return F.conv1d(input, weight, bias, self.stride, RuntimeError: GET was unable to find an engine to execute this computation
Do you have any update to fix this, or can you provide your Python library versions?
Thank you.
Yes, it will support continuous batching.
Below are my steps to run,
pip3.10 install git+https://github.com/mesolitica/vllm-whisper
python3.10 -m vllm.entrypoints.openai.api_server --model openai/whisper-large-v3 --dtype bfloat16 --whisper-input-type input_features --max-model-len 448 --max-size-mb-whisper 100
wget https://github.com/mesolitica/malaya-speech/raw/master/speech/7021-79759-0004.wav
curl -X 'POST' 'http://localhost:8000/audio/transcriptions' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@7021-79759-0004.wav;type=audio/mpeg' \
-F 'model=whisper' \
-F 'response_format=json' \
-F 'stream=true'
output,
data: {"token": "<|en|><|0.0|>"}
data: {"token": " without"}
data: {"token": " going"}
data: {"token": " to"}
data: {"token": " any"}
...
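For anyone who prefers Python over curl, here is a small client sketch for the same streaming endpoint. The URL, form fields, and the data: {...} framing mirror the curl example and output above; the requests library is assumed.

```python
# Client-side sketch for the streaming transcription endpoint shown above.
# The URL and form fields mirror the curl command; the parsing assumes the
# "data: {...}" lines seen in the sample output.
import json
import requests

url = "http://localhost:8000/audio/transcriptions"
with open("7021-79759-0004.wav", "rb") as f:
    resp = requests.post(
        url,
        files={"file": ("7021-79759-0004.wav", f, "audio/mpeg")},
        data={"model": "whisper", "response_format": "json", "stream": "true"},
        stream=True,
    )
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue
        payload = json.loads(line[len("data:"):].strip())
        print(payload["token"], end="", flush=True)
print()
```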
my pip freeze,
accelerate==0.32.1
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==3.7.1
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
asttokens==2.4.1
async-timeout==4.0.3
attrs==23.2.0
auto-gptq @ file:///home/ubuntu/AutoGPTQ/dist/auto_gptq-0.8.0.dev0%2Bcu1210-cp310-cp310-linux_x86_64.whl
beautifulsoup4==4.12.3
bleach==6.1.0
certifi==2024.7.4
cffi==1.16.0
chardet==3.0.4
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
cmake==3.30.0
comm==0.2.2
datasets==2.20.0
dbus-python==1.2.16
debugpy==1.8.2
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
distro-info==0.23+ubuntu1.1
dnspython==2.6.1
email_validator==2.2.0
exceptiongroup==1.2.1
executing==2.0.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastjsonschema==2.20.0
filelock==3.15.4
frozenlist==1.4.1
fsspec==2024.5.0
gekko==1.2.1
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.4
idna==2.8
interegular==0.3.3
ipykernel==6.29.5
ipython==8.21.0
ipython-genutils==0.2.0
ipywidgets==8.1.3
jedi==0.19.1
Jinja2==3.1.4
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-server==1.18.0
jupyter-server-proxy==3.2.1
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyterlab_pygments==0.3.0
jupyterlab_widgets==3.0.11
lark==1.1.9
llvmlite==0.43.0
lm-format-enforcer==0.10.1
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mdurl==0.1.2
mistune==3.0.2
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
ninja==1.11.1.1
notebook==6.4.12
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.555.43
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.82
nvidia-nvtx-cu12==12.1.105
openai==1.35.13
orjson==3.10.6
outlines==0.0.46
packaging==24.1
pandas==2.2.2
pandocfilters==1.5.1
parso==0.8.4
peft==0.11.1
pexpect==4.9.0
pillow==10.4.0
platformdirs==4.2.2
prometheus-fastapi-instrumentator==7.0.0
prometheus_client==0.20.0
prompt_toolkit==3.0.47
protobuf==5.27.2
psutil==6.0.0
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==16.1.0
pyarrow-hotfix==0.6
pycountry==24.6.1
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
PyGObject==3.36.0
python-apt==2.0.1+ubuntu0.20.4.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
pyzmq==26.0.3
qtconsole==5.5.2
QtPy==2.4.1
ray==2.32.0
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
requests-unixsocket==0.2.0
rich==13.7.1
rouge==1.0.1
rpds-py==0.18.1
safetensors==0.4.3
Send2Trash==1.8.3
sentencepiece==0.2.0
shellingham==1.5.4
simpervisor==1.0.0
six==1.14.0
sniffio==1.3.1
soupsieve==2.5
stack-data==0.6.3
starlette==0.37.2
sympy==1.12.1
terminado==0.18.1
threadpoolctl==3.5.0
tiktoken==0.7.0
tinycss2==1.3.0
tokenizers==0.19.1
torch==2.3.0
torchaudio==2.3.0
torchvision==0.18.0
tornado==6.4.1
tqdm==4.66.4
traitlets==5.9.0
transformers==4.42.3
triton==2.3.0
typer==0.12.3
typing_extensions==4.12.2
tzdata==2024.1
ujson==5.10.0
unattended-upgrades==0.1
urllib3==2.2.2
uvicorn==0.30.1
uvloop==0.19.0
vllm @ git+https://github.com/mesolitica/vllm-whisper@fa81def0aab015cf183b662ea8cb2d89ab1be428
vllm-flash-attn==2.5.9
watchfiles==0.22.0
wcwidth==0.2.13
webencodings==0.5.1
websocket-client==1.8.0
websockets==12.0
widgetsnbextension==4.0.11
xformers==0.0.26.post1
xxhash==3.4.1
yarl==1.9.4
@huseinzol05 I created a new Conda environment and was able to load and infer using
python -m vllm.entrypoints.openai.api_server --model openai/whisper-large-v3 --dtype bfloat16 --whisper-input-type input_features --max-model-len 448 --max-size-mb-whisper 100 --gpu_memory_utilization=0.80. Works smoothly
I tried to use whisper_example.py from the examples folder:
- Trying llm = LLM(model="openai/whisper-large-v3", max_num_seqs=1, max_model_len=448, gpu_memory_utilization=0.4, dtype='bfloat16') succeeds if you add whisper_input_type="input_features".
- Generation with output_lang = llm.generate({"prompt_token_ids": [50258], "multi_modal_data": AudioData(y)}, sampling_params=SamplingParams(max_tokens=1, temperature=0)) fails with the error "Multi-modal inputs are only supported by vision language models."
@dkakaie my bad, this Whisper implementation should not use the multimodal interface; I fixed the example: https://github.com/vllm-project/vllm/pull/5964/commits/ebf1cbfd77a42ef5772b1fbfa78c998620cc7e9e
@huseinzol05 your latest commit brought it to life, working like a charm. I don't know if, taking vLLM into account, it's technically possible to have token/word-level timestamps?
Hi @huseinzol05 Thanks for the whisper support and the example.
Can you make whisper_example.py also support long audio (> 30 seconds)? The example currently works up to the first 30 sec of a long audio.
Long audio works smoothly when running the application.
Long audio has all different kinds of strategies. It's not really an integrated part of the "model" part of Whisper. It's also rather complicated, and I am not sure how well it fits with vLLM's other abstractions. I think long-form Whisper has enough exceptions compared to LLMs that it might be better to implement it as decoupled as possible for the time being; plus, Whisper is somewhat overdue and will probably be replaced in the semi-short term. I don't think many of Whisper's particularities will be very relevant in the future.
@afeldman-nm sorry to tag you, but I think this is something for the vLLM team to consider. I am not sure what is currently implemented, but option 2 below, plus always using "<|notimestamps|>" as a forced token, is probably vLLM's best fit.
I think there are basically 4 options for long form chunking:
- Decoding as described in the Whisper paper: use the last decoder tag to see where the model stopped and feed in the audio again from that point on.
- Sliding window without overlap.
- Sliding window with overlap.
- Use VAD to create chunks.
In my (subjective) experience, although I might have dinged up some implementations/evaluations:
- Quality-wise it's 4 > 3 > 1 > 2, probably because 2 creates windows that start in the middle of a sentence/word. 3 and 4 carry quite some extra complexity: stitching results back together for 3, and an additional model plus options for 4.
- Speed-wise it's 2/4 > 3 >>> 1, mostly because 2, 3 and 4 are parallelizable. Even when using previous context in the decoder (which matters less than you'd think), it's still a lot faster despite some parts having to run sequentially.
- For implementation complexity it's probably 2 > 1 > 4 > 3. For 3 it mostly depends on how the overlap should be stitched back together.
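To make option 3 a little more concrete, here is a rough sketch of overlapped windowing with naive stitching. transcribe_window() is a hypothetical helper standing in for a single Whisper call on one window of at most 30 s, and the sample rate and overlap are arbitrary choices.

```python
# Rough sketch of option 3 (sliding window with overlap). transcribe_window()
# is a hypothetical helper that runs one <=30 s chunk through Whisper; the
# stitching here is intentionally naive.
SAMPLE_RATE = 16_000
WINDOW_S = 30.0
OVERLAP_S = 5.0

def chunk_with_overlap(audio):
    """Yield overlapping windows of samples from a 1-D waveform."""
    window = int(WINDOW_S * SAMPLE_RATE)
    step = int((WINDOW_S - OVERLAP_S) * SAMPLE_RATE)
    for start in range(0, max(len(audio) - 1, 1), step):
        yield audio[start:start + window]
        if start + window >= len(audio):
            break

def transcribe_long(audio, transcribe_window):
    # The windows are independent, so they can be submitted to vLLM in
    # parallel and batched continuously; stitching happens afterwards.
    texts = [transcribe_window(w) for w in chunk_with_overlap(audio)]
    stitched = texts[0] if texts else ""
    for prev, cur in zip(texts, texts[1:]):
        # Naive overlap handling: drop the longest suffix of the previous
        # text that is also a prefix of the current one.
        k = 0
        for n in range(1, min(len(prev), len(cur)) + 1):
            if prev.endswith(cur[:n]):
                k = n
        stitched += cur[k:]
    return stitched
```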
Thanks @MarktHart, I will incorporate this into the encoder/decoder RFC I am working on. "Basic" encoder/decoder model support should land soon; the RFC covers the significant follow-on work involved in maturing encoder/decoder support to a degree that is commensurate with decoder-only support (i.e. adding more encoder/decoder models like Whisper, feature compatibility with encoder/decoder, etc.).
I have not studied the audio-length problem you are discussing in depth just yet; my guess is it will impact three key parts of the vLLM encoder/decoder model inference process:
- The semantics of submitting a request to vLLM (i.e. how does a single vLLM request map onto your "four basic options for long-form chunking")
- The information which a vLLM request must return to the caller in order to know where transcription left off
- The process of injecting control tokens (i.e. <|notimestamps|>, language choice, task, etc.) into Whisper decoder input during the autoregressive decoding process
Thoughts @MarktHart @huseinzol05 ?
CC @robertgshaw2-neuralmagic
@MarktHart if you use the FastAPI entrypoints, it can process long audio: https://mesolitica.com/blog/vllm-whisper#Process-any-length-of-audio-using-Torchaudio
Basically it's naive chunking, i.e. a non-overlapping sliding window: it uses TorchAudio to stream the audio in 1-second segments, and once 30 seconds have accumulated it passes them to Whisper to decode (see the outline below). You can check the implementation at https://github.com/mesolitica/vllm-whisper/blob/main/vllm/entrypoints/openai/serving_whisper.py
- Feel free to add an overlapping sliding window.
- Feel free to parallelize the sliding windows, because the model supports continuous batching.
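A rough outline of that loop, not the actual serving_whisper.py code: torchaudio's StreamReader provides the 1-second streaming, and transcribe() is a hypothetical stand-in for the call into the vLLM Whisper engine.

```python
# Outline of the chunking loop described above; not the actual
# serving_whisper.py code. transcribe() is a hypothetical stand-in for the
# call into the vLLM Whisper engine.
import torch
from torchaudio.io import StreamReader

SAMPLE_RATE = 16_000
CHUNK_S = 1          # stream the audio 1 second at a time
MAX_WINDOW_S = 30    # Whisper decodes at most 30 s of audio per call

def stream_transcribe(path_or_fileobj, transcribe):
    reader = StreamReader(path_or_fileobj)
    reader.add_basic_audio_stream(
        frames_per_chunk=CHUNK_S * SAMPLE_RATE, sample_rate=SAMPLE_RATE
    )
    buffer = []
    for (chunk,) in reader.stream():
        buffer.append(chunk.mean(dim=-1))           # downmix to mono
        if sum(b.numel() for b in buffer) >= MAX_WINDOW_S * SAMPLE_RATE:
            yield transcribe(torch.cat(buffer))     # decode a full 30 s window
            buffer = []
    if buffer:                                      # trailing partial window
        yield transcribe(torch.cat(buffer))
```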
Thanks @huseinzol05. I have tried naive chunking; it has good speed but causes a big increase in WER for long audio. Is it possible to implement VAD-based batching? It requires an additional (VAD) model, but it works best since the model natively supports batching. I am also eagerly waiting for the caching speedup of the encoder outputs and cross-attention layers.
When I get free time, I will try to add an overlapping sliding window like the HuggingFace implementation, or VAD.
Whisper tiny.en/small.en/medium.en aren't transcribing well (quality is low), while Whisper tiny/small/medium are doing well. Can someone please explain this?
Can you please add LoRA support for Whisper?
Regarding the .en models: if you check the source code, the first predicted token is the language token; the .en models probably predicted a different language token, so the subsequent tokens got messed up.
Hello @huseinzol05, thanks for your contribution.
I successfully started your fork on my A100 80GB GPU. Are you sure about continuous batching? I noticed that if I query with several different audios, it mixes up output tokens between them.
@Temirulan messed up in what way? Totally gibberish?
Hey, just checking in! Do you have any updates on the status of this pull request? Curious when it might be ready to merge. 😊
@Temirulan please also share the inference code showing how you do the concurrency.