
[Misc]: Total number of attention heads (40) must be divisible by tensor parallel size (6)


Anything you want to discuss about vllm.

I want to deploy Qwen1.5-32B, but I hit this problem: Total number of attention heads (40) must be divisible by tensor parallel size (6). How can I overcome this? My vLLM version is 0.4.1+cu118. Thanks a lot!

CNXDZS avatar Apr 21 '24 05:04 CNXDZS
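For context, the error comes from vLLM's requirement that the attention heads be sharded whole across tensor-parallel ranks, so the head count must be divisible by the tensor parallel size. A minimal sketch (not from the thread) of listing the sizes that work for a given checkpoint; the model id and the use of `transformers.AutoConfig` are assumptions:

```python
# Sketch: list tensor-parallel sizes that evenly divide the attention head count.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen1.5-32B-Chat")  # assumed checkpoint id
heads = cfg.num_attention_heads  # 40 for Qwen1.5-32B
valid = [tp for tp in range(1, heads + 1) if heads % tp == 0]
print(f"{heads} attention heads -> valid tensor_parallel_size values: {valid}")
# For 40 heads this prints 1, 2, 4, 5, 8, 10, 20, 40 -- note that 6 is not among them.
```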

use tensor parallel size 8 instead?

youkaichao avatar Apr 21 '24 06:04 youkaichao

use tensor parallel size 8 instead?

Thanks for your help! But I only have 6 GPUs (T4, 16 GB each), and the tensor parallel size can't be higher than the number of GPUs, right?

CNXDZS avatar Apr 21 '24 07:04 CNXDZS

Doing tp=4 is the most effective fix.

simon-mo avatar Apr 22 '24 16:04 simon-mo
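A minimal sketch of that suggestion with the offline API, assuming the Qwen/Qwen1.5-32B-Chat checkpoint and float16 (T4s do not support bfloat16); whether a 32B model actually fits in 4 x 16 GB without quantization is a separate concern, addressed in the next comment:

```python
# Sketch: launch with a tensor parallel size that divides the 40 attention heads.
# A 32B model in fp16 is roughly 64 GB of weights, so on 4 x 16 GB T4s a quantized
# checkpoint would likely still be needed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen1.5-32B-Chat",   # assumed checkpoint
    tensor_parallel_size=4,          # 40 % 4 == 0, unlike 40 % 6
    dtype="float16",                 # T4 GPUs lack bfloat16 support
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```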

The most effective way is: run the 32B INT4 model with 32K context on 2 GPUs with one vLLM process, run 3 such vLLM processes on your 6 GPUs, and map the processes onto one port using e.g. oneapi. You can get 3x the throughput.

tutu329 avatar Apr 23 '24 00:04 tutu329
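A rough sketch of that layout, assuming a GPTQ INT4 checkpoint id and the `vllm.entrypoints.openai.api_server` entry point from the 0.4.x line; the ports and the front-end router (oneapi, nginx, etc.) that merges them into one endpoint are assumptions and left out:

```python
# Hypothetical launcher: three independent vLLM servers, each pinned to 2 GPUs
# with tensor_parallel_size=2, to be load-balanced behind a single endpoint.
import os
import subprocess

MODEL = "Qwen/Qwen1.5-32B-Chat-GPTQ-Int4"  # assumed quantized checkpoint

procs = []
for i in range(3):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = f"{2 * i},{2 * i + 1}"  # GPUs 0-1, 2-3, 4-5
    procs.append(subprocess.Popen(
        [
            "python", "-m", "vllm.entrypoints.openai.api_server",
            "--model", MODEL,
            "--tensor-parallel-size", "2",
            "--port", str(8000 + i),       # 8000, 8001, 8002
            "--max-model-len", "32768",    # the 32K context from the comment
        ],
        env=env,
    ))

for p in procs:
    p.wait()
```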

Same error here: 32 heads on 3 GPUs.

eigen2017 avatar May 15 '24 10:05 eigen2017

Is it possible to run a model with 64 heads on 3 GPUs? The model does not fit on 2 GPUs, and I only have 3 on a single node.

YaliEkstein avatar Jul 04 '24 14:07 YaliEkstein

I have the same problem. Spreading the load across any number of GPUs works well with llama.cpp. Looking at the code, it seems vLLM expects the same number of kv_heads on each GPU. Modifying this could prove difficult.

leszekhanusz avatar Sep 07 '24 20:09 leszekhanusz
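For reference, a small sketch of how the KV head count interacts with the tensor parallel size (assuming `transformers.AutoConfig` and the Qwen checkpoint from earlier in the thread); the exact rule vLLM enforces (sharding vs. replicating KV heads) may differ by version, so treat this as an approximation:

```python
# Sketch: the tensor-parallel size must also divide (or be replicable over) the KV heads.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen1.5-32B-Chat")  # assumed checkpoint
heads, kv_heads = cfg.num_attention_heads, cfg.num_key_value_heads  # 40 and 8
for tp in (2, 3, 4, 6, 8):
    ok = heads % tp == 0 and (kv_heads % tp == 0 or tp % kv_heads == 0)
    print(f"tp={tp}: {'ok' if ok else 'not ok'} (heads={heads}, kv_heads={kv_heads})")
```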

Please check out https://docs.vllm.ai/en/stable/serving/distributed_serving.html:

if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.

youkaichao avatar Sep 07 '24 20:09 youkaichao
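A sketch of that configuration for the 3-GPU case above, following the quoted documentation; the entry point and model id are assumptions, and this needs a vLLM version that actually supports pipeline parallelism (0.4.1 does not):

```python
# Sketch: split the model across 3 GPUs by layers (pipeline parallel) instead of
# sharding attention heads (tensor parallel), which avoids the divisibility check.
import subprocess

subprocess.run([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "Qwen/Qwen1.5-32B-Chat",  # assumed checkpoint
    "--tensor-parallel-size", "1",       # per the docs: TP size should be 1
    "--pipeline-parallel-size", "3",     # PP size = number of GPUs on the node
])
```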

@youkaichao Thanks for your input!

I've tried your suggestion but could not make it work. To avoid polluting this issue with unrelated problems, I opened issue #8270.

leszekhanusz avatar Sep 07 '24 21:09 leszekhanusz