[Misc]: Total number of attention heads (40) must be divisible by tensor parallel size (6)

Open CNXDZS opened this issue 2 months ago • 5 comments

Anything you want to discuss about vllm.

I want to deploy Qwen1.5-32B, but I hit this error: "Total number of attention heads (40) must be divisible by tensor parallel size (6)". How can I get around this problem? My vLLM version is 0.4.1+cu118. Thanks a lot!

CNXDZS, Apr 21 '24
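
For context on where the error comes from: vLLM shards each attention layer evenly across tensor-parallel ranks, so every rank must receive a whole number of heads. Below is a minimal sketch of that constraint in plain Python, using the numbers from the error message (note that other dimensions, such as the KV-head count, can impose additional divisibility constraints not shown here):

```python
# Qwen1.5-32B reports 40 attention heads; the user has 6 GPUs.
num_attention_heads = 40
num_gpus = 6

# Tensor parallel sizes that split the heads evenly.
valid_sizes = [tp for tp in range(1, num_gpus + 1)
               if num_attention_heads % tp == 0]
print(valid_sizes)  # [1, 2, 4, 5] -- 6 is not a divisor of 40, hence the error
```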

Use tensor parallel size 8 instead?

youkaichao, Apr 21 '24

> Use tensor parallel size 8 instead?

Thanks for your help! But I only have 6 GPUs (T4, 16 GB each), and the tensor parallel size can't be larger than the number of GPUs, right?

CNXDZS, Apr 21 '24

Doing tp=4 is the most effective fix.

simon-mo, Apr 22 '24
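
A minimal sketch of this suggestion with vLLM's offline `LLM` API. Caveat: Qwen1.5-32B in FP16 needs roughly 65 GB for the weights alone, more than four 16 GB T4s provide, so an INT4 (GPTQ) checkpoint is assumed below; the model name and `max_model_len` are assumptions, not from this thread:

```python
from vllm import LLM, SamplingParams

# tp=4 works because 40 heads % 4 == 0. Optionally pin the four GPUs
# with CUDA_VISIBLE_DEVICES=0,1,2,3 before launching.
llm = LLM(
    model="Qwen/Qwen1.5-32B-Chat-GPTQ-Int4",  # assumed INT4 checkpoint
    tensor_parallel_size=4,
    quantization="gptq",
    max_model_len=4096,  # keep the KV cache small enough for 16 GB T4s
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```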

The most effective way: run the 32B model in INT4 with a 32K context on 2 GPUs per vLLM process, so three vLLM processes cover your 6 GPUs, then map the processes to a single port using e.g. OneAPI. You can get roughly 3x the throughput.

tutu329, Apr 23 '24
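
A sketch of this layout under the same assumptions as above (INT4 checkpoint; the ports and GPU pairing are hypothetical). Each process only needs 40 % 2 == 0, so the divisibility problem disappears; a reverse proxy (e.g. nginx or OneAPI) in front of ports 8000-8002 then exposes a single endpoint:

```python
import os
import subprocess

MODEL = "Qwen/Qwen1.5-32B-Chat-GPTQ-Int4"  # assumed INT4 checkpoint

# Three OpenAI-compatible vLLM servers, each pinned to a pair of GPUs.
procs = []
for i, gpus in enumerate(["0,1", "2,3", "4,5"]):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    procs.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", MODEL,
         "--quantization", "gptq",
         "--tensor-parallel-size", "2",
         "--max-model-len", "32768",  # the 32K context mentioned above
         "--port", str(8000 + i)],
        env=env,
    ))

for p in procs:
    p.wait()
```

Whether this yields a full 3x throughput depends on the request mix, but the three engines do serve requests independently.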

Same error here: 32 heads on 3 GPUs.

eigen2017, May 15 '24