[Misc]: Total number of attention heads (40) must be divisible by tensor parallel size (6)
Anything you want to discuss about vllm.
I want to deploy Qwen-1.5-32B, but I get this error: Total number of attention heads (40) must be divisible by tensor parallel size (6). How can I get around this? My vLLM version is 0.4.1+cu118. Thanks a lot!
use tensor parallel size 8 instead?
Thanks for your help! But I only have 6 GPUs (T4, 16 GB each), and tensor parallel size can't be higher than the number of GPUs, right?
Doing tp=4 is the most effective fix.
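For anyone landing here, a minimal sketch of the tp=4 setup. The model id, dtype, and context length below are illustrative assumptions, not from this thread, and a 32B model in fp16 will likely still need an INT4-quantized checkpoint to fit on 4x16 GB T4s:

```python
# Sketch only: launch Qwen1.5-32B with tensor_parallel_size=4 so the 40 attention
# heads divide evenly (40 / 4 = 10 heads per GPU). Model id and settings are
# assumptions; an INT4-quantized checkpoint is likely needed on 4x16 GB T4s.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen1.5-32B-Chat-GPTQ-Int4",  # assumed quantized checkpoint id
    tensor_parallel_size=4,                    # must divide the 40 attention heads
    dtype="half",                              # T4s do not support bfloat16
    max_model_len=4096,                        # keep the KV cache within 16 GB cards
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```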
The most effective way is: run 32B INT4 with 32k context on 2 GPUs with one vLLM process, run 3 such vLLM processes across your 6 GPUs, and map the processes onto one port using e.g. One API. You can get 3x the throughput.
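A rough sketch of that layout, with the model id, ports, and the front-end router as assumptions on top of what the comment says: each vLLM process is pinned to two GPUs via CUDA_VISIBLE_DEVICES and serves the OpenAI-compatible API on its own port.

```python
# Hypothetical sketch of the "3 x 2-GPU" layout: each vLLM process sees only two
# GPUs via CUDA_VISIBLE_DEVICES and serves on its own port. A separate router
# (e.g. One API or nginx) would then expose the three ports as one endpoint.
import os
import subprocess

MODEL = "Qwen/Qwen1.5-32B-Chat-GPTQ-Int4"  # assumed INT4 checkpoint id
procs = []
for i, gpus in enumerate(["0,1", "2,3", "4,5"]):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    procs.append(subprocess.Popen(
        [
            "python", "-m", "vllm.entrypoints.openai.api_server",
            "--model", MODEL,
            "--tensor-parallel-size", "2",   # 40 heads / 2 GPUs = 20 heads each
            "--max-model-len", "32768",
            "--dtype", "half",               # T4s have no bfloat16 support
            "--port", str(8000 + i),
        ],
        env=env,
    ))

for p in procs:
    p.wait()
```

The router in front of ports 8000-8002 is not shown; any OpenAI-compatible load balancer should work.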
Same error with 32 heads on 3 GPUs.
Is it possible to run a model with 64 heads on 3 GPUs? The model does not fit on 2 GPUs, and I only have 3 on a single node.
I have the same problem. Spreading the load across any number of GPUs works well with llama.cpp. Looking at the code, it seems vLLM expects the same number of KV heads on each GPU. Modifying this could prove difficult.
please check out https://docs.vllm.ai/en/stable/serving/distributed_serving.html :
if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
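A hedged sketch of what the docs describe (the model id is assumed, and depending on the vLLM version, pipeline parallelism may only be available through the online server rather than the offline LLM class):

```python
# Sketch of the documented option: split the model across 6 GPUs by layers
# (pipeline parallelism) instead of by attention heads (tensor parallelism),
# so the 40-head / 6-GPU divisibility constraint no longer applies.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen1.5-32B-Chat",  # illustrative model id
    tensor_parallel_size=1,          # no head splitting across GPUs
    pipeline_parallel_size=6,        # layers are split across the 6 GPUs
)
```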
@youkaichao Thanks for your input!
I've tried your suggestion but could not make it work. To avoid polluting this issue with unrelated problems, I opened issue #8270.