
[Feature]: Tensor Parallelism with a non-divisible number of attention heads

Open NadavShmayo opened this issue 9 months ago • 7 comments

🚀 The feature, motivation and pitch

I am trying to run a 70B model on a node with 3x A100-80GB GPUs. Two A100-80GB GPUs do not have enough VRAM to hold the model, and when I try to run vLLM with a tensor parallel size of 3, it raises an error saying that the number of attention heads is not divisible by 3.
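For concreteness, here is a minimal sketch of the arithmetic behind the error (assuming 64 attention heads, which is what Llama-2-70B uses; vLLM's actual validation and exact message may differ):

```python
# Sketch of the divisibility constraint that triggers the error.
# Illustrative only; vLLM performs an equivalent check in its config validation.
num_attention_heads = 64   # e.g. Llama-2-70B
tensor_parallel_size = 3

if num_attention_heads % tensor_parallel_size != 0:
    raise ValueError(
        f"Total number of attention heads ({num_attention_heads}) must be "
        f"divisible by tensor parallel size ({tensor_parallel_size})."
    )
```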

I looked into changing the tensor parallelism implementation so that it supports an uneven division of the tensors between GPUs, but I might be missing something, as there are many validations in the codebase that guard against this scenario. Is it possible to implement tensor parallelism this way?
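One conceivable direction is to assign unequal numbers of heads per rank. Below is a hypothetical sketch of such a partition (`uneven_head_partition` is not an existing vLLM function, just an illustration of the idea):

```python
# Hypothetical sketch: distribute heads as evenly as possible across ranks,
# giving the first (num_heads % tp_size) ranks one extra head each.
def uneven_head_partition(num_heads: int, tp_size: int) -> list[int]:
    base, rem = divmod(num_heads, tp_size)
    return [base + (1 if rank < rem else 0) for rank in range(tp_size)]

print(uneven_head_partition(64, 3))  # -> [22, 21, 21]
```

Even with such a split, the per-rank weight shards and attention outputs would have different shapes, so the weight-loading and collective-communication logic, which currently assume equal shard sizes across ranks, would presumably also need to handle ragged shards.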

Alternatives

No response

Additional context

No response

NadavShmayo · May 23 '24 09:05