[Feature]: Tensor Parallelism with a non-divisible number of attention heads
🚀 The feature, motivation and pitch
I am trying to run a 70B model on a node with 3x A100-80GB GPUs. Two A100-80GB GPUs do not have enough VRAM to hold the model, and when I try to run vLLM with a tensor parallel size of 3, it raises an error saying that the number of attention heads is not divisible by 3.
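For context, many 70B models (e.g. Llama-2-70B) have 64 query heads, and 64 % 3 != 0. A simplified sketch of the kind of divisibility check that trips here (names and message are illustrative, not the exact vLLM source):

```python
# Hypothetical sketch of the validation that rejects tensor_parallel_size=3.
def check_tp_compatibility(num_attention_heads: int, tensor_parallel_size: int) -> None:
    if num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"Total number of attention heads ({num_attention_heads}) "
            f"must be divisible by tensor parallel size ({tensor_parallel_size})."
        )

check_tp_compatibility(64, 2)  # passes
check_tp_compatibility(64, 3)  # raises ValueError, since 64 % 3 != 0
```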
I looked into modifying the tensor parallelism implementation so that it supports an uneven division of the tensors between GPUs, but I may be missing something, as there are many validations in the codebase that prevent this scenario. Is it possible to implement tensor parallelism this way?
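A minimal sketch of what an uneven split could look like, purely as my own illustration (this is not an existing vLLM API): assign each rank either ceil(num_heads / tp_size) or floor(num_heads / tp_size) heads.

```python
# Illustrative only: spread num_heads over tp_size ranks as evenly as possible,
# giving the first (num_heads % tp_size) ranks one extra head.
def partition_heads(num_heads: int, tp_size: int) -> list[int]:
    base, remainder = divmod(num_heads, tp_size)
    return [base + (1 if rank < remainder else 0) for rank in range(tp_size)]

print(partition_heads(64, 3))  # [22, 21, 21]
```

The difficulty, as far as I can tell, is that every tensor-parallel layer (QKV and output projections, and the attention kernels themselves) would then need per-rank shard sizes instead of one uniform size, which seems to be the assumption behind the existing validations.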
Alternatives
No response
Additional context
No response