Harry Mellor
I believe this is an old warning about AWQ in general. The fastest (and definitely optimised!) AWQ we have is `--quantization awq_marlin`.
I believe this should be automatically set though.
Hi @maxdebayser do you plan to continue this work?
Closing as stale. If you plan to continue this work, feel free to re-open.
@jikunshang do you plan to continue this work?
Great! In that case, could you remove the TODO from the docs regarding this feature?
Closing as stale
It is expected that the first and last rank will have higher memory usage because:
- The first rank contains the input embeddings
- The last rank contains the output...
The other half of this issue is that DeepSeek R1 has 61 hidden layers. Currently, if the number of hidden layers is not divisible by the pipeline world size, the...
I'll try and make a PR that handles this automatically, but in the meantime could you try setting `VLLM_PP_LAYER_PARTITION=7,8,8,8,8,8,7,7`?
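For reference, here is a minimal sketch of how a partition like the one above could be derived by hand. It is not vLLM's own partitioning logic; `pp_layer_partition` and `to_env_value` are hypothetical helpers, and the policy (give spare layers to the middle ranks first, since the first and last ranks already hold extra weights) is an assumption made for illustration.

```python
def pp_layer_partition(num_layers: int, pp_size: int) -> list[int]:
    """Split num_layers across pp_size pipeline ranks (hypothetical helper).

    Assumed policy: every rank gets the same base count, and any leftover
    layers go to the middle ranks first, then the last rank, then the first.
    """
    if pp_size < 1:
        raise ValueError("pp_size must be at least 1")
    if num_layers < pp_size:
        raise ValueError("need at least one layer per pipeline rank")

    base, extra = divmod(num_layers, pp_size)
    counts = [base] * pp_size

    # Ranks in the order they should receive a spare layer.
    if pp_size == 1:
        order = [0]
    else:
        order = list(range(1, pp_size - 1)) + [pp_size - 1, 0]

    # extra is always smaller than pp_size, so this slice never overruns.
    for rank in order[:extra]:
        counts[rank] += 1
    return counts


def to_env_value(counts: list[int]) -> str:
    """Format the counts as a comma-separated string for VLLM_PP_LAYER_PARTITION."""
    return ",".join(str(c) for c in counts)


if __name__ == "__main__":
    # 61 hidden layers over 8 pipeline ranks, as in the comment above.
    print(to_env_value(pp_layer_partition(61, 8)))
```

The counts always sum to the number of hidden layers, so the resulting string can be passed as `VLLM_PP_LAYER_PARTITION` when launching with a matching pipeline parallel size.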