Jee Jee Li
> > @badrjd Has this issue been resolved for you? If not, you can try adding `--max-seq-len-to-capture 48000`. It's most likely due to this reason.
>
> Is there any explanation?...
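For context, a minimal sketch of where this option goes: the CLI flag maps to the `max_seq_len_to_capture` engine argument, which bounds the sequence length covered by CUDA graph capture (longer sequences fall back to eager mode). The model name below is just a placeholder.

```python
from vllm import LLM, SamplingParams

# Offline inference: pass the engine argument directly.
# Equivalent server flag: `vllm serve <model> --max-seq-len-to-capture 48000`
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    max_seq_len_to_capture=48000,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```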
Indeed, it's not supported; only in-flight quantization supports TP.
Thanks for your contribution. Have you tested the performance of autotune on models like Llama?
@congcongchen123 you can refer to https://github.com/vllm-project/vllm/blob/main/tests/lora/test_llama_tp.py
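For a quick sense of what that test exercises, here is a rough sketch of the same pattern, LoRA on a tensor-parallel Llama base model (model and adapter paths are placeholders, not the ones used in the test):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Serve the base model across 2 GPUs with LoRA enabled.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # placeholder base model
    enable_lora=True,
    max_loras=4,
    max_lora_rank=64,
    tensor_parallel_size=2,
)

outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.0, max_tokens=32),
    lora_request=LoRARequest("my-lora", 1, "/path/to/lora_adapter"),  # placeholder adapter
)
print(outputs[0].outputs[0].text)
```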
We have a benchmark result at [slack_lora_thread](https://vllm-dev.slack.com/archives/C07V3D6F493/p1739803031668149?thread_ts=1738948465.513199&cid=C07V3D6F493). We are aware of this issue and will be optimizing LoRA performance. Could you please provide your model and LoRA config?
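If it helps, a minimal way to pull out the config fields that matter for triage, assuming a standard PEFT-style adapter directory (the path is a placeholder):

```python
import json
from pathlib import Path

# Print the LoRA fields most relevant to performance:
# rank, alpha, target modules, and the base model.
adapter_dir = Path("/path/to/lora_adapter")  # placeholder path
cfg = json.loads((adapter_dir / "adapter_config.json").read_text())
print({k: cfg.get(k) for k in ("r", "lora_alpha", "target_modules", "base_model_name_or_path")})
```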
I will test #14626 ASAP and will provide the test results here. @rtx-8000 @varun-sundar-rabindranath
 
> So what could be wrong that I see slower performance for the lower rank, especially at max_num_seqs (request rate) = 1, while there is no such big difference at higher rates?...