Murali Andoorveedu
Hey @GindaChen, there are a couple of things here: we don't support OPT yet, and the LLMEngine entry point won't work. We're only supporting AsyncLLMEngine right now.
The way I would recommend is to try the online serving entrypoint with the LLaMa model. That'd be the best way to start playing around with it, @GindaChen.
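For anyone who wants to skip the HTTP layer, a rough sketch of driving `AsyncLLMEngine` directly is below; the model name, parallel sizes, and exact `generate()` argument order are illustrative and may differ slightly between vLLM versions.

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


async def main():
    # Placeholder model and parallel size; adjust to your hardware.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(
            model="meta-llama/Meta-Llama-3-8B",
            pipeline_parallel_size=2,
        )
    )
    params = SamplingParams(max_tokens=64)
    final = None
    # generate() is an async generator that yields partial RequestOutputs.
    async for output in engine.generate("Hello, my name is", params, request_id="req-0"):
        final = output
    print(final.outputs[0].text)


asyncio.run(main())
```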
LGTM - I guess one thing we can add is a PP PyNCCL group.
We only need point-to-point, blocking send and blocking recv. It's not critical though, unless `torch.distributed.*` ops don't work well with CUDA graphs.
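For context, a minimal sketch of the kind of point-to-point, blocking send/recv being discussed, written against plain `torch.distributed` rather than the PR's PyNCCL group (the ranks, tensor shape, and launch setup are assumptions for illustration):

```python
import torch
import torch.distributed as dist

# Two-rank pipeline hand-off: rank 0 sends activations, rank 1 receives them.
# Assumes launch via `torchrun --nproc_per_node=2 this_script.py`.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

hidden = torch.empty(4, 4096, device="cuda")  # placeholder hidden-state shape

if rank == 0:
    hidden.normal_()
    dist.send(hidden, dst=1)  # blocking point-to-point send
else:
    dist.recv(hidden, src=0)  # blocking point-to-point recv
    print("rank 1 received", tuple(hidden.shape))

dist.destroy_process_group()
```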
@SolitaryThinker Thanks for the thorough investigation and the fix! It's indeed true that there are existing issues with hanging on the current vLLM mainline, and I have not rebased on...
@SolitaryThinker I tried the model/commands above that are giving you issues. I was unable to reproduce on my setup.

### My Setup

Started a fresh instance with the following: GCP...
@zhengxingmao Thanks for reporting this! Does this happen without PP? If not, I think it could be some interaction of the following flags with PP.

```
--trust-remote-code --model /data/llvm/llama_weight --gpu-memory-utilization 0.60
```
...
@SolitaryThinker I did some investigation into what you were saying, and I think there are real hangs. I tried LLaMa 3 8B with an effectively infinite request rate on 2...
Hi @zhengxingmao, thanks for trying it out. If `--trust-remote-code` does not work even on the main branch, could you file a bug against the repo? Not sure if that's...
Works for me ok with `python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B --port 8092 --enable-chunked-prefill --enforce-eager --pipeline-parallel-size 2 --trust-remote-code`

> Qwen2 is also a very popular model at the moment, and I...
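Once that server is up, it can be exercised through the OpenAI-compatible API; a quick sketch with the `openai` Python client (the base URL, API key, and prompt are placeholders, not part of the PR):

```python
from openai import OpenAI

# The api_server speaks the OpenAI API; any key works unless --api-key was set.
client = OpenAI(base_url="http://localhost:8092/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    prompt="Pipeline parallelism in vLLM",
    max_tokens=32,
)
print(completion.choices[0].text)
```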