Murali Andoorveedu

Results: 50 comments by Murali Andoorveedu

Hey @GindaChen, there are a couple of things here. We don't support OPT yet, and the LLMEngine entry point won't work; we're only supporting AsyncLLMEngine right now.

What I'd recommend is trying the online serving entrypoint with the LLaMA model. That'd be the best way to start playing around with it, @GindaChen.
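For anyone who wants to drive the engine directly rather than through the HTTP server, here's a minimal sketch of the AsyncLLMEngine path, assuming a 2-stage PP setup and Llama 3 8B as the example model; exact engine-arg names and defaults may differ across vLLM versions.

```python
# Minimal sketch of driving AsyncLLMEngine directly (the only supported
# entrypoint on the PP branch), assuming 2 pipeline stages on one node.
# Engine args are illustrative and may vary by vLLM version.
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3-8B",  # example model; any supported LLaMA works
        pipeline_parallel_size=2,
    )
)

async def main():
    params = SamplingParams(max_tokens=32)
    final = None
    # generate() is an async generator; iterate it to stream request outputs.
    async for output in engine.generate("Hello, my name is", params, request_id="demo-0"):
        final = output
    print(final.outputs[0].text)

asyncio.run(main())
```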

LGTM - I guess one thing we could add is a PP PyNCCL group.

We only need point-to-point ops: blocking send and blocking recv. It's not critical, though, unless `torch.distributed.*` ops don't work well with CUDA graphs.
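For reference, this is the blocking point-to-point pattern being discussed; a minimal sketch with plain `torch.distributed`, assuming two ranks launched with torchrun (illustrative only, not the actual vLLM PP communication code).

```python
# Blocking point-to-point send/recv sketch; run with:
#   torchrun --nproc_per_node=2 p2p_demo.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    tensor = torch.zeros(4, device="cuda")
    if rank == 0:
        tensor += 42.0
        dist.send(tensor, dst=1)   # blocking send to the next PP stage
    else:
        dist.recv(tensor, src=0)   # blocking recv from the previous PP stage
        print(f"rank {rank} received {tensor.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```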

@SolitaryThinker Thanks for the thorough investigation and the fix! It's indeed true that there are existing issues with hanging on the current vLLM mainline, and I have not rebased on...

@SolitaryThinker I tried the model/commands above that are giving you issues. I was unable to reproduce on my setup.

### My Setup

Started a fresh instance with the following: GCP...

@zhengxingmao Thanks for reporting this! Does this happen without PP? If not, I think it could be some interaction between PP and the following flags: `--trust-remote-code --model /data/llvm/llama_weight --gpu-memory-utilization 0.60`...

@SolitaryThinker I did some investigation into what you were saying, and I think there are real hangs. I tried LLaMa 3 8B with an effectively infinite request rate on 2...

Hi @zhengxingmao, thanks for trying it out. If `--trust-remote-code` does not work even on the main branch, can you file a bug against the repo? Not sure if that's...

Works for me ok with `python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B --port 8092 --enable-chunked-prefill --enforce-eager --pipeline-parallel-size 2 --trust-remote-code`

> Qwen2 is also a very popular model at the moment, and I...
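To sanity-check a server started with the command above, a minimal client sketch, assuming it is listening on port 8092 and using the standard OpenAI-compatible `/v1/completions` route:

```python
# Minimal client sketch for the server launched above (port 8092 assumed;
# adjust host/port/model to match your launch command).
import requests

resp = requests.post(
    "http://localhost:8092/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["text"])
```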