William Lin

Results: 76 comments by William Lin

Hi @andoorve, while benchmarking with your PR, I've consistently encountered engine timeouts with smaller models on setups using well under total VRAM capacity, which might relate to the issues you've linked...

I see you are using multi-step, so it could also be related to https://github.com/vllm-project/vllm/pull/8403, now merged.

I forgot to handle the multiproc case and will make a PR. For now, set `--worker-use-ray` to use the Ray backend and it should work.
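As a sketch, the workaround looks like this when launching the OpenAI-compatible server; the model name and parallelism settings below are placeholders, not from the original report:

```shell
# Workaround sketch: force the Ray distributed backend instead of multiproc.
# Model and --tensor-parallel-size are illustrative placeholders.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --tensor-parallel-size 2 \
    --worker-use-ray
```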

For the timeout issue try setting the env var: `VLLM_RPC_GET_DATA_TIMEOUT_MS=1800000`
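If you launch the engine from Python rather than the shell, a minimal sketch of setting the same variable (it must be set before the engine process starts) is:

```python
import os

# Raise the RPC timeout to 30 minutes; the value is in milliseconds.
# This must be set before the vLLM engine/server process is created.
os.environ["VLLM_RPC_GET_DATA_TIMEOUT_MS"] = "1800000"

# Sanity check: 1,800,000 ms is 30 minutes.
assert int(os.environ["VLLM_RPC_GET_DATA_TIMEOUT_MS"]) == 30 * 60 * 1000
```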

You could try increasing the max batch size with `--max-num-seqs`. By default it is 256, which may be too small for an fp8 8B model.
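For example, a sketch of raising the limit at serve time; the model name and the value 512 are illustrative assumptions, not recommendations from the original comment:

```shell
# Allow up to 512 sequences per batch instead of the default 256.
# Model name is a placeholder.
vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-FP8 --max-num-seqs 512
```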

flashinfer+multi-step will be supported by this PR https://github.com/vllm-project/vllm/pull/7928

Let me try to reproduce on my end and take a look. Meanwhile, @ashgold, @br3no, could you please try `--disable-async-output-proc` and see if that changes anything?
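To restate the requested experiment as a command, assuming the server is started via `vllm serve` with a placeholder model and otherwise the same flags as the failing run:

```shell
# Re-run the failing workload with async output processing disabled.
# Model name is a placeholder; keep all other flags from the original repro.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --disable-async-output-proc
```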

I don't think I can reproduce it, but it's probably not caused by multi-step, as that's disabled by default.