Hi @andoorve, while benchmarking with your PR I've consistently hit engine timeouts with smaller models on setups well below total VRAM capacity, which might relate to the issues you've linked...
I see you are using multi-step, so it could also be related to https://github.com/vllm-project/vllm/pull/8403, which is now merged.
I forgot to handle the multiproc case. Will make a PR. For now set `--worker-use-ray` to use the ray backend and it should work.
For the timeout issue try setting the env var: `VLLM_RPC_GET_DATA_TIMEOUT_MS=1800000`
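For example, a minimal launch sketch combining both workarounds, assuming the OpenAI-compatible server entrypoint (the model name and port are placeholders, adjust for your setup):

```bash
# Raise the RPC timeout to 30 minutes and fall back to the Ray backend.
export VLLM_RPC_GET_DATA_TIMEOUT_MS=1800000

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --worker-use-ray \
    --port 8000
```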
You could try increasing the max batch size with `--max-num-seqs`. By default it is 256, which may be too small for an fp8 8B model.
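For example (the model name and the value 512 are illustrative placeholders, not recommendations):

```bash
# Allow up to 512 concurrent sequences per scheduler iteration
# instead of the default 256.
python -m vllm.entrypoints.openai.api_server \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --max-num-seqs 512
```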
flashinfer+multi-step will be supported by this PR https://github.com/vllm-project/vllm/pull/7928
The PR is merged now.
Let me try to reproduce on my end and take a look. Meanwhile, @ashgold, @br3no, could you please try `--disable-async-output-proc` and see if that changes anything?
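For reference, a sketch of where the flag goes, assuming the OpenAI-compatible server entrypoint (the model name is a placeholder):

```bash
# Disable async output processing to rule it out as the culprit.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --disable-async-output-proc
```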
I don't think I can reproduce it, but it's probably not caused by my multi-step changes, since multi-step is disabled by default.
cc @alexm-neuralmagic @megha95