jon-chuang

438 comments by jon-chuang

Actually, @comaniac, I noticed that there are explicit asserts forbidding the use of FlashInfer kernels for chunked prefill: https://github.com/vllm-project/vllm/blob/774cd1d3bf7890c6abae6c7ace798c4a376b2b20/vllm/attention/backends/flashinfer.py#L195 as pointed out in https://github.com/flashinfer-ai/flashinfer/issues/392#issuecomment-2246997216 --- My understanding is that this...
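The shape of the kind of guard being referenced is roughly as below; this is an illustrative sketch only, not the actual code at the linked vLLM line, and the names are placeholders.

```python
# Illustrative sketch, not a copy of vllm/attention/backends/flashinfer.py.
# The idea: the backend asserts up front that it is not being asked to serve
# a chunked-prefill batch, since its prefill kernel path does not handle one.
def assert_no_chunked_prefill(is_prefill_batch: bool, chunked_prefill_enabled: bool) -> None:
    assert not (is_prefill_batch and chunked_prefill_enabled), (
        "FlashInfer backend does not support chunked prefill"
    )
```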

Anyway, please assign it to me; I will investigate further.

> Forking of sequences during beam search: would a deep copy of the processor be feasible?

Sounds like there should be a replay of the FSM, at the very least;...
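To make the two options in that exchange concrete, here is a minimal, hypothetical sketch of a stateful FSM logits processor being forked either by deep copy or by replaying the child sequence's tokens. The class and method names are illustrative, not vLLM's or outlines' actual API.

```python
import copy

class FSMLogitsProcessor:
    """Tracks the FSM state for one sequence during guided decoding."""

    def __init__(self, fsm, state=0):
        self.fsm = fsm      # transition table: fsm[state][token_id] -> next state
        self.state = state  # mutable, per-sequence state

    def advance(self, token_id):
        self.state = self.fsm[self.state][token_id]

    # Option 1: deep-copy the processor so the child sequence gets its own state.
    def fork_by_copy(self):
        return copy.deepcopy(self)

    # Option 2: rebuild the child's state by replaying its token history from scratch.
    @classmethod
    def fork_by_replay(cls, fsm, token_history):
        proc = cls(fsm)
        for tok in token_history:
            proc.advance(tok)
        return proc


# Toy 2-state FSM; both fork strategies end up in the same state.
fsm = {0: {1: 1, 2: 0}, 1: {1: 1, 2: 0}}
parent = FSMLogitsProcessor(fsm)
parent.advance(1)
child_a = parent.fork_by_copy()
child_b = FSMLogitsProcessor.fork_by_replay(fsm, [1])
assert child_a.state == child_b.state == 1
```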

@joelonsql what is the relationship between `fn`, `@parameter` and `@nopython`?

I think you should just spawn more instances, one per GPU, as follows (see also the note below):

```
for X in {0..7}; do
  CUDA_VISIBLE_DEVICES=$X python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2-7B-Instruct/ \
    --served-model-name aaa-$X \
    --trust-remote-code \
    --tensor-parallel-size 1 ...
```
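One caveat to the loop above (my note, not part of the original comment): each instance needs its own `--port`, otherwise they will all try to bind the default one. A hypothetical client call against one of the spawned servers, assuming instance X was started with `--port 800X`:

```python
# Hypothetical client usage for the per-GPU servers spawned above; assumes
# instance X was launched with `--port 800X` and `--served-model-name aaa-X`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # instance 0
resp = client.chat.completions.create(
    model="aaa-0",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```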

Probably, but if you want, you can also use something like Ray Serve; then you can scale out to multi-node with a single control-plane entrypoint. You can also use...
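A minimal sketch of the Ray Serve route, assuming one replica per GPU, each owning its own single-GPU vLLM engine behind one HTTP entrypoint; the deployment name, model path, and replica count here are placeholders.

```python
from ray import serve
from vllm import LLM, SamplingParams

MODEL_PATH = "/data/models/Qwen2-7B-Instruct/"  # placeholder path

@serve.deployment(num_replicas=8, ray_actor_options={"num_gpus": 1})
class VLLMReplica:
    def __init__(self):
        # Each replica loads its own single-GPU engine; Ray pins it to one GPU.
        self.llm = LLM(model=MODEL_PATH, trust_remote_code=True, tensor_parallel_size=1)

    async def __call__(self, request):
        # Ray Serve passes a Starlette request; requests are load-balanced
        # across replicas by the Serve proxy.
        prompt = (await request.json())["prompt"]
        out = self.llm.generate([prompt], SamplingParams(max_tokens=128))
        return {"text": out[0].outputs[0].text}

app = VLLMReplica.bind()
serve.run(app)  # single entrypoint; the Serve HTTP proxy listens on :8000 by default
```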