jon-chuang

438 comments by jon-chuang

Actually, @comaniac, I noticed that there are explicit asserts forbidding the use of FlashInfer kernels for chunked prefill: https://github.com/vllm-project/vllm/blob/774cd1d3bf7890c6abae6c7ace798c4a376b2b20/vllm/attention/backends/flashinfer.py#L195 as pointed out in https://github.com/flashinfer-ai/flashinfer/issues/392#issuecomment-2246997216 --- My understanding is that this...
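The shape of the kind of guard being referenced is roughly as below; this is an illustrative sketch only, not the actual code at the linked vLLM line, and the names are placeholders.

```python
# Illustrative sketch, not a copy of vllm/attention/backends/flashinfer.py.
# The idea: the backend asserts up front that it is not being asked to serve
# a chunked-prefill batch, since its prefill kernel path does not handle one.
def assert_no_chunked_prefill(is_prefill_batch: bool, chunked_prefill_enabled: bool) -> None:
    assert not (is_prefill_batch and chunked_prefill_enabled), (
        "FlashInfer backend does not support chunked prefill"
    )
```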

Anyway, please assign it to me; I will investigate further.

> Forking of sequences during beam search: would a deep copy of the processor be feasible?

Sounds like there should be a replay of the FSM, at the very least;...
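To make the two options in that exchange concrete, here is a minimal, hypothetical sketch of a stateful FSM logits processor being forked either by deep copy or by replaying the child sequence's tokens. The class and method names are illustrative, not vLLM's or outlines' actual API.

```python
import copy

class FSMLogitsProcessor:
    """Tracks the FSM state for one sequence during guided decoding."""

    def __init__(self, fsm, state=0):
        self.fsm = fsm      # transition table: fsm[state][token_id] -> next state
        self.state = state  # mutable, per-sequence state

    def advance(self, token_id):
        self.state = self.fsm[self.state][token_id]

    # Option 1: deep-copy the processor so the child sequence gets its own state.
    def fork_by_copy(self):
        return copy.deepcopy(self)

    # Option 2: rebuild the child's state by replaying its token history from scratch.
    @classmethod
    def fork_by_replay(cls, fsm, token_history):
        proc = cls(fsm)
        for tok in token_history:
            proc.advance(tok)
        return proc


# Toy 2-state FSM; both fork strategies end up in the same state.
fsm = {0: {1: 1, 2: 0}, 1: {1: 1, 2: 0}}
parent = FSMLogitsProcessor(fsm)
parent.advance(1)
child_a = parent.fork_by_copy()
child_b = FSMLogitsProcessor.fork_by_replay(fsm, [1])
assert child_a.state == child_b.state == 1
```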

@joelonsql what is the relationship between `fn`, `@parameter` and `@nopython`?

I think you should just spawn more instances, one per GPU, as follows (see also the note below):

```
for X in {0..7}; do
  CUDA_VISIBLE_DEVICES=$X python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2-7B-Instruct/ \
    --served-model-name aaa-$X \
    --trust-remote-code \
    --tensor-parallel-size 1 ...
```
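One caveat to the loop above (my note, not part of the original comment): each instance needs its own `--port`, otherwise they will all try to bind the default one. A hypothetical client call against one of the spawned servers, assuming instance X was started with `--port 800X`:

```python
# Hypothetical client usage for the per-GPU servers spawned above; assumes
# instance X was launched with `--port 800X` and `--served-model-name aaa-X`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # instance 0
resp = client.chat.completions.create(
    model="aaa-0",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```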

Probably, but if you want, you can also use something like Ray Serve; then you can scale out to multi-node with a single control-plane entrypoint. You can also use...
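A minimal sketch of the Ray Serve route, assuming one replica per GPU, each owning its own single-GPU vLLM engine behind one HTTP entrypoint; the deployment name, model path, and replica count here are placeholders.

```python
from ray import serve
from vllm import LLM, SamplingParams

MODEL_PATH = "/data/models/Qwen2-7B-Instruct/"  # placeholder path

@serve.deployment(num_replicas=8, ray_actor_options={"num_gpus": 1})
class VLLMReplica:
    def __init__(self):
        # Each replica loads its own single-GPU engine; Ray pins it to one GPU.
        self.llm = LLM(model=MODEL_PATH, trust_remote_code=True, tensor_parallel_size=1)

    async def __call__(self, request):
        # Ray Serve passes a Starlette request; requests are load-balanced
        # across replicas by the Serve proxy.
        prompt = (await request.json())["prompt"]
        out = self.llm.generate([prompt], SamplingParams(max_tokens=128))
        return {"text": out[0].outputs[0].text}

app = VLLMReplica.bind()
serve.run(app)  # single entrypoint; the Serve HTTP proxy listens on :8000 by default
```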