Cody Yu

161 comments by Cody Yu

@rkooo567 @zhuohan123 @simon-mo @WoosukKwon @youkaichao @LiuXiaoxuanPKU I've done the first round of refactoring: 1. The attention-unrelated logic (tokens, sequence lengths, LoRA, MM, etc.) remains in `prepare_input`. 2. Keep prefill...
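To make point 1 concrete, here is a minimal sketch of that split. The names (`ModelInputBuilder`, `AttentionMetadataBuilder`, the `req` attributes) are hypothetical and do not mirror vLLM's actual classes; they only illustrate keeping the attention-unrelated bookkeeping in `prepare_input` while delegating attention-specific metadata to a backend-owned builder:

```python
# Hypothetical sketch of the prepare_input split; not vLLM's real code.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelInput:
    input_tokens: List[int] = field(default_factory=list)
    seq_lens: List[int] = field(default_factory=list)
    lora_ids: List[int] = field(default_factory=list)
    attn_metadata: object = None  # filled in by the attention backend


class AttentionMetadataBuilder:
    """Backend-specific: builds whatever the attention kernel needs."""

    def build(self, seq_lens: List[int], is_prefill: List[bool]) -> object:
        # e.g. block tables, slot mappings, prefill/decode split ...
        return {"seq_lens": seq_lens, "is_prefill": is_prefill}


class ModelInputBuilder:
    def __init__(self, attn_builder: AttentionMetadataBuilder):
        self.attn_builder = attn_builder

    def prepare_input(self, requests) -> ModelInput:
        model_input = ModelInput()
        is_prefill = []
        for req in requests:
            # Attention-unrelated bookkeeping stays here.
            model_input.input_tokens.extend(req.token_ids)
            model_input.seq_lens.append(len(req.token_ids))
            model_input.lora_ids.append(getattr(req, "lora_id", 0))
            is_prefill.append(req.num_computed_tokens == 0)
        # Attention-specific metadata is delegated to the backend.
        model_input.attn_metadata = self.attn_builder.build(
            model_input.seq_lens, is_prefill)
        return model_input
```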

I have the same question about how to get the same latency with offloading. Based on the code changes, the offloaded weights are transferred to the GPU synchronously when needed without...

Oh I guess the confusion comes from this statement: `When the percentage specified by the user is insufficient to hold the weights, the vLLM will continue to work with some...

The point is that whatever the data transfer time is, it adds directly to the forward latency without prefetching, and may become critical when the inter-token latency during decoding is just...
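As a rough illustration of why prefetching matters, here is a PyTorch sketch that copies the next layer's weights on a side stream so the transfer overlaps with the current layer's compute. `prefetch_layer`, the layer call signature, and the pinned CPU weights are all assumptions, not the PR's actual implementation; the only point is that a synchronous `.to("cuda")` at use time adds the full transfer time to the forward latency, while an overlapped copy can hide it:

```python
# Hedged sketch of weight prefetching on a side CUDA stream (assumes
# pinned CPU weights so the H2D copy is truly asynchronous).
import torch

copy_stream = torch.cuda.Stream()


def prefetch_layer(cpu_weight: torch.Tensor):
    """Start an async H2D copy on the side stream; return (gpu_tensor, event)."""
    with torch.cuda.stream(copy_stream):
        gpu_weight = cpu_weight.to("cuda", non_blocking=True)
        event = torch.cuda.Event()
        event.record(copy_stream)
    return gpu_weight, event


def forward(layers, cpu_weights, x):
    # Start copying layer 0 before the loop, then stay one layer ahead.
    next_weight, next_ready = prefetch_layer(cpu_weights[0])
    for i, layer in enumerate(layers):
        weight, ready = next_weight, next_ready
        if i + 1 < len(layers):
            next_weight, next_ready = prefetch_layer(cpu_weights[i + 1])
        # The compute stream waits only for the copy it actually needs.
        torch.cuda.current_stream().wait_event(ready)
        x = layer(x, weight)  # hypothetical layer(x, weight) signature
    return x
```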

Took a brief look. The approach is OK in general, but I can see that some additional overhead is introduced because more tensors/logic are processed. I could review the...

> One alternative is to move this to a custom model runner, just for spec decode. Do you think that's better or worse than the current approach? In general I...

Thanks for the RFC; this is indeed a useful feature for batch inference with a common prefix. For the proposed changes: > 1. A preprocess part for the building of...

I understand what you did to the requests. 1. The logic of figuring out the shared prefix and generating a common-prefix request can be done outside of the scheduler (see the sketch below). 2....
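A hedged sketch of point 1: detecting the shared prefix outside the scheduler and submitting it as its own request first, so its KV cache is computed and cached before the individual requests run. `find_common_prefix` and the surrounding flow are illustrative only and not part of vLLM's API:

```python
# Illustrative only: shared-prefix detection done by the caller,
# outside the scheduler.
from typing import List, Sequence


def find_common_prefix(token_id_lists: Sequence[Sequence[int]]) -> List[int]:
    """Return the longest token prefix shared by all requests."""
    if not token_id_lists:
        return []
    prefix = list(token_id_lists[0])
    for tokens in token_id_lists[1:]:
        limit = 0
        for a, b in zip(prefix, tokens):
            if a != b:
                break
            limit += 1
        prefix = prefix[:limit]
        if not prefix:
            break
    return prefix


# Usage: detect the shared prefix, submit it once so its KV cache is
# populated (and reused via prefix caching), then submit the requests.
batch = [[1, 2, 3, 4, 9], [1, 2, 3, 5], [1, 2, 3, 4, 7]]
shared = find_common_prefix(batch)  # -> [1, 2, 3]
```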

> Could something like changing the scheduler's `self.waiting` from other classes (the prefix-caching manager, for example) be regarded as "outside the engine"? If not, is there...