Cody Yu

161 comments by Cody Yu

@rkooo567 @zhuohan123 @simon-mo @WoosukKwon @youkaichao @LiuXiaoxuanPKU I've done the first round of refactoring: 1. The attention-unrelated logic (tokens, sequence lengths, LoRA, MM, etc.) remains in `prepare_input`. 2. Keep prefill...
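To make point 1 concrete, here is a minimal sketch of that split. The names (`ModelInputBuilder`, `AttentionMetadataBuilder`, the `req` attributes) are hypothetical and do not mirror vLLM's actual classes; they only illustrate keeping the attention-unrelated bookkeeping in `prepare_input` while delegating attention-specific metadata to a backend-owned builder:

```python
# Hypothetical sketch of the prepare_input split; not vLLM's real code.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelInput:
    input_tokens: List[int] = field(default_factory=list)
    seq_lens: List[int] = field(default_factory=list)
    lora_ids: List[int] = field(default_factory=list)
    attn_metadata: object = None  # filled in by the attention backend


class AttentionMetadataBuilder:
    """Backend-specific: builds whatever the attention kernel needs."""

    def build(self, seq_lens: List[int], is_prefill: List[bool]) -> object:
        # e.g. block tables, slot mappings, prefill/decode split ...
        return {"seq_lens": seq_lens, "is_prefill": is_prefill}


class ModelInputBuilder:
    def __init__(self, attn_builder: AttentionMetadataBuilder):
        self.attn_builder = attn_builder

    def prepare_input(self, requests) -> ModelInput:
        model_input = ModelInput()
        is_prefill = []
        for req in requests:
            # Attention-unrelated bookkeeping stays here.
            model_input.input_tokens.extend(req.token_ids)
            model_input.seq_lens.append(len(req.token_ids))
            model_input.lora_ids.append(getattr(req, "lora_id", 0))
            is_prefill.append(req.num_computed_tokens == 0)
        # Attention-specific metadata is delegated to the backend.
        model_input.attn_metadata = self.attn_builder.build(
            model_input.seq_lens, is_prefill)
        return model_input
```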

I have the same question about how to get the same latency with offloading. Based on the code changes, the offloaded weights are transferred to the GPU synchronously when needed without...

Oh I guess the confusion comes from this statement: `When the percentage specified by the user is insufficient to hold the weights, the vLLM will continue to work with some...

The point is that whatever the data transfer time is, it adds directly to the forward latency without prefetching, and may become critical when the inter-token latency during decoding is just...
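As a rough illustration of why prefetching matters, here is a PyTorch sketch that copies the next layer's weights on a side stream so the transfer overlaps with the current layer's compute. `prefetch_layer`, the layer call signature, and the pinned CPU weights are all assumptions, not the PR's actual implementation; the only point is that a synchronous `.to("cuda")` at use time adds the full transfer time to the forward latency, while an overlapped copy can hide it:

```python
# Hedged sketch of weight prefetching on a side CUDA stream (assumes
# pinned CPU weights so the H2D copy is truly asynchronous).
import torch

copy_stream = torch.cuda.Stream()


def prefetch_layer(cpu_weight: torch.Tensor):
    """Start an async H2D copy on the side stream; return (gpu_tensor, event)."""
    with torch.cuda.stream(copy_stream):
        gpu_weight = cpu_weight.to("cuda", non_blocking=True)
        event = torch.cuda.Event()
        event.record(copy_stream)
    return gpu_weight, event


def forward(layers, cpu_weights, x):
    # Start copying layer 0 before the loop, then stay one layer ahead.
    next_weight, next_ready = prefetch_layer(cpu_weights[0])
    for i, layer in enumerate(layers):
        weight, ready = next_weight, next_ready
        if i + 1 < len(layers):
            next_weight, next_ready = prefetch_layer(cpu_weights[i + 1])
        # The compute stream waits only for the copy it actually needs.
        torch.cuda.current_stream().wait_event(ready)
        x = layer(x, weight)  # hypothetical layer(x, weight) signature
    return x
```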

Took a brief look. The approach is OK in general, but I can see that some additional overhead is introduced because more tensors/logic are processed. I could review the...

> One alternative is to move this to a custom model runner, just for spec decode. Do you think that's better or worse than the current approach? In general I...

Thanks for the RFC; this is indeed a useful feature for batch inference with a common prefix. For the proposed changes: > 1. A preprocess part for the building of...

I understand what you did to the requests. 1. The logic of figuring out the shared prefix and generating a common-prefix request can be done outside of the scheduler (see the sketch below). 2....
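A hedged sketch of point 1: detecting the shared prefix outside the scheduler and submitting it as its own request first, so its KV cache is computed and cached before the individual requests run. `find_common_prefix` and the surrounding flow are illustrative only and not part of vLLM's API:

```python
# Illustrative only: shared-prefix detection done by the caller,
# outside the scheduler.
from typing import List, Sequence


def find_common_prefix(token_id_lists: Sequence[Sequence[int]]) -> List[int]:
    """Return the longest token prefix shared by all requests."""
    if not token_id_lists:
        return []
    prefix = list(token_id_lists[0])
    for tokens in token_id_lists[1:]:
        limit = 0
        for a, b in zip(prefix, tokens):
            if a != b:
                break
            limit += 1
        prefix = prefix[:limit]
        if not prefix:
            break
    return prefix


# Usage: detect the shared prefix, submit it once so its KV cache is
# populated (and reused via prefix caching), then submit the requests.
batch = [[1, 2, 3, 4, 9], [1, 2, 3, 5], [1, 2, 3, 4, 7]]
shared = find_common_prefix(batch)  # -> [1, 2, 3]
```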

> Could something like changing the scheduler's `self.waiting` from other classes (the prefix-caching manager, for example) be regarded as "outside the engine"? If not, is there...