Li, Jiang
It seems ```seq_lens``` in ```torch_sdpa.py``` should be replaced with ```seqlens```. I have verified the CPU backend with the model test, and the change worked well.
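For context, here is a rough sketch of how the per-sequence lengths (under whichever attribute name ends up being used) might drive prefill attention in a torch SDPA backend. The function name and tensor layout below are illustrative assumptions, not the actual ```torch_sdpa.py``` code:

```python
import torch
import torch.nn.functional as F

def sdpa_prefill(query, key, value, seqlens):
    # query/key/value: [num_tokens, num_heads, head_size], packed across sequences.
    # seqlens: per-sequence token counts (the attribute discussed above).
    outputs = []
    start = 0
    for seqlen in seqlens:
        end = start + seqlen
        q = query[start:end].transpose(0, 1)  # [num_heads, seqlen, head_size]
        k = key[start:end].transpose(0, 1)
        v = value[start:end].transpose(0, 1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        outputs.append(out.transpose(0, 1))   # back to [seqlen, num_heads, head_size]
        start = end
    return torch.cat(outputs, dim=0)
```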
@WoosukKwon Sure, please refer to #3654
@WoosukKwon Agreed, I think this might be a good direction to try. For these element-wise operations and normalization operations, using ```torch.compile``` would unify the front-end to Python code and use...
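As a minimal sketch of the idea (the op below is an RMSNorm-style normalization written for illustration, not vLLM's actual kernel), such operations could be expressed in plain PyTorch and left to ```torch.compile``` to fuse, instead of maintaining separate hand-written kernels per backend:

```python
import torch

# Hypothetical example: an RMSNorm-style op written in plain PyTorch and fused
# by torch.compile, rather than implemented as a backend-specific kernel.
@torch.compile
def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    x = x * torch.rsqrt(variance + eps)
    return x * weight

x = torch.randn(4, 4096, dtype=torch.bfloat16)
w = torch.ones(4096, dtype=torch.bfloat16)
out = rms_norm(x, w)
```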
@WoosukKwon Thanks for your comments! I have fixed most of them. For ```CPUModelRunner```, yes, you are right, isolating it from ```ModelRunner``` will avoid potential code breaks completely. We can do...
Hi @WoosukKwon Thanks for your further comments. I have fixed them all; please check. Thanks.
Hi @WoosukKwon Thanks for your effort in reviewing this large PR! I have added a CI script for the CPU backend, covering building and offline inference. It was deployed on vLLM...
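For reference, a minimal offline-inference smoke test of the kind such a CI job might run after building the CPU backend; the model and sampling options here are illustrative, not the actual CI contents:

```python
from vllm import LLM, SamplingParams

# Small model and short outputs keep the CI run cheap (values are examples).
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=16)

llm = LLM(model="facebook/opt-125m", dtype="bfloat16")
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```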
Hi @markluofd The online inference of the CPU backend is still under tuning; we will enable it when it is ready.
@markluofd Yes, the performance may have some regression, because the CPU inference thread pool (OpenMP), the HTTP service thread pool, and the tokenizer threads will compete for CPU cores. We plan to isolate the...
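One possible mitigation, sketched here for illustration rather than as the planned vLLM change, is to constrain the OpenMP inference pool before the engine starts so it does not spread over all cores:

```python
import os

# Must be set before the compute library initializes its thread pool.
os.environ["OMP_NUM_THREADS"] = "28"   # size of the inference thread pool (example value)
os.environ["OMP_PROC_BIND"] = "close"  # keep the OpenMP threads on adjacent cores
os.environ["OMP_PLACES"] = "cores"     # one place per physical core
```

The remaining cores can then be left for the HTTP service and tokenizer threads, e.g. by launching the server under ```taskset``` or ```numactl```.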
@markluofd FP16 will be cast to BF16 right now. BF16 is always supported even if there is no avx512_bf16 ISA. Pure FP16 support will be added soon; it might be at...
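In other words, the current behavior amounts to a dtype conversion on load; a trivial illustration (the tensor here is just an example weight):

```python
import torch

# FP16 weights are converted to BF16, which PyTorch emulates on any CPU,
# so no avx512_bf16 instructions are required for correctness.
fp16_weight = torch.randn(4096, 4096, dtype=torch.float16)
bf16_weight = fp16_weight.to(torch.bfloat16)
```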
Hi @ProExpertProg It is feasible to load different backend dylibs at runtime. vLLM has multiple backends with different dependencies and configurations, so it might be a lot of work to...
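A rough sketch of what runtime backend selection could look like; the library names and paths below are hypothetical, not vLLM's actual packaging:

```python
import torch

def load_backend_ops(device: str) -> None:
    # Pick which compiled ops library to load based on the target device,
    # instead of linking a single backend at build time.
    if device == "cpu":
        torch.ops.load_library("/path/to/libvllm_cpu_ops.so")
    elif device == "cuda":
        torch.ops.load_library("/path/to/libvllm_cuda_ops.so")
    else:
        raise ValueError(f"Unsupported device: {device}")
```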