Li, Jiang
This PR adds a new CPU backend to vLLM that supports basic model inference with the BF16 and FP32 dtypes. FP16 support and TP support will be added in...
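For context, inference on the new backend looks like ordinary vLLM usage. The sketch below is illustrative only: the model name is arbitrary, and it assumes a CPU-enabled vLLM build where the CPU backend is selected automatically.

```python
# Minimal sketch of BF16 inference on the CPU backend
# (assumes vLLM was built for CPU; model choice is illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", dtype="bfloat16")
params = SamplingParams(temperature=0.8, max_tokens=32)
for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)
```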
Hi, vLLM geniuses @WoosukKwon @zhuohan123. Motivated by requirements to execute vLLM on the CPU (e.g., #176), we recently implemented an initial prototype for CPU-only execution on the x86...
## Progress
- [ ] Integrate CPU executor to support the basic model inference (BF16/FP32) without TP.
  - #3634
  - #3824
  - #4113
- [ ] Support FP16 model inference....
For Trino:
- ShortTimestamp (a Long member, 64 bits)
- LongTimestamp (a Long member and an Int member, 96 bits)

For Velox: two Long members (128 bits)

```Timestamp(Precision)``` type signature...
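To make the layout difference concrete, here is a small plain-Python sketch converting between the two representations. The field semantics are an assumption based on the descriptions above (Trino's LongTimestamp holding epoch microseconds plus picoseconds within the microsecond, Velox holding whole seconds plus nanoseconds within the second); the function name is hypothetical.

```python
# Illustrative conversion from Trino LongTimestamp fields to Velox's
# two-field (seconds, nanos) layout. Field semantics are assumptions.
def trino_long_ts_to_velox(epoch_micros: int, picos_of_micro: int):
    seconds, rem_micros = divmod(epoch_micros, 1_000_000)
    # Picoseconds below nanosecond resolution are truncated.
    nanos = rem_micros * 1_000 + picos_of_micro // 1_000
    return seconds, nanos

print(trino_long_ts_to_velox(1_700_000_000_123_456, 789_000))
# -> (1700000000, 123456789)
```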
This PR enables vLLM multiprocessing in the CPU backend to improve async LLM engine performance and to support TP. The main changes include:
- Use utilities from ```vllm.executor.multiproc_worker_utils``` to manage workers in...
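For intuition, the worker-management pattern is roughly the one sketched below. This is not the actual ```vllm.executor.multiproc_worker_utils``` API; it is a hypothetical minimal version of the same idea: spawn one process per TP rank and dispatch method calls to all workers over queues.

```python
# Hypothetical minimal worker pool in the spirit of
# vllm.executor.multiproc_worker_utils (all names are illustrative).
import multiprocessing as mp

def _worker_loop(rank: int, task_q, result_q) -> None:
    # A real worker would hold the model shard for TP rank `rank`.
    while True:
        method, args = task_q.get()
        if method == "shutdown":
            break
        result_q.put((rank, f"{method} done on rank {rank}"))

class WorkerPool:
    def __init__(self, world_size: int):
        ctx = mp.get_context("spawn")  # fork is unsafe with some torch setups
        self.task_qs = [ctx.Queue() for _ in range(world_size)]
        self.result_q = ctx.Queue()
        self.procs = [
            ctx.Process(target=_worker_loop, args=(r, q, self.result_q))
            for r, q in enumerate(self.task_qs)
        ]
        for p in self.procs:
            p.start()

    def run_on_all(self, method: str, *args):
        for q in self.task_qs:
            q.put((method, args))
        return [self.result_q.get() for _ in self.procs]

    def shutdown(self):
        for q in self.task_qs:
            q.put(("shutdown", ()))
        for p in self.procs:
            p.join()

if __name__ == "__main__":
    pool = WorkerPool(world_size=2)
    print(pool.run_on_all("execute_model"))
    pool.shutdown()
```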
This PR provides the corresponding CPU kernels for compressed-tensor INT8 W8A8, based on oneDNN, to enable lowering compressed-tensor operations to the CPU device. Both static and dynamic modes are...
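To show the difference between the two modes, here is a small numeric sketch in plain PyTorch (no oneDNN; helper names are hypothetical): static quantization uses a scale calibrated ahead of time, while dynamic quantization derives the scale from each input at runtime.

```python
# Sketch of symmetric per-tensor INT8 activation quantization.
# Static mode: scale calibrated offline. Dynamic mode: scale computed
# from the current batch. (Illustrative only; the PR lowers to oneDNN.)
import torch

def quantize_int8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def dynamic_scale(x: torch.Tensor) -> torch.Tensor:
    return x.abs().amax() / 127.0

x = torch.randn(4, 8)
static_scale = torch.tensor(0.05)        # assumed calibration result
q_static = quantize_int8(x, static_scale)
q_dynamic = quantize_int8(x, dynamic_scale(x))
# Round-trip error of in-range values is bounded by half the scale.
print((q_dynamic.float() * dynamic_scale(x) - x).abs().max())
```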
Generate custom activation ops using ```torch.compile``` for the CPU backend. Main changes to vLLM:
- ~~Add ```_forward_native_impl``` to each custom op to avoid recompilation caused by tracing ```self```.~~

For vicuna-7b-v1.5, there...
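As a concrete illustration of the recompilation point, compiling a module-level function (rather than a bound method, so ```self``` never enters the trace) could look like the sketch below. ```silu_and_mul``` mirrors vLLM's SiluAndMul activation; the compile invocation itself is an assumption about how the PR wires things up.

```python
# Sketch: generate a fused CPU activation kernel with torch.compile.
# Compiling a free function avoids guards/retracing tied to `self`.
import torch
import torch.nn.functional as F

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

compiled_silu_and_mul = torch.compile(silu_and_mul, dynamic=True)

x = torch.randn(2, 16, dtype=torch.bfloat16)
print(compiled_silu_and_mul(x).shape)  # torch.Size([2, 8])
```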
Upgrade the CPU backend torch to 2.6.0; all tests are verified locally. Waiting for #12721