Zhang, Liangang
1) Vertical split embedding to scale out to many more ranks (see the sketch below). 2) LAMB to enable large batch sizes.
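A minimal sketch of item 1), assuming the embedding table is split along the embedding dimension so each rank holds only `embed_dim / world_size` columns; the class name, `world_size` handling, and the all-gather step are illustrative assumptions, not the code in this change.

```python
import torch
import torch.nn as nn


class VerticalSplitEmbedding(nn.Module):
    """Each rank stores only a vertical slice of the full embedding table."""

    def __init__(self, vocab_size: int, embed_dim: int, world_size: int):
        super().__init__()
        assert embed_dim % world_size == 0, "embed_dim must divide evenly across ranks"
        self.shard_dim = embed_dim // world_size
        self.weight = nn.Parameter(torch.randn(vocab_size, self.shard_dim) * 0.02)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        local = self.weight[token_ids]  # [..., shard_dim]
        # In the distributed runtime the per-rank slices would be all-gathered
        # (e.g. torch.distributed.all_gather) and concatenated back to embed_dim.
        return local


emb = VerticalSplitEmbedding(vocab_size=1000, embed_dim=64, world_size=4)
print(emb(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 16]): local shard only
```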
# What does this PR do? Based on the latest cache design in [#PR26681](https://github.com/huggingface/transformers/pull/26681), this PR implements the Paged Attention KV cache proposed in this [paper](https://arxiv.org/pdf/2309.06180.pdf). Fixes #...
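For context, a hedged sketch of the block-table bookkeeping that paged attention relies on; the class and method names here are illustrative, not the cache API added by this PR.

```python
# Illustrative only: a paged KV cache maps each sequence's logical token
# positions onto fixed-size physical blocks instead of one contiguous buffer.
from typing import Dict, List, Tuple


class BlockTable:
    def __init__(self, num_physical_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free: List[int] = list(range(num_physical_blocks))  # free physical block ids
        self.tables: Dict[int, List[int]] = {}                   # seq_id -> physical block ids

    def slot_for_token(self, seq_id: int, position: int) -> Tuple[int, int]:
        """Return (physical_block, offset) for the token at `position`,
        taking a new block from the free pool only at block boundaries."""
        table = self.tables.setdefault(seq_id, [])
        block_idx, offset = divmod(position, self.block_size)
        while len(table) <= block_idx:
            table.append(self.free.pop())  # raises IndexError when the pool is exhausted
        return table[block_idx], offset


# Sequences share one physical pool instead of each pre-reserving max_seq_len slots.
bt = BlockTable(num_physical_blocks=8, block_size=16)
print(bt.slot_for_token(seq_id=0, position=0))   # first block for sequence 0
print(bt.slot_for_token(seq_id=0, position=17))  # crosses a boundary, grabs a second block
```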
### Feature request Paged attention has been adopted by many serving engines, e.g., [vllm](https://github.com/vllm-project/vllm) and [tensorrt-llm](https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/tensorrt_llm/runtime/kv_cache_manager.py). ### Motivation The KV cache is used to reduce computation in the decoder layers, but...
Split PR #1480 into several smaller ones. This PR enables the use of different devices in the runtime.
PyTorch has supported the XPU device since the 2.4 release, and _xpu_ is also supported in OpenAI Triton. So it should work with the Triton attention backend in SGLang. In this PR,...
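A small hedged check of the XPU path described above, assuming a PyTorch >= 2.4 build with XPU support; the SGLang/Triton backend wiring itself is not shown here.

```python
import torch
import torch.nn.functional as F

# Fall back to CPU when no XPU build/device is present.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
dtype = torch.float16 if device == "xpu" else torch.float32

q = torch.randn(1, 8, 64, device=device, dtype=dtype)
k = torch.randn(1, 8, 64, device=device, dtype=dtype)
v = torch.randn(1, 8, 64, device=device, dtype=dtype)

# scaled_dot_product_attention dispatches to the active device's backend;
# on XPU this is where a Triton-based attention kernel can take over.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape, out.device)
```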
Fixes #163543 cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @chenyang78
# Motivation To improve quality on Intel XPU devices, we plan to enable the CI/CD process on Intel XPUs. The CI is based on the Docker environment and we will...