XiongfeiWei
### Description Hi. I am extending the Pallas paged attention kernel. The case is MQA (multi-query attention). When I run my kernel, I encounter the following error, which suggests it is...
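For context, the defining property of MQA is that a single shared K/V head serves every query head. Below is a minimal sketch of that shape handling in plain jax.numpy, not the actual Pallas kernel being extended; every name in it is illustrative rather than taken from vLLM or torch_xla.

```python
import jax
import jax.numpy as jnp

def mqa_attention(q, k, v):
    """q: [num_q_heads, q_len, head_dim]; k, v: [kv_len, head_dim].

    In MQA, K and V carry no head axis: the one shared KV head is
    broadcast across all query heads.
    """
    scores = jnp.einsum("hqd,kd->hqk", q, k) / jnp.sqrt(q.shape[-1])
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("hqk,kd->hqd", weights, v)

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(k1, (8, 16, 64))   # 8 query heads
k = jax.random.normal(k2, (16, 64))      # 1 shared KV head
v = jax.random.normal(k3, (16, 64))
print(mqa_attention(q, k, v).shape)      # (8, 16, 64)
```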
This PR integrates the new ragged paged attention kernel with vLLM v1 on TPU. In particular, this PR:
- Updates the torch_xla pin to the latest
- Updates pallas.py in v1...
Use the optimized block sizes after tuning the kernel.
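One plausible shape for such a change, shown purely as a hypothetical sketch (the keys, values, and function name below are illustrative, not the PR's actual code), is a lookup table of tuned block sizes with a conservative fallback for untuned configurations:

```python
# Hypothetical sketch only: tuned block sizes keyed on kernel parameters.
# (head_dim, page_size) -> (num_kv_pages_per_block, num_queries_per_block)
_TUNED_BLOCK_SIZES = {
    (128, 16): (16, 128),
    (128, 32): (8, 128),
}

def get_block_sizes(head_dim: int, page_size: int) -> tuple[int, int]:
    # Fall back to a conservative default for untuned configurations.
    return _TUNED_BLOCK_SIZES.get((head_dim, page_size), (8, 64))
```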
Reduce the size of block_table by getting rid of padding (see the sketch after the test plan).

Test plan:
1. `VLLM_USE_V1=1 pytest -s -v vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine 2>&1 | tee out.txt`
2. `VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests...`
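As a rough illustration of the idea (not the PR's actual code; the function and argument names are made up), dropping padded columns from the block table can look like this:

```python
import numpy as np

def trim_block_table(block_table: np.ndarray,
                     num_pages_per_seq: np.ndarray) -> np.ndarray:
    """Drop padded columns from a [num_seqs, max_pages] block table.

    Instead of padding every row out to the maximum number of pages the
    model could ever need, keep only as many columns as the longest live
    sequence actually uses.
    """
    max_used = int(num_pages_per_seq.max()) if num_pages_per_seq.size else 0
    return block_table[:, :max_used]
```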
This PR enables gemma3-27b with TP > 1 on multiple chips. Without the change, it fails with the following error:
```
callstack:
Traceback (most recent call last):
  File "/home/xiowei/vllm/vllm/v1/executor/multiproc_executor.py", line 465, in worker_busy_loop
    output...
```