Woosuk Kwon
Hi @rahulbatra85 Thanks for the PR! Could you please provide performance benchmarks? In particular, a benchmark in the Llama setting (head_size=128, num_kv_heads=8, etc.) would be useful.
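(For reference, a minimal timing sketch of that setting is below. It only exercises a single grouped-query attention call via PyTorch's `scaled_dot_product_attention`, not vLLM's benchmark suite, and the batch size, query-head count, and context length are illustrative assumptions.)

```python
# Minimal, illustrative timing sketch for the Llama-like attention shape
# mentioned above (head_size=128, num_kv_heads=8). Not the vLLM benchmark
# suite; batch size, number of query heads, and KV length are assumptions.
import time
import torch
import torch.nn.functional as F

head_size, num_kv_heads = 128, 8
num_q_heads = 32          # assumed (Llama-style 4:1 GQA ratio)
batch, kv_len = 16, 1024  # assumed decode batch size and context length

q = torch.randn(batch, num_q_heads, 1, head_size, device="cuda", dtype=torch.float16)
k = torch.randn(batch, num_kv_heads, kv_len, head_size, device="cuda", dtype=torch.float16)
v = torch.randn_like(k)

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    # enable_gqa requires PyTorch >= 2.5; otherwise repeat_interleave the KV heads.
    F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
torch.cuda.synchronize()
print(f"avg decode attention latency: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms")
```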
@youkaichao Thanks for letting me know! Just fixed it.
> Can you elaborate on the custom Pallas kernel for PagedAttention? Is there any links?

Good question. It's not open-sourced yet, but I was told that it will be released...
> Is this true after we moved to 1d query? Or does it mean we need to support both 1d and 2d query inputs?

@rkooo567 I believe the change won't affect...
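(A quick illustration of the terminology, with made-up sizes: the "2d" layout keeps a padded per-request query tensor, while the "1d" layout flattens all tokens of all requests into one tensor with no padding.)

```python
# Illustrative shapes only (hidden size and sequence lengths are made up).
import torch

hidden = 4096
seq_lens = [5, 3, 7]

# "2d" query: padded per-request layout.
q_2d = torch.zeros(len(seq_lens), max(seq_lens), hidden)

# "1d" query: all tokens flattened, no padding.
q_1d = torch.zeros(sum(seq_lens), hidden)

print(q_2d.shape)  # torch.Size([3, 7, 4096])
print(q_1d.shape)  # torch.Size([15, 4096])
```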
cc @ruisearch42 @richardliaw @comaniac
@zhuohan123 Can you please take another look?
@22quinn Thanks for volunteering! Could you please submit a PR by EoW?
@akeshet We plan to redesign the API for that. We will probably not allow per-request logits processors (because they are too complex and slow). We are exploring other options. Please...
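(For context, below is a minimal sketch of the kind of per-request logits processor being discussed, using the pre-V1 `SamplingParams.logits_processors` hook; the model name and banned token id are placeholders. Because it is an arbitrary Python callable invoked for one request at every decoding step, it is hard to batch or compile, which is the complexity/performance concern mentioned above.)

```python
# Sketch of a per-request logits processor via the pre-V1
# SamplingParams.logits_processors hook. Model name and token id are
# placeholders; the callable runs in Python for one request at each step.
import torch
from vllm import LLM, SamplingParams

def ban_token(generated_token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # Mask a single (placeholder) token id for this request only.
    logits[42] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=16, logits_processors=[ban_token])
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```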
@maliknaik16 Please feel free to take it! @22quinn Let us know if you already have the PR.
@22quinn Oh great. Thanks!