[Tentative] Adding 192 head dim (step_size = 12)
I'm not sure this PR works correctly yet (I'm going to verify it).
Deepseek models use head_dim=192, which currently cannot be compiled because of this static assert.
This modification works by alternating the jump each iteration: a step of 4 followed by a step of 8, for an effective step_size of 12.
Let me know if this is of interest here.
@Narsil Could you write a unit test for it? You can also refer to https://github.com/zhyncs/dl/blob/master/flashinfer_build.sh to compile from source.
Which tests would you like me to add? batch_prefill_kernels? Others?
What do you mean by referring to the build script? I am already building from source.
> Which tests would you like me to add? batch_prefill_kernels? Others?
https://github.com/flashinfer-ai/flashinfer/blob/main/python/tests/test_batch_prefill_kernels.py https://github.com/flashinfer-ai/flashinfer/blob/main/python/tests/test_batch_decode_kernels.py
> I am building already.
ok
Are the tests run anywhere?
> Are the tests run anywhere?
There is currently no CI configured; you can run the tests locally with pytest in your development environment.
Hi @Narsil, thanks for your contribution! You can add 192 to https://github.com/flashinfer-ai/flashinfer/blob/0d618712faff20a84bbd513d02ac01e16be19306/python/setup.py#L67
If compilation succeeds, you can run unit tests such as https://github.com/flashinfer-ai/flashinfer/blob/0d618712faff20a84bbd513d02ac01e16be19306/python/tests/test_batch_prefill_kernels.py and see how it works.
Hi @Narsil Any update?
cc @yzh119
Hi @zhyncs, I'll create a PR to support any head_dim that is divisible by 16, and 192 will be supported there.
While I appreciate the effort behind this PR, I don't think the implementation is correct: step_size=8 aligns with the granularity of CUDA's cp.async instruction, but there is no such instruction that aligns with step_size=12.
Ok that's good. I'll close this PR for now. Thanks all! @yzh119 @Narsil