
[Tentative] Adding 192 head dim (step_size = 12)

Open Narsil opened this issue 1 year ago • 7 comments

Not sure if this PR actually works correctly (I'm going to check it).

DeepSeek models use head_dim=192, which currently fails to compile because of this static assert. This modification works around it by stepping with step_size=4 plus an extra jump of 8 every iteration.

Let me know if this is of interest here.

Narsil avatar Aug 19 '24 08:08 Narsil
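A rough sketch of how the stepping described above might be read; this is only arithmetic, not the actual kernel code, and the 4+8 pattern is an assumption taken from the issue text:

```python
# My reading of the proposed traversal (an assumption, not flashinfer code):
# each iteration advances by 4 and then by 8, so the effective step is
# 4 + 8 = 12 elements, and head_dim = 192 is covered in 192 / 12 = 16 iterations.
head_dim = 192
step_size = 4 + 8
assert head_dim % step_size == 0
assert head_dim // step_size == 16
```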

@Narsil Could you write a unit test for it? You can also reference https://github.com/zhyncs/dl/blob/master/flashinfer_build.sh to compile from source.

zhyncs avatar Aug 19 '24 08:08 zhyncs

Which tests would you like me to add: batch_prefill_kernels? Others?

What do you mean by referencing that to compile from source? I am already building from source.

Narsil avatar Aug 19 '24 08:08 Narsil

Which tests would you like me to add: batch_prefill_kernels? Others?

https://github.com/flashinfer-ai/flashinfer/blob/main/python/tests/test_batch_prefill_kernels.py
https://github.com/flashinfer-ai/flashinfer/blob/main/python/tests/test_batch_decode_kernels.py

I am already building from source.

ok

zhyncs avatar Aug 19 '24 08:08 zhyncs

Are the tests run anywhere?

Narsil avatar Aug 19 '24 08:08 Narsil

Are the tests run anywhere?

There is currently no CI configured; you can run them locally with pytest in your development environment.

zhyncs avatar Aug 19 '24 09:08 zhyncs
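For a quick local check beyond the existing test files, a minimal pytest case along the following lines could work; it assumes flashinfer exposes single_prefill_with_kv_cache with NHD-layout fp16 tensors and that head_dim=192 has been compiled in:

```python
# Minimal hypothetical sanity check for head_dim=192 (not one of the existing
# test files). Assumes flashinfer.single_prefill_with_kv_cache accepts
# NHD-layout tensors and that 192 is among the compiled head dims.
import torch
import flashinfer

def test_single_prefill_head_dim_192():
    qo_len, kv_len, num_heads, head_dim = 128, 128, 8, 192
    q = torch.randn(qo_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
    k = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
    v = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
    out = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)
    assert out.shape == (qo_len, num_heads, head_dim)
```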

Hi @Narsil, thanks for your contribution! You can add 192 to https://github.com/flashinfer-ai/flashinfer/blob/0d618712faff20a84bbd513d02ac01e16be19306/python/setup.py#L67

If compilation succeeds, you can run unit tests such as https://github.com/flashinfer-ai/flashinfer/blob/0d618712faff20a84bbd513d02ac01e16be19306/python/tests/test_batch_prefill_kernels.py and see how it works.

yzh119 avatar Aug 19 '24 16:08 yzh119
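For illustration only, the suggested change might look like the sketch below, assuming setup.py keeps the head dims to generate kernels for in a plain Python list (the linked #L67 is the authoritative location; the variable name here is hypothetical):

```python
# Hypothetical sketch of the suggested edit (variable name assumed; see the
# linked setup.py#L67 for the real location and format).
head_dims = [64, 128, 256]   # dims the build currently generates kernels for
head_dims.append(192)        # add DeepSeek's head_dim so kernels get compiled for it
```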

Hi @Narsil Any update?

zhyncs avatar Sep 19 '24 10:09 zhyncs

cc @yzh119

zhyncs avatar Oct 09 '24 17:10 zhyncs

Hi @zhyncs, I'll create a PR to support any head_dim that is divisible by 16, and 192 will be supported there.

While I appreciate the effort in this PR, I think the implementation is not correct: step_size=8 aligns with the granularity of CUDA's cp.async instruction, but there is no such instruction that aligns with step_size=12.

yzh119 avatar Oct 09 '24 18:10 yzh119
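A rough arithmetic sketch of this point, assuming fp16 (2 bytes per element) and cp.async's documented transfer sizes of 4, 8, and 16 bytes:

```python
# cp.async copies 4, 8, or 16 bytes per instruction. With fp16, the largest
# transfer moves 16 / 2 = 8 elements, which is why step_size=8 matches the
# hardware granularity; a 12-element step would need a 24-byte transfer,
# which no cp.async variant provides.
bytes_per_elem = 2                 # fp16
cp_async_sizes_bytes = (4, 8, 16)  # supported cp-size values
assert 8 * bytes_per_elem in cp_async_sizes_bytes        # step_size = 8 -> 16 B, supported
assert 12 * bytes_per_elem not in cp_async_sizes_bytes   # step_size = 12 -> 24 B, unsupported
```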

Ok that's good. I'll close this PR for now. Thanks all! @yzh119 @Narsil

zhyncs avatar Oct 09 '24 18:10 zhyncs