Tri Dao
max_seqlen_k is a variable on CPU. After the kernel is captured, changing this value will have no effect.
It's similar to other variables on CPU, such as softmax_scale. If the kernel is captured with softmax_scale = 1.0, then if you change softmax_scale to 2.0 afterwards and replay...
You're trying to change a CPU variable after capturing the CUDA graph; that's not supported by CUDA graphs. I haven't looked closely, but it looks like in this case the kernel is...
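To see the behavior in isolation, here's a minimal standalone sketch in plain PyTorch (not the flash-attn API; the `x * scale` kernel and variable names are just illustrative): a CPU-side Python scalar is baked into the graph at capture time, and only the contents of the captured GPU tensors can change between replays.

```python
# Sketch: a CPU scalar captured in a CUDA graph is frozen at its
# capture-time value on replay; only GPU tensor contents can be updated.
import torch

x = torch.randn(4, device="cuda")
scale = 1.0  # CPU-side value, baked into the kernel launch at capture

# CUDA graphs require a warmup run on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    y = x * scale
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x * scale  # captured with scale == 1.0

scale = 2.0        # no effect: the replay still multiplies by 1.0
g.replay()
torch.cuda.synchronize()
assert torch.allclose(y, x)  # y == 1.0 * x, not 2.0 * x

x.copy_(2 * x)  # GPU tensor *contents* CAN be updated between replays
g.replay()      # uses the new data in x, but still scale == 1.0
```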
You'd want
```
auto tile_n = cute::gcd(cute::min(_32{}, size(TileShape_MNK{})), size(TileShape_MNK{}));
```
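(Presumably the gcd is what guarantees divisibility here: gcd(a, b) always divides b, so the resulting tile_n evenly divides size(TileShape_MNK{}), which cute::min alone would not ensure.)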
Wonderful work on the Triton implementation, and very thoughtful suggestions here. Thanks @janEbert! Yes, I'd love to stay up to date with upstream Triton; I just haven't had time to...
> I can take care of some of the integrations from upstream to here if you're fine with losing backward-compatibility. The attention mask/bias will probably not be integrated upstream due...
> Sorry, I've just edited the post above: My only worry is having to figure out the workarounds that had to be implemented here. Were they necessary to support the...
It's because there are people willing to put in the work to make it work for Hopper. So far, no one has contributed the equivalent effort for Turing.
It depends on folks contributing to make it work for Turing.
Thanks for this contribution. This is very impressive! However, I think having different qk headdim and v headdim complicates the code and increases the maintenance workload. I believe it's better...