dan_the_3rd

83 comments by dan_the_3rd

Yes, it works when `K2 < 32` or `K2 % 32 == 0` indeed.

> We can fix it pretty quickly

Oh cool! That would be great :)
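For reference, the condition quoted in the comment above can be written as a small predicate. This is only a sketch of the stated rule: the function name `k2_is_supported` is hypothetical, and the check says nothing about how the underlying kernel enforces it.

```python
def k2_is_supported(k2: int) -> bool:
    """Return True when k2 satisfies the rule from the comment:
    K2 < 32 or K2 % 32 == 0.

    Hypothetical helper for illustration only; it is not part of
    any library mentioned in the thread.
    """
    if k2 <= 0:
        # Assumption: a dimension must be a positive integer.
        raise ValueError("k2 must be a positive integer")
    return k2 < 32 or k2 % 32 == 0


if __name__ == "__main__":
    for k2 in (16, 32, 40, 64):
        print(k2, k2_is_supported(k2))
```

Under this rule, sizes such as 16, 32 and 64 pass, while a size such as 40 does not.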

Just wanted to follow up on this quickly - do you have an ETA on when this could be available? Or maybe you can give me a few pointers to fix...

> please note that this doesn't change cutlass's requirement that THREAD_BLOCK_TILE_N % 32 == 0 due to shared memory loading patterns

Yes, that's totally understandable. `n=48` was just an example...

The corresponding PR: https://github.com/NVIDIA/cutlass/pull/590

Looking quickly through the code it looks great! Thanks a lot for putting that up so fast :) Unfortunately, I'll be away for the entire month of August. I can...

So I gave it a shot and it's working great! Thanks for putting this together :) There is just one issue when `problem_size_0_n > 32 && problem_size_0_n % 2 ==...

Hi @MatthieuTPHR - this looks like a great improvement!

> Would it be possible to add a more optimised kernel for head-dim=40 which is the parameter used in stable diffusion....

@TheLastBen what is your GPU model? xformers supports architectures sm60 and above (P100+) - and possibly sm50 and above (untested). The most important speedups are achieved on GPUs with tensor cores (sm70+...
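The tiers described in the comment above can be sketched as a lookup on the CUDA compute capability. This is an illustration of what the comment says, not an xformers API: the function name `support_tier` and the tier labels are hypothetical, and "untested" reflects the comment's own wording for sm50.

```python
def support_tier(major: int, minor: int = 0) -> str:
    """Map a CUDA compute capability (e.g. 7, 5 for sm75) to the tier
    described in the comment.

    Hypothetical helper; the labels are illustrative only.
    """
    sm = major * 10 + minor
    if sm >= 70:
        # Tensor cores: where the comment says the main speedups are.
        return "supported (tensor cores)"
    if sm >= 60:
        # P100 and newer.
        return "supported"
    if sm >= 50:
        # The comment marks this range as possible but untested.
        return "untested"
    return "unsupported"


if __name__ == "__main__":
    for cc in ((5, 2), (6, 0), (7, 5), (8, 6)):
        print(f"sm{cc[0]}{cc[1]}: {support_tier(*cc)}")
```

With PyTorch installed, the capability pair can be obtained from `torch.cuda.get_device_capability()` and passed to this function.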

I believe the ints might be the random seed in case of dropout (which we don't use anyway). That's something we should be able to fix. Let's move the discussion...

Hi @C43H66N12O12S2 - thanks for the heads-up. Do you have some pointers to share on these "quality degradations" - is it something you have seen yourself? Do you have a...