Sasha Rush

Results: 119 comments of Sasha Rush

Here's an example, adapted from https://srush.github.io/annotated-s4/#an-ssm-neural-network.

```python
class SeqInternal(nn.Module):
    def setup(self):
        self.B = self.param("B", lecun_normal(), (self.N, 1))  # would love this be vmap'ped on bind
        self.K = slowfunction(self.B)

    def __call__(self, ...
```
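For readers without the thread context, here is a minimal, self-contained sketch of the pattern the snippet above is reaching for; the state size `N`, the body of `slowfunction`, and the `__call__` implementation are placeholder choices of mine, not the original code.

```python
import jax
import jax.numpy as jnp
from flax import linen as nn
from jax.nn.initializers import lecun_normal


def slowfunction(B):
    # Stand-in for an expensive computation derived from the parameter,
    # e.g. materializing a convolution kernel from SSM parameters.
    return jnp.tanh(B)


class SeqInternal(nn.Module):
    N: int = 16

    def setup(self):
        # Learned parameter of shape (N, 1).
        self.B = self.param("B", lecun_normal(), (self.N, 1))
        # Derived value computed from the parameter inside setup; this is
        # the call the original comment would like vmap'ped on bind.
        self.K = slowfunction(self.B)

    def __call__(self, u):
        # Placeholder body: combine the input with the derived kernel.
        return u @ self.K


# Usage sketch:
model = SeqInternal(N=16)
x = jnp.ones((4, 16))
variables = model.init(jax.random.PRNGKey(0), x)
y = model.apply(variables, x)
```

The point of interest is that `self.K` is derived from a parameter inside `setup`, so the slow call re-runs every time the module is bound (each `init`/`apply`).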

Neat, thanks! I'll have to parse a bit why this hack works, but it's neat that you can do it.

Oh, I'll fix this up and make sure they are compatible.

Yes, I ran the Triton kernel with BLOCK_SIZE=1024 (as shown above), but the `asm["ptx"]` that it produces still has `.maxntid 128, 1, 1`. Am I doing something wrong? Should BLOCK_SIZE...

Hmm, I'm confused. So if I want to run the output PTX from Triton that was originally compiled with block_size 1024, should I run CUDA blocks of 1024 / num_warps? What do...

Great, so that answers half my question. It sounds like the CTA here for CUDA should be 128, corresponding to 32 threads/warp * 4 warps/CTA. But I still don't...

Oh, I think I get it now. If I have a Triton BLOCK_SIZE of 1024, I should still use blocks of 128 in CUDA, and it will correspond to the original Triton function....
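To make that concrete, here's a minimal sketch I put together (the add kernel is my own illustration, and the way the compiled handle and its `asm["ptx"]` are accessed varies across Triton versions): with BLOCK_SIZE=1024 and num_warps=4, each Triton program still covers 1024 elements, but the generated PTX is annotated with `.maxntid 128, 1, 1` because a CTA is 32 threads/warp * 4 warps.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each *program* processes BLOCK_SIZE elements, independent of how many
    # hardware threads the compiler assigns to it.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)

# One program per 1024 elements -> gridDim.x = 4 here.
grid = (triton.cdiv(x.numel(), 1024),)

# num_warps=4 means the CTA is 4 * 32 = 128 threads, so the PTX carries
# `.maxntid 128, 1, 1` even though BLOCK_SIZE is 1024. A hand-written CUDA
# launch of this PTX would therefore use blockDim.x = 128, not 1024.
handle = add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024, num_warps=4)

# In the Triton builds I've tried, the launch returns a compiled-kernel
# handle whose .asm dict exposes the emitted PTX (as in the thread above).
print(".maxntid 128" in handle.asm["ptx"])
```

In other words, BLOCK_SIZE controls how much data each program handles, while num_warps (times 32) controls the CUDA block dimension used to launch the PTX.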

This worked for me for simple functions, but now it seems to be failing for more complex functions. I'm getting weird bugs with alignment and outputs. Besides grid and shared...

Right, yeah, I forget why I didn't add that. Need to think about what it means for torch-style AD.

Amazing! I was just planning on doing this.