Sasha Rush

Results: 119 comments of Sasha Rush

Here's an example, adapted from https://srush.github.io/annotated-s4/#an-ssm-neural-network.

```python
class SeqInternal(nn.Module):
    def setup(self):
        self.B = self.param("B", lecun_normal(), (self.N, 1))  # would love this be vmap'ped on bind
        self.K = slowfunction(self.B)

    def __call__(self, ...
```
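For readers without the thread context, here is a minimal, self-contained sketch of the pattern the snippet above is reaching for; the state size `N`, the body of `slowfunction`, and the `__call__` implementation are placeholder choices of mine, not the original code.

```python
import jax
import jax.numpy as jnp
from flax import linen as nn
from jax.nn.initializers import lecun_normal


def slowfunction(B):
    # Stand-in for an expensive computation derived from the parameter,
    # e.g. materializing a convolution kernel from SSM parameters.
    return jnp.tanh(B)


class SeqInternal(nn.Module):
    N: int = 16

    def setup(self):
        # Learned parameter of shape (N, 1).
        self.B = self.param("B", lecun_normal(), (self.N, 1))
        # Derived value computed from the parameter inside setup; this is
        # the call the original comment would like vmap'ped on bind.
        self.K = slowfunction(self.B)

    def __call__(self, u):
        # Placeholder body: combine the input with the derived kernel.
        return u @ self.K


# Usage sketch:
model = SeqInternal(N=16)
x = jnp.ones((4, 16))
variables = model.init(jax.random.PRNGKey(0), x)
y = model.apply(variables, x)
```

The point of interest is that `self.K` is derived from a parameter inside `setup`, so the slow call re-runs every time the module is bound (each `init`/`apply`).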

Neat, thanks! I'll have to parse a bit why this hack works, but it's neat that you can do it.

Oh, I'll fix this up and make sure they are compatible.

Yes, I ran the Triton kernel with BLOCK_SIZE=1024 (as shown above), but the `asm["ptx"]` that it produces still has `.maxntid 128, 1, 1`. Am I doing something wrong? Should BLOCK_SIZE...

Hmm, I'm confused. So if I want to run the output PTX from Triton that was originally compiled with block_size 1024, should I run CUDA blocks of 1024 / num_warps? What do...

Great, so that answers half my question. It sounds like the CTA here for CUDA should be 128, corresponding to 32 threads/warp * 4 warps/CTA. But I still don't...

Oh, I think I get it now. If I have a Triton BLOCK_SIZE of 1024, I should still use blocks of 128 in CUDA, and it will correspond to the original Triton function....
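To make that concrete, here's a minimal sketch I put together (the add kernel is my own illustration, and the way the compiled handle and its `asm["ptx"]` are accessed varies across Triton versions): with BLOCK_SIZE=1024 and num_warps=4, each Triton program still covers 1024 elements, but the generated PTX is annotated with `.maxntid 128, 1, 1` because a CTA is 32 threads/warp * 4 warps.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each *program* processes BLOCK_SIZE elements, independent of how many
    # hardware threads the compiler assigns to it.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)

# One program per 1024 elements -> gridDim.x = 4 here.
grid = (triton.cdiv(x.numel(), 1024),)

# num_warps=4 means the CTA is 4 * 32 = 128 threads, so the PTX carries
# `.maxntid 128, 1, 1` even though BLOCK_SIZE is 1024. A hand-written CUDA
# launch of this PTX would therefore use blockDim.x = 128, not 1024.
handle = add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024, num_warps=4)

# In the Triton builds I've tried, the launch returns a compiled-kernel
# handle whose .asm dict exposes the emitted PTX (as in the thread above).
print(".maxntid 128" in handle.asm["ptx"])
```

In other words, BLOCK_SIZE controls how much data each program handles, while num_warps (times 32) controls the CUDA block dimension used to launch the PTX.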

This worked for me for simple functions, but now it seems to be failing for more complex functions. I'm getting weird bugs with alignment and outputs. Besides grid and shared...

Right, yeah, I forget why I didn't add that. Need to think about what it means for torch-style AD.

Amazing! I was just planning on doing this.