Tri Dao

429 comments by Tri Dao

You can try FA3 too, which runs on A100 now. Btw, the Triton bwd does not support causal=False: when you call it with causal=False, it still runs with causal=True. You can...

Most likely register spilling, if I had to guess. You can try smaller block sizes to see if that helps.

@KimmiShi Can you post a short script to reproduce the error? Something like:
```
# Construct DropoutAddRMSNorm module
# Generate q
# Pass q to the module, get error
```
...
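For reference, a minimal repro sketch along those lines. The import path and constructor arguments for DropoutAddRMSNorm are assumptions here; check the actual flash-attn API, and match the shapes/dtype/strides of your failing run:

```python
import torch
# Assumed import path and constructor arguments -- verify against the
# flash-attn source before running.
from flash_attn.ops.rms_norm import DropoutAddRMSNorm

# Construct DropoutAddRMSNorm module (hidden size matching q's last dim)
norm = DropoutAddRMSNorm(128, p=0.0, eps=1e-6).to(device="cuda", dtype=torch.float16)

# Generate q with the same shape / dtype / layout as in the failing run
q = torch.randn(2, 8, 512, 128, device="cuda", dtype=torch.float16)

# Pass q to the module; ideally this call reproduces the error
out = norm(q)
print(out.shape)
```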

It requires the last dimension to be a multiple of 8, as mentioned in the README. We do call `.contiguous()` and check that the dimension is divisible by 8. Maybe there's some...

If you can print out more info (shape, stride, dtype) about the input to DropoutAddRMSNorm, that would also help me reproduce the error. E.g., before self.q_norm:
```
input = q.transpose(1,
```
...
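Something along these lines, right before the norm call. The helper name and the transpose below are only illustrative; print whatever tensor is actually passed to self.q_norm:

```python
import torch

def debug_tensor(name, t):
    # Print the layout details needed to reproduce an alignment / shape issue.
    print(f"{name}: shape={tuple(t.shape)} stride={t.stride()} dtype={t.dtype} "
          f"contiguous={t.is_contiguous()} data_ptr%16={t.data_ptr() % 16}")

# Example with a stand-in q; in the real code, call debug_tensor on the exact
# tensor that gets fed to self.q_norm (e.g. q.transpose(1, 2)).
q = torch.randn(2, 512, 8, 128, dtype=torch.float16)
debug_tensor("q_norm input", q.transpose(1, 2))
```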

Thanks for the repro script; I've narrowed it down to a memory alignment problem. We expect all input tensors to be aligned to 16 bytes (in order to use vectorized...
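A quick way to check the alignment of a given tensor: `data_ptr()` is the byte address of the first element, so a remainder of 0 modulo 16 means the tensor starts on a 16-byte boundary.

```python
import torch

def is_16b_aligned(t: torch.Tensor) -> bool:
    # data_ptr() is the byte address of the first element.
    return t.data_ptr() % 16 == 0

x = torch.randn(4, 128, dtype=torch.float16)
print(is_16b_aligned(x))         # freshly allocated tensors are normally aligned
print(is_16b_aligned(x[:, 1:]))  # a sliced view can start 2 bytes into the storage
```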

I don't know a reliable way to get 16-byte alignment, but I've posted a [question](https://discuss.pytorch.org/t/how-to-ensure-that-tensor-data-ptr-is-aligned-to-16-bytes/183440) to the PyTorch forum.

I pushed a commit to (hopefully) make sure that memory addresses are aligned to 16 bytes by cloning the inputs.
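Not the actual commit, but the general pattern looks roughly like this: clone into freshly allocated (and therefore aligned) memory only when the input is misaligned.

```python
import torch

def ensure_16b_aligned(t: torch.Tensor) -> torch.Tensor:
    # Return t unchanged if its first element sits on a 16-byte boundary;
    # otherwise copy it into freshly allocated memory, which PyTorch's
    # allocators align well past 16 bytes.
    if t.data_ptr() % 16 == 0:
        return t
    return t.clone()

x = torch.randn(4, 128, dtype=torch.float16)[:, 1:]  # misaligned view
y = ensure_16b_aligned(x)
print(x.data_ptr() % 16, y.data_ptr() % 16)  # e.g. 2 and 0
```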

Yes, I can reproduce it. I don't have the bandwidth right now to debug it. I'm not familiar with DeepSpeed; I suspect it puts all parameters in a buffer and...