David MacLeod

Results: 26 comments by David MacLeod

Ah, sorry, I misinterpreted your proposal. OK, this makes sense: we interleave the target memory addresses for each split of the tensor. Other than timing, is there any way...

Or, thinking more about this:

> A lot of kernels can still work when the block size is a multiple of 32*num_warps. As an interim solution until slicing is supported,...
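As a hedged illustration of the interim constraint quoted above: a block size that is a multiple of 32 * num_warps (32 threads per warp on NVIDIA hardware). The helper names below are illustrative, not part of the Triton API.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def is_valid_block_size(block_size: int, num_warps: int) -> bool:
    """Check the interim constraint: block_size is a multiple of 32 * num_warps."""
    return block_size % (WARP_SIZE * num_warps) == 0

def smallest_valid_block_size(n_elements: int, num_warps: int) -> int:
    """Round n_elements up to the nearest multiple of 32 * num_warps."""
    granule = WARP_SIZE * num_warps
    return ((n_elements + granule - 1) // granule) * granule

print(is_valid_block_size(256, 4))        # 256 % 128 == 0 -> True
print(smallest_valid_block_size(200, 4))  # rounds up to 256
```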

Great news, is there some branch/PR we can track the progress of this?

@ptillet I am very keen to have a go at using this feature whatever state the code currently is in, even if it is only the unit test you mentioned...

> We have a prototype that works with an old version of Triton. You might be able to hack it for your needs? #490

Great thanks @gaxler,...

@gaxler should there be a correlation between the triton `BLOCK_SIZE` defined in the kernel definition, and the `gX`, `gY`, `gZ` defined in `GridWarps` when calling the kernel?
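For context on the question above, a sketch of the usual relationship: each grid dimension is the number of program instances needed to tile the data, i.e. ceil(n_elements / BLOCK_SIZE) along that axis. Mapping the result onto the `gX`, `gY`, `gZ` fields of `GridWarps` is an assumption here, not confirmed API behaviour.

```python
import math

def grid_dims(shape, block_sizes):
    """Compute (gX, gY, gZ) so a grid of BLOCK_SIZE tiles covers `shape`."""
    dims = [math.ceil(n / b) for n, b in zip(shape, block_sizes)]
    dims += [1] * (3 - len(dims))  # pad unused grid axes with 1
    return tuple(dims)

# A 1D kernel over 10_000 elements with BLOCK_SIZE = 1024:
print(grid_dims((10_000,), (1024,)))  # -> (10, 1, 1)
```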

Great, thanks! I now have it working, but have noticed the performance is much worse than the JIT Triton equivalent. From the profile trace I see large gaps between the...

If I know my target hardware a priori, are there any downsides/gotchas to dumping the PTX code to a file, compiling it down to cubin, and loading that instead? Could...
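A minimal sketch of the offline PTX-to-cubin step being asked about, assuming the CUDA toolkit's `ptxas` is on PATH and the target architecture (e.g. `sm_80`) is known ahead of time. The file names are illustrative.

```python
import subprocess

def ptxas_cmd(ptx_path: str, cubin_path: str, arch: str = "sm_80"):
    """Build the ptxas invocation that compiles PTX down to a cubin."""
    return ["ptxas", f"-arch={arch}", "-O3", ptx_path, "-o", cubin_path]

def compile_ptx(ptx_path: str, cubin_path: str, arch: str = "sm_80"):
    """Run ptxas; raises CalledProcessError if compilation fails."""
    subprocess.run(ptxas_cmd(ptx_path, cubin_path, arch), check=True)

print(ptxas_cmd("kernel.ptx", "kernel.cubin"))
```

The resulting cubin can then be loaded at runtime with the CUDA driver API (`cuModuleLoad` / `cuModuleGetFunction`), skipping the on-device PTX JIT step.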

Converting to cubin has helped a lot! (In the trace, the Triton kernel is the one that sits between the orange and green.)

[Attached trace screenshots: JIT, AOT - PTX, AOT - cubin...]

Tried caching the loaded CUFunction and things are now looking very close to JIT performance (only 5-10% slower now) 🙂