David MacLeod

Results: 26 comments by David MacLeod

Ah, sorry, I misinterpreted your proposal. OK, this makes sense: we interleave the target memory addresses for each split of the tensor. Other than timing, is there any way...

Or, thinking more about this:

> A lot of kernels can still work when the block size is a multiple of 32*num_warps. As an interim solution until slicing is supported,...
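As a hedged illustration of the interim constraint quoted above: a block size that is a multiple of 32 * num_warps (32 threads per warp on NVIDIA hardware). The helper names below are illustrative, not part of the Triton API.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def is_valid_block_size(block_size: int, num_warps: int) -> bool:
    """Check the interim constraint: block_size is a multiple of 32 * num_warps."""
    return block_size % (WARP_SIZE * num_warps) == 0

def smallest_valid_block_size(n_elements: int, num_warps: int) -> int:
    """Round n_elements up to the nearest multiple of 32 * num_warps."""
    granule = WARP_SIZE * num_warps
    return ((n_elements + granule - 1) // granule) * granule

print(is_valid_block_size(256, 4))        # 256 % 128 == 0 -> True
print(smallest_valid_block_size(200, 4))  # rounds up to 256
```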

Great news, is there some branch/PR we can track the progress of this?

@ptillet I am very keen to have a go at using this feature whatever state the code currently is in, even if it is only the unit test you mentioned...

> We have a prototype that works with an old version of Triton. You might be able to hack it for your needs? #490

Great thanks @gaxler,...

@gaxler should there be a correlation between the triton `BLOCK_SIZE` defined in the kernel definition, and the `gX`, `gY`, `gZ` defined in `GridWarps` when calling the kernel?
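For context on the question above, a sketch of the usual relationship: each grid dimension is the number of program instances needed to tile the data, i.e. ceil(n_elements / BLOCK_SIZE) along that axis. Mapping the result onto the `gX`, `gY`, `gZ` fields of `GridWarps` is an assumption here, not confirmed API behaviour.

```python
import math

def grid_dims(shape, block_sizes):
    """Compute (gX, gY, gZ) so a grid of BLOCK_SIZE tiles covers `shape`."""
    dims = [math.ceil(n / b) for n, b in zip(shape, block_sizes)]
    dims += [1] * (3 - len(dims))  # pad unused grid axes with 1
    return tuple(dims)

# A 1D kernel over 10_000 elements with BLOCK_SIZE = 1024:
print(grid_dims((10_000,), (1024,)))  # -> (10, 1, 1)
```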

Great, thanks! I now have it working, but have noticed the performance is much worse than the JIT Triton equivalent. From the profile trace I see large gaps between the...

If I know my target hardware a priori, are there any downsides/gotchas to dumping the PTX code to a file, compiling it down to cubin, and loading that instead? Could...
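A minimal sketch of the offline PTX-to-cubin step being asked about, assuming the CUDA toolkit's `ptxas` is on PATH and the target architecture (e.g. `sm_80`) is known ahead of time. The file names are illustrative.

```python
import subprocess

def ptxas_cmd(ptx_path: str, cubin_path: str, arch: str = "sm_80"):
    """Build the ptxas invocation that compiles PTX down to a cubin."""
    return ["ptxas", f"-arch={arch}", "-O3", ptx_path, "-o", cubin_path]

def compile_ptx(ptx_path: str, cubin_path: str, arch: str = "sm_80"):
    """Run ptxas; raises CalledProcessError if compilation fails."""
    subprocess.run(ptxas_cmd(ptx_path, cubin_path, arch), check=True)

print(ptxas_cmd("kernel.ptx", "kernel.cubin"))
```

The resulting cubin can then be loaded at runtime with the CUDA driver API (`cuModuleLoad` / `cuModuleGetFunction`), skipping the on-device PTX JIT step.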

Converting to cubin has helped a lot! (In the trace, the Triton kernel is the one that sits between the orange and green.)

[Attached trace screenshots: JIT, AOT - PTX, AOT - cubin...]

Tried caching the loaded CUFunction and things are now looking very close to JIT performance (only 5-10% slower now) 🙂