AOT compilation
Hi, I was just wondering if there had been any more thoughts on supporting AOT kernel compilation to allow execution outside of Python? Referencing https://github.com/openai/triton/issues/175
We are waiting on a rewrite to be done
See: https://github.com/openai/triton/pull/490#issuecomment-1238752684
Nice! Do you have a rough estimate of when it will be done?
The rewrite will be done this month. There is some very basic AOT that we made for unit-testing purposes right now, but efforts on a more complex one will be able to resume after that.
And what will AoT compilation generate, a C/C++ API plus source/.so?
Great news, is there some branch/PR we can track the progress of this?
@ptillet I am very keen to have a go at using this feature whatever state the code currently is in, even if it is only the unit test you mentioned previously (have a time sensitive project which could benefit from AOT functionality)
We have a prototype that works with an old version of Triton. You might be able to hack it for your needs? https://github.com/openai/triton/pull/490
And what will AoT compilation generate, a C/C++ API plus source/.so?
For previous iterations we started with C code that holds the kernels as source. The thinking is to give users something very general.
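Something along these lines, purely as an illustration (this is not the actual generated interface; the header layout and names like `add_kernel_launch` are made up):

```c
/* Hypothetical sketch only -- NOT the interface the AOT tool actually emits.
 * The idea: the generated .c/.h pair embeds the compiled kernel (PTX or cubin)
 * as a byte array and exposes a plain C launch function, so the result can be
 * linked into any C/C++ project without a Python runtime. */
#include <cuda.h>

/* Embedded kernel image produced at compile time (placeholder names). */
extern const unsigned char add_kernel_image[];
extern const unsigned long long add_kernel_image_len;

/* Plain C entry point; grid sizes and kernel arguments are explicit. */
CUresult add_kernel_launch(CUstream stream,
                           unsigned int grid_x, unsigned int grid_y, unsigned int grid_z,
                           CUdeviceptr x, CUdeviceptr y, CUdeviceptr out,
                           int n_elements);
```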
We have a prototype that works with an old version of Triton. You might be able to hack it for your needs? #490
Great thanks @gaxler, will give it a go! For the main feature is there any WIP branch that can be tracked or is it separate from the main repo?
@gaxler should there be a correlation between the triton BLOCK_SIZE defined in the kernel definition, and the gX, gY, gZ defined in GridWarps when calling the kernel?
You mean add grid size constraints at compile time?
In general I avoided dealing with anything related to kernel launches in the draft PR; it's all just placeholders to make it run.
Great thanks! I now have it working but have noticed the performance is much worse than the JIT triton equivalent. From the profile trace I see large gaps between the triton kernel and the preceding/successive kernels.
I am aware you are not actively maintaining this, but was just wondering if this was expected, or if you had any hints? I am not that familiar with PTX, but I understand it is JIT compiled, so was wondering if it was not being cached correctly or something like that.
Sorry that you have to bump into all those things. This is just a POC and in no way optimized. Thanks for profiling the generated code!!
Probably the worst thing for the C code performance is the PTX: it gets compiled to binary every time you call a kernel. This will be replaced by a cubin.
Another overhead might be the dispatch for different input sizes; not sure how significant it is for overall performance.
Perhaps you can use several CUDA streams to bypass those issues?
If I know my target hardware a priori, are there any downsides/gotchas to dumping the PTX code to a file, compiling it down to a cubin, and loading that instead? Could that potentially help with the overheads?
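For context, the workflow I have in mind is roughly this (a sketch only; `kernel.ptx`, `kernel.cubin`, `add_kernel`, and `sm_80` are placeholders for my actual target):

```c
/* Sketch, assuming the target arch is known ahead of time (sm_80 here) and
 * the PTX has been dumped to kernel.ptx:
 *
 *   ptxas -arch=sm_80 kernel.ptx -o kernel.cubin
 *
 * Then load the cubin with the driver API instead of handing PTX to the
 * module loader (which triggers a JIT compile of the PTX each time).
 * File and kernel names below are placeholders. */
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* cuModuleLoad reads the cubin from disk; no PTX JIT step at load time. */
    if (cuModuleLoad(&mod, "kernel.cubin") != CUDA_SUCCESS) {
        fprintf(stderr, "failed to load kernel.cubin\n");
        return 1;
    }
    cuModuleGetFunction(&fn, mod, "add_kernel");

    /* ... set up arguments and call cuLaunchKernel with fn ... */

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```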
Converting to cubin has helped a lot! (in the trace the triton kernel is the one that sits between the orange and green)
[Profile trace screenshots: JIT, AOT - PTX, AOT - cubin]
Whilst the overhead is now much smaller, there is still a gap in utilization before and after the AOT triton kernel is run (perhaps there is some implicit synchronisation happening).
Regarding your suggestion about the dispatch time, I am guessing that could result in a delay on the host thread, but as long as the kernel is launched sufficiently before the device is ready to execute it (which we are pretty sure is the case here), that cost should be hidden?
EDIT: I now think the overheads might be related to the module loading, need to confirm
Assuming the _tr... entry in the trace is the triton JITFunction for JIT, and the launch function from the generated C code for AOT.
I think you are correct.
The JITFunction does the module and function loading before it calls the launch code. For the generated C code, each call loads the module and the CUfunction.
Thanks for doing this, this will be helpful when thinking about optimizing the generated code!
Tried caching the loaded CUfunction and things are now looking very close to JIT performance (only 5-10% slower now) 🙂
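For reference, the caching is roughly this shape (a minimal sketch with made-up names and a simplified launch signature, not the actual generated code; a real version would also need to guard the initialization for thread safety):

```c
/* Cache the loaded module/function so repeated launches skip
 * cuModuleLoad/cuModuleGetFunction. Names are placeholders. */
#include <cuda.h>
#include <stddef.h>

static CUmodule cached_mod = NULL;
static CUfunction cached_fn = NULL;

static CUresult get_add_kernel(CUfunction *out) {
    CUresult err;
    if (cached_fn == NULL) {
        /* First call: load the cubin and look up the kernel once. */
        err = cuModuleLoad(&cached_mod, "kernel.cubin");
        if (err != CUDA_SUCCESS) return err;
        err = cuModuleGetFunction(&cached_fn, cached_mod, "add_kernel");
        if (err != CUDA_SUCCESS) return err;
    }
    *out = cached_fn;
    return CUDA_SUCCESS;
}

CUresult add_kernel_launch(CUstream stream, unsigned int grid_x,
                           void **kernel_args) {
    CUfunction fn;
    CUresult err = get_add_kernel(&fn);
    if (err != CUDA_SUCCESS) return err;
    /* Launch with the cached function; the block size (128) is just an example. */
    return cuLaunchKernel(fn, grid_x, 1, 1, 128, 1, 1,
                          0 /* shared mem bytes */, stream, kernel_args, NULL);
}
```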
Got a new prototype together, maybe this can help in some way: https://github.com/openai/triton/pull/1056
Thanks, will check it out
Do you know how close it is to being merged? (just trying to gauge whether I should wait - or working from the branch)
It's pretty close, but there are other things that have priority over merging it, so the branch will be better. I'm happy to help; it will be great to get user feedback.
@gaxler what is the relationship between this branch and aot.py on master? Will they both continue to exist after this branch is complete?