[FRONTEND] Enable debug mode only when the options say so
Prior to this commit, the `make_*` functions in the `CUDABackend` enabled debug mode unconditionally. This led to slower-than-necessary compile times, even when the caller passed `CUDAOptions` with `debug=False`.
I mirrored this change in `HIPBackend` for consistency.
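For illustration, here is a minimal sketch of the shape of the change; the import path and the single pass shown are assumptions, and the real `make_*` pipelines contain more passes:

```python
# Hypothetical sketch: gate the pass manager's debug mode on the caller's
# options instead of enabling it unconditionally. Import path and pass names
# are assumptions; the actual make_* functions build longer pipelines.
from triton._C.libtriton import ir, passes

def make_ttir(mod, metadata, opt):
    pm = ir.pass_manager(mod.context)
    if opt.debug:  # previously pm.enable_debug() ran unconditionally
        pm.enable_debug()
    passes.common.add_inliner(pm)  # ...remaining passes unchanged...
    pm.run(mod)
    return mod
```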
When will `options.debug` be `True`?
I am using Triton via `jax_triton`, so I am not aware of the idiomatic Triton way of doing that. In `jax_triton`, users can construct `cb.CUDAOptions(..., debug=True)` and pass that to each of the `cb.CUDABackend.make_*` functions.
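For concreteness, a hedged sketch of that usage; the import path behind the `cb` alias depends on the Triton version and is an assumption here:

```python
# Sketch of the jax_triton-style usage described above. The module alias "cb"
# mirrors the comment above; its import path is an assumption.
from triton.backends.nvidia import compiler as cb  # assumed path

opts = cb.CUDAOptions(debug=True)  # explicitly opt into debug mode
# opts is then passed to each lowering stage, e.g.:
# ttir = cb.CUDABackend.make_ttir(mod, metadata, opts)
```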
> I am using Triton via `jax_triton`, so I am not aware of the idiomatic Triton way of doing that. In `jax_triton`, users can construct `cb.CUDAOptions(..., debug=True)` and pass that to each of the `cb.CUDABackend.make_*` functions.
I'll think about how to refactor this. The debugging information was left on purpose. Some other users may want the debugging information there by default.
Thanks for reporting the increase in compilation time. That's definitely a concern.
I appreciate that some users might prefer to see debug information, but I would recommend making debugging opt-in and documenting the process of enabling it on RTD. It could be as easy as setting an env variable.
To add a bit more context, we observe very slow compilation times internally with debug mode on and llvm/llvm-project#78228. `LLVMTranslationInterface` emits a diagnostic for each `FuncOp` parameter (because of the `tt.divisibility` attribute), and diagnostics are expensive with `printStackTraceOnDiagnostic` enabled.
> I appreciate that some users might prefer to see debug information, but I would recommend making debugging opt-in and documenting the process of enabling it on RTD. It could be as easy as setting an env variable.
It will need more work to figure out the names of each "debug" mode. Right now `TRITON_DEBUG` is mainly used to enable/disable asserts only, and there are many additional debug flags in Triton for different purposes.
How about `TRITON_DEBUG_COMPILE`, or even just re-using `TRITON_DEBUG`, to decide whether to enable debug mode in the pass manager?
> How about `TRITON_DEBUG_COMPILE`, or even just re-using `TRITON_DEBUG`, to decide whether to enable debug mode in the pass manager?
I think we do want to separate these debugging modes. For example, it doesn't make sense for a user who only wants lineinfo to also get device asserts added by the compiler, which is what happens if the debugging modes are mixed.
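As a sketch of the opt-in idea being discussed (`TRITON_DEBUG_COMPILE` is only a name proposed in this thread, not an existing flag):

```python
# Sketch only: a dedicated environment variable controlling compile-time debug
# mode, kept separate from TRITON_DEBUG (which mainly toggles device asserts).
# TRITON_DEBUG_COMPILE is a proposed name from this discussion, not a real knob.
import os

def compile_debug_enabled(options) -> bool:
    env = os.environ.get("TRITON_DEBUG_COMPILE", "0").strip().lower()
    return bool(getattr(options, "debug", False)) or env in ("1", "true", "on")
```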
We also noticed this issue on our internal version of Triton proper, running a newer LLVM version. `test_core.py` crashes after trying to emit too many warnings about `tt.divisibility` being unhandled (coming from here: https://github.com/llvm/llvm-project/commit/fa6850a9981b65972294e13021f82b96d460b3ec#diff-c255e7b32bf400dde3e5bc44ba9aaf2e8789f02a70cf080de1877f925accd45bR117). This will become an issue here as well once we want to update to a newer LLVM version.
> We also noticed this issue on our internal version of Triton proper, running a newer LLVM version. `test_core.py` crashes after trying to emit too many warnings about `tt.divisibility` being unhandled (coming from here: https://github.com/llvm/llvm-project/commit/fa6850a9981b65972294e13021f82b96d460b3ec#diff-c255e7b32bf400dde3e5bc44ba9aaf2e8789f02a70cf080de1877f925accd45bR117). This will become an issue here as well once we want to update to a newer LLVM version.
If only the `tt.divisibility` attribute is a problem, can we implement `LLVMTranslationDialectInterface` and then just emit `success()` when encountering `tt.divisibility`?
Alternatively, we can remove `tt.divisibility` during the conversion from `Triton::FuncOp` to `LLVM::FuncOp`. Let me know if either of these solutions makes sense :)
The "fix" I used internally for now is https://github.com/openai/triton/commit/e3259831bfcbaef86143524bab1ce05bf5186387. Would this be something you're interested in upstreaming? This allows us to also speed up compile-times since we're not processing all of these warnings unless requested by the user, and can use multithreading.
But we should probably do both: what you suggest, i.e. do something with `tt.divisibility` so it is not emitted, and the commit I linked. Sergey also has a thread on Slack to discuss what the best thing to do with `tt.divisibility` would be here.
The "fix" I used internally for now is https://github.com/openai/triton/commit/e3259831bfcbaef86143524bab1ce05bf5186387. Would this be something you're interested in upstreaming?
Seems good to me!
The "fix" I used internally for now is e325983. Would this be something you're interested in upstreaming? This allows us to also speed up compile-times since we're not processing all of these warnings unless requested by the user, and can use multithreading.
But probably we should do both what you suggest, and do something with
tt.divisibility
to not emit it, and the commit I linked. Sergey also has a thread on Slack to discuss what the best thing would be to do withtt.divisibility
here.
Is the main problem that the recent change https://github.com/openai/triton/commit/48034034c86510a64d99a6bbf8ebcf0f1487681c started disabling multithreading all the time, instead of only when dumping? I'm guessing that's what is causing the performance problem; we should definitely fix it.
#3147 puts diagnostics behind an environment variable
@gflegar's PR should solve the problem. I'm closing this PR; feel free to reopen it if you think something is still needed.