MIOpen
[MIOpenDriver] Print out perf config after MIOpenDriver failed after tuning
This is a debugging ticket brought up by @krzysz00 from the MLIR team.
When using MIOpenDriver to help MLIR tune different configs, it is common for MIOpen to pick a config other than the heuristic config and validate the result. Occasionally, when a different config generates a kernel that malfunctions, this results in a failure in the MLIR CI. When that happens, it is hard for a developer who does not know much about MIOpen to reproduce the failure.
To improve this situation, we could implement either of the following proposals:
- Add an input argument to MIOpenDriver that controls the debug print to stdout
- Make the validation code carry state information, such that if a validation error comes up after tuning has been done, the debug information is printed to stdout
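The second proposal could be sketched roughly as follows. This is a minimal illustration only; the class and method names here are hypothetical and are not MIOpen's actual API:

```cpp
#include <iostream>
#include <string>

// Hypothetical sketch of a tuning-aware verifier: it remembers whether
// a tuning pass has run, and on a validation failure after tuning it
// prints the winning perf config so the failure is reproducible.
struct TuningAwareVerifier {
    bool tuned = false;            // set once tuning has completed
    std::string last_perf_config;  // config chosen by tuning

    void OnTuningDone(const std::string& cfg) {
        tuned = true;
        last_perf_config = cfg;
    }

    // Returns false on a validation failure; if tuning ran first,
    // dump the perf config that produced the bad kernel.
    bool Verify(bool results_match) {
        if (!results_match && tuned) {
            std::cerr << "Verification FAILED after tuning; perf config: "
                      << last_perf_config << "\n";
        }
        return results_match;
    }
};
```

The point is only that the validation path would need access to tuning state that it does not currently carry.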
it is common for MIOpen to pick up a different config than the heuristic config and validate the result. Occasionally when a different config generates a kernel that malfunctions,
MIOpen validates each TuningConfig by means of Solver::IsValidPerformanceConfig(). It is the Solver that must guarantee that a validated TuningConfig produces good kernels. So the root of the problem is a bug in the Solver.
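The contract described above can be illustrated with a toy sketch (FakeSolver, PickBestConfig, and the validity rule are all made up for illustration; only the shape of the loop mirrors what is described):

```cpp
#include <limits>
#include <vector>

// Illustrative stand-in for a solver: the tuner only benchmarks
// configs that the solver declares valid, so a bad kernel from a
// "valid" config is a solver bug, not a tuner bug.
struct FakeSolver {
    // Stand-in for Solver::IsValidPerformanceConfig().
    bool IsValidPerformanceConfig(int cfg) const { return cfg % 2 == 0; }
    // Stand-in benchmark: lower time is better.
    double Benchmark(int cfg) const { return 100.0 - cfg; }
};

// Returns the fastest valid config, or -1 if none is valid.
int PickBestConfig(const FakeSolver& solver, const std::vector<int>& space) {
    int best = -1;
    double best_time = std::numeric_limits<double>::infinity();
    for (int cfg : space) {
        if (!solver.IsValidPerformanceConfig(cfg))
            continue;  // invalid configs are never benchmarked
        double t = solver.Benchmark(cfg);
        if (t < best_time) { best_time = t; best = cfg; }
    }
    return best;
}
```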
When this happens, it is hard for a developer who doesn't know much about MIOpen to reproduce this failure.
Indeed, handling solver bugs requires knowledge of MIOpen internals :(
Add a input argument for MIOpenDriver to control the debug print to stdout
The Driver is unaware of tuning, so this won't help. However, exporting MIOPEN_LOG_LEVEL=6 (or even MIOPEN_LOG_LEVEL=5) is often enough to identify the failing TuningConfig.
However, it may be that I misunderstand the problem. Could you describe the use case in more detail? For example, list the exact actions of a developer who occasionally runs into this.
...it is hard ... to reproduce this failure.
Maybe the key question is: how do you usually reproduce? An answer to this question would help me better understand how we can deliver the necessary info to the user.
What we normally do (when we need to reproduce) is simply re-running the failing test on the local machine (with some instrumentation like logging etc).
How I'd like to reproduce the case where tuning generates a kernel that then has incorrect results:
- Look at the failure log for a failing kernel config
- Copy out the conv_config parameter that was passed to MLIR
- Run ./bin/miopen-gen "--conv-config=[whatever the failing configuration was]" -pv_with_gpu | ./bin/mlir-miopen-driver -c | rocm-run to confirm the failure (possibly with -pv instead of -pv_with_gpu, and so on)
What I currently have to do to reproduce a failure that arises after tuning:
- As before, look at the logs, but now searching for the arguments passed to MIOpenDriver
- Re-compile llvm-project-mlir in a static library configuration
- Copy the static library out to MIOpen's dependencies directory
- Build MIOpen
- Run the failing tuning config again with MIOPEN_LOG_LEVEL=6
- Get the failing kernel config from a very noisy log
- Recompile MLIR into a shared library configuration to enable running internal tests
- Run ./bin/miopen-gen "--conv_config=[...]" -pv_with_gpu | ... as before
A whole lot of those steps are "toss out the build directory"-type recompiles, so a good several minutes each. Furthermore, since failures that only crop up at tuning are rare, none of this is scripted, so I can't even go have lunch in the middle of the process.
From the reproduce instructions I do not see how MIOpenDriver is engaged in this problem (and consequently, how modifying MIOpenDriver can help). The procedure does not even mention the driver, except for "...but now searching for the arguments passed to MIOpenDriver". Is it so that ./bin/mlir-miopen-driver invokes MIOpenDriver?
Hopefully the problem can be resolved at the mlir harness level. If not, then let's book a meeting and discuss.
Please also set the urgency and value labels for this ticket. Thanks!
/cc @junliume
I'm afraid I left out some context. We call MIOpenDriver to test our tuning support and to make sure we can be called from MIOpen; the CI eventually runs https://github.com/ROCmSoftwarePlatform/llvm-project-mlir/blob/miopen-dialect/mlir/utils/jenkins/miopen-tests/miopen_validate.sh to call MIOpenDriver.
mlir-miopen-driver is an unrelated tool that runs the code generation pipeline.
The quick solution could be like this: add to MIOpen an env var that enables logging of PerfConfigs in MLIR solvers. Use log level 4 (Warning). Logging should happen in GetSolution(). Use that var in your script. The recommended name for the var is MIOPEN_DEBUG_CONV_MLIR_LOG_TUNING_CONFIGS. Please try it and let me know if it works for you.
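A minimal sketch of such a gate, using the variable name from this thread (the helper functions themselves are hypothetical and do not use MIOpen's actual env-var or logging machinery):

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Returns true when the user has opted in to logging MLIR tuning
// configs via the env var suggested in this thread.
bool MlirTuningLogEnabled() {
    const char* v = std::getenv("MIOPEN_DEBUG_CONV_MLIR_LOG_TUNING_CONFIGS");
    return v != nullptr && std::string(v) != "0";
}

// Would be called from something like GetSolution(): log the perf
// config only when the gate is enabled, so normal runs stay quiet.
void MaybeLogPerfConfig(const std::string& perf_config) {
    if (MlirTuningLogEnabled())
        std::cerr << "MLIR perf config: " << perf_config << "\n";
}
```

A script could then set the variable before invoking MIOpenDriver and grep the log for the printed configs.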
@jerryyin
This is a usability ticket
I think this is about debugging.
The Driver is unaware of tuning, so this won't help. However exporting MIOPEN_LOG_LEVEL=6 (or even MIOPEN_LOG_LEVEL=5) is often enough to identify the failing TuningConfig.
Hmmm, that's right. I just realized that perf_config is an MIOpen-internal thing, and that MIOpenDriver only calls the MIOpen API like a standalone application.
I think this is about debugging.
Updated the ticket description.
...add to MIOpen an env var that enables logging of PerfConfigs in MLIR solvers...
This seems like a reasonable way to do it. Once it is done, I believe the behavior will be:
- At the tuning stage, every valid perf_config will be printed
- At the testing stage, if a config fails, the last printed config is the fastest one picked by tuning (which the developer can then use directly)
Since this isn't the highest-priority thing, I am putting it on the back burner for now.
@jerryyin Has this been resolved with latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!