spark-rapids
[FEA] Heap dump/stack trace on OOM logging policy
Emitting heap dumps and stack traces on GPU OOM helps narrow down memory misuse. How many stack traces and heap dumps to output is not a clear choice. Is this debug dumping needed only once? Do we want to do it each time an OOM is detected? Something in between?
This issue is to propose a policy we can use for these debug tools. Ideally the same policy applies to both heap dumps and stack traces, so the behavior makes sense to the end user.
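To make the options concrete, here is a minimal sketch of one possible policy: a single shared budget that caps how many OOM events trigger debug output. Everything in it is hypothetical and for discussion only; `OomDumpPolicy`, `maxDumps`, and `shouldDump` are illustrative names, not existing spark-rapids classes or configs. A budget of 1 gives "dump once", and a very large budget approximates "dump on every OOM".

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of spark-rapids: one shared budget that both
// the heap dump and the stack trace paths would consult on each GPU OOM.
final class OomDumpPolicy {
    private final int maxDumps;                 // assumed to come from a user-facing setting
    private final AtomicInteger dumpsTaken = new AtomicInteger();

    OomDumpPolicy(int maxDumps) {
        if (maxDumps < 0) {
            throw new IllegalArgumentException("maxDumps must be >= 0");
        }
        this.maxDumps = maxDumps;
    }

    // Returns true if this OOM event should produce debug output.
    // Thread-safe; the counter never grows past maxDumps, so it cannot overflow.
    boolean shouldDump() {
        while (true) {
            int used = dumpsTaken.get();
            if (used >= maxDumps) {
                return false;                   // budget exhausted, stay quiet
            }
            if (dumpsTaken.compareAndSet(used, used + 1)) {
                return true;                    // claimed one slot of the budget
            }
        }
    }
}
```

With this shape, the OOM handler would call `shouldDump()` once per event and, if it returns true, emit both the stack trace and the heap dump, so the two tools always follow the same rule.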