kyuubi
kyuubi copied to clipboard
[KYUUBI #5402] Introduce Spark JVM quake plugin
:mag: Description
Issue References 🔗
This pull request fixes #5402
Describe Your Solution 🔧
When facing out-of-control memory management in Spark engine, we typically use JVMkill as a remedy by killing the process and generating a heap dump for post-analysis. However, even with jvmkill protection, we may still encounter issues caused by JVM running out of memory, such as repeated execution of Full GC without performing any useful work during the pause time. Since the JVM does not exhaust 100% of resources, JVMkill will not be triggered.
So introducing JVMQuake provides more granular monitoring of GC behavior, enabling early detection of memory management issues and facilitating fast failure. You can use the following configuration to enable jvmQuake plugins:
spark.plugins=org.apache.spark.kyuubi.jvm.quake.KyuubiJVMQuakePlugin
| configuration | default | comment |
|---|---|---|
| spark.driver.jvmQuake.enabled | false | when true, enable driver jvmQuake |
| spark.executor.jvmQuake.enabled | false | when true, enable executor jvmQuake |
| spark.driver.jvmQuake.heapDump.enabled | false | when true, enable jvm heap dump when jvmQuake rearch the threshold |
| spark.executor.jvmQuake.heapDump.enabled | false | when true, enable jvm heap dump when jvmQuake rearch the threshold |
| spark.jvmQuake.dumpThreshold | 100 | The number of seconds to dump memory |
| spark.jvmQuake.killThreshold | 200 | The number of seconds to kill process |
| spark.jvmQuake.exitCode | 502 | The exit code of kill process |
| spark.jvmQuake.heapDumpPath | /tmp/kyuubi_jvm_quake/apps | The path of heap dump |
| spark.jvmQuake.checkInterval | 3 | The number of seconds to check jvmQuake |
| spark.jvmQuake.runTimeWeight | 1.0 | The weight of rum time |
Types of changes :bookmark:
- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
Test Plan 🧪
Behavior Without This Pull Request :coffin:
Behavior With This Pull Request :tada:
Related Unit Tests
Checklist 📝
- [x] This patch was not authored or co-authored using Generative Tooling
Be nice. Be informative.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 0.00%. Comparing base (
96ec132) to head (84361ce). Report is 12 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #6572 +/- ##
======================================
Coverage 0.00% 0.00%
======================================
Files 680 680
Lines 42045 42045
Branches 5739 5739
======================================
Misses 42045 42045
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Can you help review the code @iodone
@Yoock we need to add some user documentation to explain how to use this extension
@yoock we need to add some user documentation to explain how to use this extension
User instructions have been added
do you have any other questions? @yikf
thanks @yoock for making this feature and ping, look good to me overall.
I commented a few minor questions and if there are no more I think we can move on to the next step.
I have made modifications based on your suggestion and will see if there are any other suggestions @yikf
Do you have any other questions? @bowenliang123
Could you please help approve again? I accidentally initiated a request review earlier @bowenliang123
Do you have any other questions? @pan3793
when I searched the JVM Quake on Google and GitHub, I reached https://github.com/Netflix-Skunkworks/jvmquake, seems the proposed plugin uses a completely different mechanism and is also named JVM Quake, will this make ambiguous?
when I searched the JVM Quake on Google and GitHub, I reached https://github.com/Netflix-Skunkworks/jvmquake, seems the proposed plugin uses a completely different mechanism and is also named JVM Quake, will this make ambiguous?
@pan3793 After opening it up for a while, it is similar to our JVM Quake in that it detects unhealthy JVMs and then kills them. The algorithm is also similar, so it is more appropriate to call it Spark JVM Quake
Are there any questions? @pan3793
@yoock sorry for forgetting this one, it's already in good shape, let me merge it first.