kyuubi icon indicating copy to clipboard operation
kyuubi copied to clipboard

[KYUUBI #5402] Introduce Spark JVM quake plugin

Open yoock opened this issue 1 year ago • 12 comments

:mag: Description

Issue References 🔗

This pull request fixes #5402

Describe Your Solution 🔧

When facing out-of-control memory management in Spark engine, we typically use JVMkill as a remedy by killing the process and generating a heap dump for post-analysis. However, even with jvmkill protection, we may still encounter issues caused by JVM running out of memory, such as repeated execution of Full GC without performing any useful work during the pause time. Since the JVM does not exhaust 100% of resources, JVMkill will not be triggered.

So introducing JVMQuake provides more granular monitoring of GC behavior, enabling early detection of memory management issues and facilitating fast failure. You can use the following configuration to enable jvmQuake plugins:

spark.plugins=org.apache.spark.kyuubi.jvm.quake.KyuubiJVMQuakePlugin
configuration default comment
spark.driver.jvmQuake.enabled false when true, enable driver jvmQuake
spark.executor.jvmQuake.enabled false when true, enable executor jvmQuake
spark.driver.jvmQuake.heapDump.enabled false when true, enable jvm heap dump when jvmQuake rearch the threshold
spark.executor.jvmQuake.heapDump.enabled false when true, enable jvm heap dump when jvmQuake rearch the threshold
spark.jvmQuake.dumpThreshold 100 The number of seconds to dump memory
spark.jvmQuake.killThreshold 200 The number of seconds to kill process
spark.jvmQuake.exitCode 502 The exit code of kill process
spark.jvmQuake.heapDumpPath /tmp/kyuubi_jvm_quake/apps The path of heap dump
spark.jvmQuake.checkInterval 3 The number of seconds to check jvmQuake
spark.jvmQuake.runTimeWeight 1.0 The weight of rum time

Types of changes :bookmark:

  • [ ] Bugfix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request :coffin:

Behavior With This Pull Request :tada:

Related Unit Tests


Checklist 📝

Be nice. Be informative.

yoock avatar Jul 31 '24 03:07 yoock

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 0.00%. Comparing base (96ec132) to head (84361ce). Report is 12 commits behind head on master.

Additional details and impacted files
@@          Coverage Diff           @@
##           master   #6572   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         680     680           
  Lines       42045   42045           
  Branches     5739    5739           
======================================
  Misses      42045   42045           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Jul 31 '24 04:07 codecov-commenter

Can you help review the code @iodone

yoock avatar Aug 01 '24 01:08 yoock

@Yoock we need to add some user documentation to explain how to use this extension

iodone avatar Aug 01 '24 02:08 iodone

@yoock we need to add some user documentation to explain how to use this extension

User instructions have been added

yoock avatar Aug 01 '24 02:08 yoock

do you have any other questions? @yikf

yoock avatar Aug 05 '24 06:08 yoock

thanks @yoock for making this feature and ping, look good to me overall.

I commented a few minor questions and if there are no more I think we can move on to the next step.

yikf avatar Aug 05 '24 11:08 yikf

I have made modifications based on your suggestion and will see if there are any other suggestions @yikf

yoock avatar Aug 06 '24 12:08 yoock

Do you have any other questions? @bowenliang123

yoock avatar Aug 07 '24 12:08 yoock

Could you please help approve again? I accidentally initiated a request review earlier @bowenliang123

yoock avatar Aug 07 '24 12:08 yoock

Do you have any other questions? @pan3793

yoock avatar Aug 08 '24 02:08 yoock

when I searched the JVM Quake on Google and GitHub, I reached https://github.com/Netflix-Skunkworks/jvmquake, seems the proposed plugin uses a completely different mechanism and is also named JVM Quake, will this make ambiguous?

pan3793 avatar Aug 22 '24 09:08 pan3793

when I searched the JVM Quake on Google and GitHub, I reached https://github.com/Netflix-Skunkworks/jvmquake, seems the proposed plugin uses a completely different mechanism and is also named JVM Quake, will this make ambiguous?

@pan3793 After opening it up for a while, it is similar to our JVM Quake in that it detects unhealthy JVMs and then kills them. The algorithm is also similar, so it is more appropriate to call it Spark JVM Quake

image

image

yoock avatar Aug 23 '24 03:08 yoock

Are there any questions? @pan3793

yoock avatar Sep 02 '24 01:09 yoock

@yoock sorry for forgetting this one, it's already in good shape, let me merge it first.

pan3793 avatar Sep 02 '24 04:09 pan3793