yunikorn-core
yunikorn-core copied to clipboard
[YUNIKORN-1213] Configurable Health Checker Interval
What is this PR for?
The changes contained here allow for the configuration of the Health Check interval as well as the ability to disable the health checks outright.
The format for the ConfigMap
is as follows:
scheduler:
heathcheck:
enabled: true
interval: 30s
partitions:
...
What type of PR is it?
- [ ] - Bug Fix
- [ ] - Improvement
- [x] - Feature
- [ ] - Documentation
- [ ] - Hot Fix
- [ ] - Refactoring
Todos
- [x] - Extend
Config
with aHealthCheck
subsection underScheduler
. - [x] - Extend
health_checker
to support custom intervals and disabling of the health checks. - [ ] - Extend
health_checker
with messages to indicate disabled health checks. This is important so that a disabled health check is not confused with a more serious issue. - [ ] - RESTful
ConfigMap
updates should be able to update runtime configurations. - [x] - Fix
nil
pointer dereference caused byinit
passing anil
context intoNewHealthChecker
by setting to default values ifcontext
isnil
. - [ ] - Documentation updates.
What is the Jira issue?
How should this be tested?
Tests for all changes have been added to this PR. There is a need for additional deployment testing.
Questions:
- [x] - How would we like to handle the functions which return health check information? I can create an
isCheckEnabled
method that takes a*ClusterContext
and return abool
based on the settings in the configs. The issue is that not all the methods take a*ClusterContext
as a parameter.
hi @surahman a few high-level feedback:
- Instead of the following config
healthcheck:
enabled: %s
interval: 99s
partitions:
- name: default
queues:
- name: root
what was discussed looks like the following:
scheduler:
healthcheck:
enabled: true
interval: 30s
partitions:
- name: default
queues:
- name: root
We want to include all "global" (non-partition-specific) configs under a single parent "scheduler", including "checksum". I suggest using another JIRA to track the changes to the "checksum", just to make the PR smaller.
- Please create a JIRA to track the document changes.
- Please make sure the test case includes both having healthcheck section configured and not having this section.
Codecov Report
Merging #412 (d192d7e) into master (3ba91fb) will increase coverage by
0.09%
. The diff coverage is93.02%
.
@@ Coverage Diff @@
## master #412 +/- ##
==========================================
+ Coverage 69.09% 69.18% +0.09%
==========================================
Files 67 67
Lines 9654 9676 +22
==========================================
+ Hits 6670 6694 +24
+ Misses 2740 2738 -2
Partials 244 244
Impacted Files | Coverage Δ | |
---|---|---|
pkg/common/configs/config.go | 76.66% <ø> (ø) |
|
pkg/scheduler/scheduler.go | 0.00% <0.00%> (ø) |
|
pkg/scheduler/health_checker.go | 86.63% <95.23%> (+1.43%) |
:arrow_up: |
pkg/scheduler/objects/application.go | 58.77% <0.00%> (-0.15%) |
:arrow_down: |
pkg/scheduler/context.go | 31.61% <0.00%> (+0.60%) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 3ba91fb...d192d7e. Read the comment docs.
- Instead of the following config
healthcheck: enabled: %s interval: 99s partitions: - name: default queues: - name: root
what was discussed looks like the following:
scheduler: healthcheck: enabled: true interval: 30s partitions: - name: default queues: - name: root
We want to include all "global" (non-partition-specific) configs under a single parent "scheduler", including "checksum". I suggest using another JIRA to track the changes to the "checksum", just to make the PR smaller.
Thank you for pointing this out, it was a misunderstanding on my part 🤦🏼♂️.
I have created a YUNIKORN-1220 for the checksum
.
- Please create a JIRA to track the document changes.
- Please make sure the test case includes both having healthcheck section configured and not having this section.
I did not include a test case for this scenario because this test should cover the default condition: https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker_test.go#L64-L76
I can see the need for an explicit test to catch the case though and have added one.
I have examined all the functions for metrics generation in the HealthCheck
and they are all called locally from the runOnce
routine and the call tree that stems there. Given this, we can probably leave the routines as-is and add log messages at the Warn
level to indicate that health metric reporting is disabled.
There are also two methods that are exported and only ever called locally and I wonder if we can switch them to not be exported? https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L129 https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L146
What are your thoughts @yangwwei and @craigcondit?
I have examined all the functions for metrics generation in the
HealthCheck
and they are all called locally from therunOnce
routine and the call tree that stems there. Given this, we can probably leave the routines as-is and add log messages at theWarn
level to indicate that health metric reporting is disabled.
Let's not log anything if the reporting is disabled -- if it is, that's a conscious decision on the admin's part so let's not punish him by spamming the logs.
There are also two methods that are exported and only ever called locally and I wonder if we can switch them to not be exported?
https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L129
https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L146
Making these private should be fine.
CC @wilfred-s -- Wilfred, can you give some input on direction for this as well?
REST runtime Config
updates
Updating the Config
object should not represent an issue. updateClusterConfig
will generate the parsed updated Config
into a struct and then reading/updating values is straight forward.
The issue arises with the HealthCheck
daemon process and updating its setting. I was unable to find another daemon process which allows its configuration updated via the Config
. The following is my proposed solution after a brief look:
- Load and read the
enabled
andinterval
directly from theConfig
instead of the local data structure whenever needed. My concern with this is performance. Would it be significantly slower to get and read fromConfig
compared to data members in the object? - The
HealthChecker
process would need to run and poll theConfig
for updates periodically.runOnce
would not execute ifenabled=false
. - Another
interval
setting for the daemon to poll theConfig
for updates? Or do we set/use theHealthCheck
interval
with another default check value to poll for updates ifenabled=false
? We can poll for updates at the regularinerval
ifenabled=true
. -
stop
will remain as it initially was because theHealthCheck
daemon process will consistently be running. - I am unsure if the primary purpose of disabling the
HealthCheck
was for performance/overhead but only running the daemon would be lightweight.
Another route would be to leverage RPCs or some sort of Observer pattern with chan
s to propagate updates.
hi @surahman , @craigcondit can we address the hot-refresh in the next JIRA? In this one, let's focus on getting this configurable. @wilfred-s need your input here about the format of the config file, we need to get to a consensus so @surahman can move forward. thanks!