yunikorn-core icon indicating copy to clipboard operation
yunikorn-core copied to clipboard

[YUNIKORN-1213] Configurable Health Checker Interval

Open surahman opened this issue 2 years ago • 8 comments

What is this PR for?

The changes contained here allow for the configuration of the Health Check interval as well as the ability to disable the health checks outright.

The format for the ConfigMap is as follows:

scheduler:
  heathcheck:
    enabled: true
    interval: 30s
  partitions:
...

What type of PR is it?

  • [ ] - Bug Fix
  • [ ] - Improvement
  • [x] - Feature
  • [ ] - Documentation
  • [ ] - Hot Fix
  • [ ] - Refactoring

Todos

  • [x] - Extend Config with a HealthCheck subsection under Scheduler.
  • [x] - Extend health_checker to support custom intervals and disabling of the health checks.
  • [ ] - Extend health_checker with messages to indicate disabled health checks. This is important so that a disabled health check is not confused with a more serious issue.
  • [ ] - RESTful ConfigMap updates should be able to update runtime configurations.
  • [x] - Fix nil pointer dereference caused by init passing a nil context into NewHealthChecker by setting to default values if context is nil.
  • [ ] - Documentation updates.

What is the Jira issue?

How should this be tested?

Tests for all changes have been added to this PR. There is a need for additional deployment testing.

Questions:

  • [x] - How would we like to handle the functions which return health check information? I can create an isCheckEnabled method that takes a *ClusterContext and return a bool based on the settings in the configs. The issue is that not all the methods take a *ClusterContext as a parameter.

surahman avatar May 19 '22 02:05 surahman

hi @surahman a few high-level feedback:

  1. Instead of the following config
healthcheck:
  enabled: %s
  interval: 99s
partitions:
  - name: default
    queues:
      - name: root

what was discussed looks like the following:

scheduler:
  healthcheck:
    enabled: true
    interval: 30s
partitions:
  - name: default
     queues:
        - name: root

We want to include all "global" (non-partition-specific) configs under a single parent "scheduler", including "checksum". I suggest using another JIRA to track the changes to the "checksum", just to make the PR smaller.

  1. Please create a JIRA to track the document changes.
  2. Please make sure the test case includes both having healthcheck section configured and not having this section.

yangwwei avatar May 19 '22 05:05 yangwwei

Codecov Report

Merging #412 (d192d7e) into master (3ba91fb) will increase coverage by 0.09%. The diff coverage is 93.02%.

@@            Coverage Diff             @@
##           master     #412      +/-   ##
==========================================
+ Coverage   69.09%   69.18%   +0.09%     
==========================================
  Files          67       67              
  Lines        9654     9676      +22     
==========================================
+ Hits         6670     6694      +24     
+ Misses       2740     2738       -2     
  Partials      244      244              
Impacted Files Coverage Δ
pkg/common/configs/config.go 76.66% <ø> (ø)
pkg/scheduler/scheduler.go 0.00% <0.00%> (ø)
pkg/scheduler/health_checker.go 86.63% <95.23%> (+1.43%) :arrow_up:
pkg/scheduler/objects/application.go 58.77% <0.00%> (-0.15%) :arrow_down:
pkg/scheduler/context.go 31.61% <0.00%> (+0.60%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 3ba91fb...d192d7e. Read the comment docs.

codecov[bot] avatar May 19 '22 16:05 codecov[bot]

  1. Instead of the following config
healthcheck:
  enabled: %s
  interval: 99s
partitions:
  - name: default
    queues:
      - name: root

what was discussed looks like the following:

scheduler:
  healthcheck:
    enabled: true
    interval: 30s
partitions:
  - name: default
     queues:
        - name: root

We want to include all "global" (non-partition-specific) configs under a single parent "scheduler", including "checksum". I suggest using another JIRA to track the changes to the "checksum", just to make the PR smaller.

Thank you for pointing this out, it was a misunderstanding on my part 🤦🏼‍♂️. I have created a YUNIKORN-1220 for the checksum.

  1. Please create a JIRA to track the document changes.

YUNIKORN-1219

  1. Please make sure the test case includes both having healthcheck section configured and not having this section.

I did not include a test case for this scenario because this test should cover the default condition: https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker_test.go#L64-L76

I can see the need for an explicit test to catch the case though and have added one.

surahman avatar May 19 '22 17:05 surahman

I have examined all the functions for metrics generation in the HealthCheck and they are all called locally from the runOnce routine and the call tree that stems there. Given this, we can probably leave the routines as-is and add log messages at the Warn level to indicate that health metric reporting is disabled.

There are also two methods that are exported and only ever called locally and I wonder if we can switch them to not be exported? https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L129 https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L146

What are your thoughts @yangwwei and @craigcondit?

surahman avatar May 19 '22 21:05 surahman

I have examined all the functions for metrics generation in the HealthCheck and they are all called locally from the runOnce routine and the call tree that stems there. Given this, we can probably leave the routines as-is and add log messages at the Warn level to indicate that health metric reporting is disabled.

Let's not log anything if the reporting is disabled -- if it is, that's a conscious decision on the admin's part so let's not punish him by spamming the logs.

There are also two methods that are exported and only ever called locally and I wonder if we can switch them to not be exported?

https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L129

https://github.com/apache/yunikorn-core/blob/aacdb887fb5ff02441d5c577d23698086d711f28/pkg/scheduler/health_checker.go#L146

Making these private should be fine.

craigcondit avatar May 20 '22 14:05 craigcondit

CC @wilfred-s -- Wilfred, can you give some input on direction for this as well?

craigcondit avatar May 20 '22 14:05 craigcondit

REST runtime Config updates

Updating the Config object should not represent an issue. updateClusterConfig will generate the parsed updated Config into a struct and then reading/updating values is straight forward.

The issue arises with the HealthCheck daemon process and updating its setting. I was unable to find another daemon process which allows its configuration updated via the Config. The following is my proposed solution after a brief look:

  1. Load and read the enabled and interval directly from the Config instead of the local data structure whenever needed. My concern with this is performance. Would it be significantly slower to get and read from Config compared to data members in the object?
  2. The HealthChecker process would need to run and poll the Config for updates periodically. runOnce would not execute if enabled=false.
  3. Another interval setting for the daemon to poll the Config for updates? Or do we set/use the HealthCheck interval with another default check value to poll for updates if enabled=false? We can poll for updates at the regular inerval if enabled=true.
  4. stop will remain as it initially was because the HealthCheck daemon process will consistently be running.
  5. I am unsure if the primary purpose of disabling the HealthCheck was for performance/overhead but only running the daemon would be lightweight.

Another route would be to leverage RPCs or some sort of Observer pattern with chans to propagate updates.

surahman avatar May 20 '22 18:05 surahman

hi @surahman , @craigcondit can we address the hot-refresh in the next JIRA? In this one, let's focus on getting this configurable. @wilfred-s need your input here about the format of the config file, we need to get to a consensus so @surahman can move forward. thanks!

yangwwei avatar May 20 '22 23:05 yangwwei