
Address scalability issue when Node Watcher is enabled

Open xing-yang opened this issue 4 years ago • 22 comments

We have an issue https://github.com/kubernetes-csi/external-health-monitor/issues/75 to change the code to only watch Pods and Nodes when the Node Watcher component is enabled. We still need to address the scalability issue when Node Watcher is enabled:

kubernetes/kubernetes#102452 (comment)

xing-yang avatar Jun 14 '21 19:06 xing-yang
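For context, here is a minimal sketch (in Go with client-go, not the actual controller code) of the shape of the change issue #75 asks for: only create the Pod and Node informers, and hence their persistent watches, when the node-watcher component is enabled. The flag name and wiring are hypothetical.

```go
// Hypothetical sketch: register Pod/Node informers only when the node
// watcher is enabled, so a default deployment adds no extra watch load.
package main

import (
	"flag"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Hypothetical flag; the real controller's option name may differ.
	enableNodeWatcher := flag.Bool("enable-node-watcher", false,
		"watch Pods and Nodes to detect node failures")
	flag.Parse()

	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	factory := informers.NewSharedInformerFactory(client, 0)

	// The health monitor always needs the volume-side informers.
	_ = factory.Core().V1().PersistentVolumeClaims().Informer()
	_ = factory.Core().V1().PersistentVolumes().Informer()

	// Pod and Node informers (and their watches) are created only when
	// the node watcher is actually enabled.
	if *enableNodeWatcher {
		_ = factory.Core().V1().Pods().Informer()
		_ = factory.Core().V1().Nodes().Informer()
	}

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	<-stopCh
}
```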

@NickrenREN I wonder if you've seen a similar issue in production.

xing-yang avatar Jun 15 '21 00:06 xing-yang

Node Watcher is a single-instance controller, so what is the scalability issue?

NickrenREN avatar Jun 15 '21 03:06 NickrenREN

@NickrenREN It affects the e2e tests. Details are in this issue: https://github.com/kubernetes/kubernetes/issues/102452

When the external-health-monitor was disabled, the failure went away.

xing-yang avatar Jun 15 '21 13:06 xing-yang

IIUC, the root cause of the scalability issue you mention is that Node Watcher watches PVCs, Nodes, and Pods? I just don't understand why: the k8s default scheduler does the same thing.

NickrenREN avatar Jun 16 '21 03:06 NickrenREN

Watch is a persistent connection, and Node Watcher is a single-instance controller. Is this really the root cause?

NickrenREN avatar Jun 16 '21 03:06 NickrenREN

I saw a lot of API throttling, so maybe we can decrease the API call frequency?

NickrenREN avatar Jun 16 '21 03:06 NickrenREN
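As an aside on the watch-vs-throttling question above, here is a minimal sketch, assuming a standard client-go setup, of the difference between reads served from a shared informer's cache (backed by one long-lived watch per resource) and direct API calls, which are what the client-side rate limiter throttles. The node name is purely illustrative.

```go
// Sketch only: contrast cache-backed reads (one persistent watch) with
// direct API calls (subject to client-side throttling).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 0)
	nodeLister := factory.Core().V1().Nodes().Lister()

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Served from the informer's local cache: no extra request to the
	// API server, so it never shows up as client-side throttling.
	if node, err := nodeLister.Get("node-1"); err == nil {
		fmt.Println("from cache:", node.Name)
	}

	// A direct GET: every call like this goes to the API server and is
	// what the client-go rate limiter throttles when there are too many.
	if node, err := client.CoreV1().Nodes().Get(
		context.TODO(), "node-1", metav1.GetOptions{}); err == nil {
		fmt.Println("from API server:", node.Name)
	}
}
```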

Watch is a persistent connection, and Node Watcher is a single-instance controller. Is this really the root cause?

This needs more investigation. The observation is that the failure went away when external-health-monitor was disabled, came back when it was enabled, and went away again when it was disabled.

I saw a lot of API throttling, so maybe we can decrease the API call frequency?

We could try that.

xing-yang avatar Jun 16 '21 03:06 xing-yang

This needs more investigation. The observation is that the failure went away when external-health-monitor was disabled, came back when it was enabled, and went away again when it was disabled.

This indicates the controller causes the failure (API throttling?), but I still don't think Watch is the root cause.

NickrenREN avatar Jun 16 '21 03:06 NickrenREN

This indicates the controller causes the failure (API throttling?), but I still don't think Watch is the root cause.

The external-health-monitor controller added more load to the API server, which might have triggered those failures.

xing-yang avatar Jun 16 '21 03:06 xing-yang

The external-health-monitor controller added more load to the API server, which might have triggered those failures.

I agree, so we can try to decrease the API call frequency first.

NickrenREN avatar Jun 16 '21 03:06 NickrenREN
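A minimal sketch of two knobs that could lower the controller's API call frequency, assuming the standard client-go rate limiter and workqueue; the values are illustrative, not recommendations, and the function names are hypothetical.

```go
// Sketch only: two places a controller can be slowed down.
package throttling

import (
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/workqueue"
)

// newThrottledClient sets an explicit, low client-side rate limit so the
// controller cannot flood the API server when many PVCs/Pods/Nodes change.
func newThrottledClient(cfg *rest.Config) *kubernetes.Clientset {
	cfg.QPS = 2   // sustained requests per second (illustrative value)
	cfg.Burst = 5 // short burst allowance (illustrative value)
	return kubernetes.NewForConfigOrDie(cfg)
}

// newSlowQueue backs off aggressively on retries so failing items do not
// generate a steady stream of API calls.
func newSlowQueue() workqueue.RateLimitingInterface {
	return workqueue.NewRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute))
}
```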

I would like to work on this issue. I'll start looking into it to understand it.

sonasingh46 avatar Aug 25 '21 17:08 sonasingh46

/assign

sonasingh46 avatar Aug 25 '21 17:08 sonasingh46

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 23 '21 17:11 k8s-triage-robot

/remove-lifecycle stale

xing-yang avatar Dec 07 '21 03:12 xing-yang

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 07 '22 04:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Apr 06 '22 05:04 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar May 06 '22 05:05 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar May 06 '22 05:05 k8s-ci-robot

/reopen

pohly avatar Aug 08 '22 12:08 pohly

@pohly: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 08 '22 12:08 k8s-ci-robot

/lifecycle frozen

pohly avatar Aug 08 '22 12:08 pohly

/assign

mowangdk avatar Aug 30 '24 02:08 mowangdk