external-health-monitor
Address scalability issue when Node Watcher is enabled
We have an issue, https://github.com/kubernetes-csi/external-health-monitor/issues/75, to change the code so that Pods and Nodes are only watched when the Node Watcher component is enabled. We still need to address the scalability issue that shows up when Node Watcher is enabled:
kubernetes/kubernetes#102452 (comment)
@NickrenREN I wonder if you've seen a similar issue in production.
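For reference, a minimal sketch of what gating the extra watches could look like with client-go shared informers. This is not the controller's actual wiring; the --enable-node-watcher flag name and the resync period are assumptions for illustration only.

```go
package main

import (
	"context"
	"flag"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Hypothetical flag; the real sidecar may expose this option differently.
	enableNodeWatcher := flag.Bool("enable-node-watcher", false,
		"Enable the Node Watcher component (adds Pod and Node watches).")
	flag.Parse()

	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)

	// PVC volume health monitoring always needs the PVC informer.
	pvcInformer := factory.Core().V1().PersistentVolumeClaims().Informer()
	_ = pvcInformer // event handlers omitted in this sketch

	// Register the Pod and Node informers only when Node Watcher is enabled,
	// so a default deployment does not open these watches or pay for the
	// initial List of every Pod and Node in the cluster.
	if *enableNodeWatcher {
		_ = factory.Core().V1().Pods().Informer()
		_ = factory.Core().V1().Nodes().Informer()
	}

	ctx := context.Background()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())
	// Controller run loop omitted.
}
```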
Node Watcher is a single-instance controller, so what is the scalability issue?
@NickrenREN It affects the e2e tests. Details are in this issue: https://github.com/kubernetes/kubernetes/issues/102452
When the external-health-monitor was disabled, the failure went away.
IIUC, the root cause of the scalability issue you mention is that Node Watcher watches PVCs, Nodes, and Pods? I just don't understand the reason; the k8s default scheduler does the same thing.
A watch is a persistent connection, and Node Watcher is a single-instance controller. Is this really the root cause?
I saw a lot of API throttling, so maybe we can decrease the API call frequency?
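For context on the watch question: with client-go shared informers, each watched resource costs one initial List plus a single long-lived watch connection, and later reads are served from the local cache rather than by extra requests to the API server. A minimal, self-contained sketch (the node name "node-1" is just a placeholder):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// One shared informer per resource: one List at startup plus one
	// persistent watch, no matter how many readers share the cache.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	nodeLister := factory.Core().V1().Nodes().Lister()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	// Served from the informer's in-memory cache: no API request is made,
	// so reads like this never count against the client-side rate limiter.
	node, err := nodeLister.Get("node-1") // placeholder node name
	if err != nil {
		fmt.Println("node not found in cache:", err)
		return
	}
	fmt.Println("read from cache:", node.Name)
}
```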
> A watch is a persistent connection, and Node Watcher is a single-instance controller. Is this really the root cause?

This needs more investigation. The observation is that the failure went away when the external-health-monitor was disabled, came back when it was enabled, and went away again when it was disabled.

> I saw a lot of API throttling, so maybe we can decrease the API call frequency?
We could try that.
> This needs more investigation. The observation is that the failure went away when the external-health-monitor was disabled, came back when it was enabled, and went away again when it was disabled.

This indicates that the controller causes the failure (API throttling?), but I still don't think the watch is the root cause.

> This indicates that the controller causes the failure (API throttling?), but I still don't think the watch is the root cause.

The external-health-monitor controller added more load to the API server, which might have triggered those failures.

> The external-health-monitor controller added more load to the API server, which might have triggered those failures.
I agree, so we can try to decrease the API call frequency first.
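If the call frequency is capped on the client side, the relevant knob lives on the client-go rest.Config: its QPS and Burst fields drive the client-side rate limiter, which is typically what surfaces as API throttling in the logs. A minimal sketch; the flag names below are hypothetical, so check the sidecar's actual command-line options:

```go
package main

import (
	"flag"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Hypothetical flag names, used here for illustration only.
	kubeAPIQPS := flag.Float64("kube-api-qps", 5,
		"Client-side QPS limit for requests to the API server.")
	kubeAPIBurst := flag.Int("kube-api-burst", 10,
		"Client-side burst limit for requests to the API server.")
	flag.Parse()

	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}

	// client-go delays requests once this budget is exceeded; lowering the
	// values spaces the controller's calls out and reduces pressure on the
	// API server, at the cost of slower reconciliation.
	config.QPS = float32(*kubeAPIQPS)
	config.Burst = *kubeAPIBurst

	clientset := kubernetes.NewForConfigOrDie(config)
	_ = clientset // controller construction omitted in this sketch
}
```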
I would like to work on this issue. I will start looking into it to understand the problem.
/assign
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closing this issue.
In response to this:

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, lifecycle/stale is applied
> - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
> - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with /reopen
> - Mark this issue or PR as fresh with /remove-lifecycle rotten
> - Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@pohly: Reopened this issue.
In response to this:

> /reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lifecycle frozen
/assign