
Address scalability issue when Node Watcher is enabled

Open xing-yang opened this issue 4 years ago • 22 comments

We have an issue https://github.com/kubernetes-csi/external-health-monitor/issues/75 to change the code to only watch Pods and Nodes when the Node Watcher component is enabled. We still need to address the scalability issue when Node Watcher is enabled:

kubernetes/kubernetes#102452 (comment)

xing-yang avatar Jun 14 '21 19:06 xing-yang
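For context, here is a minimal sketch (in Go with client-go, not the actual controller code) of the shape of the change issue #75 asks for: only create the Pod and Node informers, and hence their persistent watches, when the node-watcher component is enabled. The flag name and wiring are hypothetical.

```go
// Hypothetical sketch: register Pod/Node informers only when the node
// watcher is enabled, so a default deployment adds no extra watch load.
package main

import (
	"flag"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Hypothetical flag; the real controller's option name may differ.
	enableNodeWatcher := flag.Bool("enable-node-watcher", false,
		"watch Pods and Nodes to detect node failures")
	flag.Parse()

	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	factory := informers.NewSharedInformerFactory(client, 0)

	// The health monitor always needs the volume-side informers.
	_ = factory.Core().V1().PersistentVolumeClaims().Informer()
	_ = factory.Core().V1().PersistentVolumes().Informer()

	// Pod and Node informers (and their watches) are created only when
	// the node watcher is actually enabled.
	if *enableNodeWatcher {
		_ = factory.Core().V1().Pods().Informer()
		_ = factory.Core().V1().Nodes().Informer()
	}

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	<-stopCh
}
```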

@NickrenREN I wonder if you've seen a similar issue in production.

xing-yang avatar Jun 15 '21 00:06 xing-yang

Node Watcher is a single-instance controller, so what is the scalability issue?

NickrenREN avatar Jun 15 '21 03:06 NickrenREN

@NickrenREN It affects the e2e tests. Details are in this issue: https://github.com/kubernetes/kubernetes/issues/102452

When the external-health-monitor was disabled, the failure went away.

xing-yang avatar Jun 15 '21 13:06 xing-yang

IIUC, the root cause of the scalability issue you mention is that Node Watcher watches PVCs, Nodes, and Pods? I just don't understand why: the k8s default scheduler does the same thing.

NickrenREN avatar Jun 16 '21 03:06 NickrenREN

Watch is a persistent connection, and Node Watcher is a single-instance controller. Is this really the root cause?

NickrenREN avatar Jun 16 '21 03:06 NickrenREN

I saw a lot of API throttling, so maybe we can decrease the API call frequency?

NickrenREN avatar Jun 16 '21 03:06 NickrenREN
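As an aside on the watch-vs-throttling question above, here is a minimal sketch, assuming a standard client-go setup, of the difference between reads served from a shared informer's cache (backed by one long-lived watch per resource) and direct API calls, which are what the client-side rate limiter throttles. The node name is purely illustrative.

```go
// Sketch only: contrast cache-backed reads (one persistent watch) with
// direct API calls (subject to client-side throttling).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 0)
	nodeLister := factory.Core().V1().Nodes().Lister()

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Served from the informer's local cache: no extra request to the
	// API server, so it never shows up as client-side throttling.
	if node, err := nodeLister.Get("node-1"); err == nil {
		fmt.Println("from cache:", node.Name)
	}

	// A direct GET: every call like this goes to the API server and is
	// what the client-go rate limiter throttles when there are too many.
	if node, err := client.CoreV1().Nodes().Get(
		context.TODO(), "node-1", metav1.GetOptions{}); err == nil {
		fmt.Println("from API server:", node.Name)
	}
}
```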

Watch is a persistent connection, and Node Watcher is a single-instance controller. Is this really the root cause?

This needs more investigation. The observation is that the failure went away when external-health-monitor was disabled, came back when it was enabled, and went away again when it was disabled.

I saw a lot of API throttling, so maybe we can decrease the API call frequency?

We could try that.

xing-yang avatar Jun 16 '21 03:06 xing-yang

This needs more investigation. The observation is that the failure went away when external-health-monitor was disabled, came back when it was enabled, and went away again when it was disabled.

This indicates the controller causes the failure (API throttling?), but I still don't think Watch is the root cause.

NickrenREN avatar Jun 16 '21 03:06 NickrenREN

This indicates the controller causes the failure (API throttling?), but I still don't think Watch is the root cause.

The external-health-monitor controller added more load to the API server, which might have triggered those failures.

xing-yang avatar Jun 16 '21 03:06 xing-yang

The external-health-monitor controller added more load to the API server, which might have triggered those failures.

I agree, so we can try to decrease the API call frequency first.

NickrenREN avatar Jun 16 '21 03:06 NickrenREN
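A minimal sketch of two knobs that could lower the controller's API call frequency, assuming the standard client-go rate limiter and workqueue; the values are illustrative, not recommendations, and the function names are hypothetical.

```go
// Sketch only: two places a controller can be slowed down.
package throttling

import (
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/workqueue"
)

// newThrottledClient sets an explicit, low client-side rate limit so the
// controller cannot flood the API server when many PVCs/Pods/Nodes change.
func newThrottledClient(cfg *rest.Config) *kubernetes.Clientset {
	cfg.QPS = 2   // sustained requests per second (illustrative value)
	cfg.Burst = 5 // short burst allowance (illustrative value)
	return kubernetes.NewForConfigOrDie(cfg)
}

// newSlowQueue backs off aggressively on retries so failing items do not
// generate a steady stream of API calls.
func newSlowQueue() workqueue.RateLimitingInterface {
	return workqueue.NewRateLimitingQueue(
		workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute))
}
```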

I would like to work on this issue. I'll start looking into it to understand it.

sonasingh46 avatar Aug 25 '21 17:08 sonasingh46

/assign

sonasingh46 avatar Aug 25 '21 17:08 sonasingh46

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 23 '21 17:11 k8s-triage-robot

/remove-lifecycle stale

xing-yang avatar Dec 07 '21 03:12 xing-yang

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 07 '22 04:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Apr 06 '22 05:04 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar May 06 '22 05:05 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar May 06 '22 05:05 k8s-ci-robot

/reopen

pohly avatar Aug 08 '22 12:08 pohly

@pohly: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 08 '22 12:08 k8s-ci-robot

/lifecycle frozen

pohly avatar Aug 08 '22 12:08 pohly

/assign

mowangdk avatar Aug 30 '24 02:08 mowangdk