
tainting and untainting logic implemented via configuration

Open bilalcaliskan opened this issue 4 years ago • 67 comments

Which issue(s) this PR fixes:

This PR fixes issue #457.

What this PR does / why we need it:

This PR adds the ability to conditionally taint and untaint a node under specific circumstances. With this improvement, node-problem-detector can be used in conjunction with the descheduler. Users should specify taintEnabled, taintKey, taintValue and taintEffect in config/kernel-monitor.json. If not specified, taintEnabled defaults to false, so NPD will not taint any node. With this improvement, node-problem-detector also removes the taint once the problem is resolved.
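For illustration, a config/kernel-monitor.json using the new fields might look roughly like the sketch below. Only the four taint* field names come from the description above; their top-level placement in the monitor config and the example key, value and effect are assumptions, not the PR's exact schema.

    {
      "plugin": "kmsg",
      "logPath": "/dev/kmsg",
      "source": "kernel-monitor",
      "taintEnabled": true,
      "taintKey": "node-problem-detector/kernel-problem",
      "taintValue": "true",
      "taintEffect": "NoSchedule",
      "conditions": [
        {
          "type": "KernelDeadlock",
          "reason": "KernelHasNoDeadlock",
          "message": "kernel has no deadlock"
        }
      ]
    }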

Special notes for your reviewer:

This improvement requires updating the ClusterRole used by node-problem-detector. If that PR somehow gets merged to master, the update verb must be added right here.
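For context, the resulting ClusterRole rule for nodes would need to look something like the sketch below. This is not the project's actual RBAC manifest, just an illustration of the verbs NPD would need in order to add and remove taints.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: node-problem-detector
    rules:
    - apiGroups: [""]
      resources: ["nodes"]
      verbs: ["get", "patch", "update"]  # "update" added so NPD can taint/untaint nodes
    - apiGroups: [""]
      resources: ["nodes/status"]
      verbs: ["patch"]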

bilalcaliskan avatar May 22 '21 20:05 bilalcaliskan

Hi @bilalcaliskan. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar May 22 '21 20:05 k8s-ci-robot

I have modified the commit message, which is why there is a lot of activity since the PR was opened; sorry for that.

bilalcaliskan avatar May 22 '21 20:05 bilalcaliskan

/retest

bilalcaliskan avatar May 23 '21 10:05 bilalcaliskan

@bilalcaliskan: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar May 23 '21 10:05 k8s-ci-robot

/retest

bilalcaliskan avatar Jun 06 '21 10:06 bilalcaliskan

@bilalcaliskan: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 06 '21 10:06 k8s-ci-robot

/assign @xueweiz @andyxning

bilalcaliskan avatar Jun 06 '21 10:06 bilalcaliskan

We discussed this a long time ago when we first started NPD. The taint can be applied by NPD itself, or it can be applied by another controller that takes the NPD-generated conditions and events as a signal. Having another controller is the pattern most people use right now, because a cluster-level controller has a global view and a better understanding of whether it is OK to taint a node or not. For example, if 5 out of 10 nodes have an issue, you may not want the NPD on each node to taint it; that may leave a lot of pods with nowhere to run. Instead, a controller could drain the bad nodes one by one and recreate/repair them.

Random-Liu avatar Jun 25 '21 16:06 Random-Liu

According to the Remedy Systems section, NPD is able to add conditions to nodes that will eventually taint them, and then the descheduler will evict pods that don't respect the taint.

Or we could take advantage of Taint Based Eviction instead of the descheduler: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions
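For example, once a NoExecute taint is on the node, eviction is driven purely by pod tolerations. Something like the sketch below (the taint key is a made-up placeholder) would keep a pod around for 300 seconds after the taint appears and then evict it.

    apiVersion: v1
    kind: Pod
    metadata:
      name: taint-eviction-demo
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
      tolerations:
      - key: "example.com/kernel-problem"   # placeholder taint key
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 300              # evicted 300s after the taint appears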

Currently, the descheduler supports RemovePodsViolatingNodeTaints and NPD can add conditions; nevertheless, there is no way to rely on only those two components to drain nodes automatically. Draino is still required, am I right?
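For reference, enabling that strategy on the descheduler side looks roughly like this (a sketch of the v1alpha1 policy format; double-check against the descheduler docs for your version):

    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemovePodsViolatingNodeTaints":
        enabled: true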

azman0101 avatar Jun 30 '21 08:06 azman0101

Conditions are mostly informative.

A taint will actually affect scheduling decisions, e.g. not scheduling any pods to a node any more, or evicting running pods from a node. That kind of decision should be made by a cluster-level controller; otherwise, in the extreme case where half the nodes decide to taint themselves and evict their pods, the cluster may not have enough resources to run those pods.

Random-Liu avatar Jun 30 '21 18:06 Random-Liu

Conditions are mostly informative.

A taint will actually affect scheduling decisions, e.g. not scheduling any pods to a node any more, or evicting running pods from a node. That kind of decision should be made by a cluster-level controller; otherwise, in the extreme case where half the nodes decide to taint themselves and evict their pods, the cluster may not have enough resources to run those pods.

I agree 👍 I'm just saying that the Remedy Systems section is misleading, because it says that NPD + the descheduler can do the job together.

I want to avoid using Draino, so I wonder how to taint based on a custom condition 🤔

azman0101 avatar Jun 30 '21 18:06 azman0101

NPD can work with the descheduler only via the predefined conditions (Ready, MemoryPressure, ...). So the current design is, in my opinion, definitely limited. I understand the decision to split responsibility, but the documentation is misleading.

alexispires avatar Jun 30 '21 21:06 alexispires

I agree with @azman0101: the descheduler does not do tainting by itself. According to the Remedy Systems section we can use the descheduler for that purpose, but sadly I guess there is just one option for that, and it's Draino.

@Random-Liu as far as I know, the descheduler does the job only if the specified taints are already on the node. So we should use 3 different components (NPD, Draino and the descheduler in that scenario) to taint nodes on a specific NodeCondition? I guess that's too much effort, especially if you run over 10 k8s clusters in production.

bilalcaliskan avatar Jul 01 '21 23:07 bilalcaliskan

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Nov 12 '21 10:11 k8s-triage-robot

/remove-lifecycle stale

vteratipally avatar Nov 13 '21 12:11 vteratipally

@andyxning can you please review this PR?

Regards.

bilalcaliskan avatar Dec 26 '21 12:12 bilalcaliskan

@Random-Liu @vteratipally @andyxning @xueweiz hello folks!

Is there any chance for this PR to be merged to master? It's been waiting for a long time, but if there is no chance of it being merged, I think it is time to close it.

My personal opinion: I love node-problem-detector very much, and I hope this feature will eventually be implemented, even if not with my PR. It will provide much more control to end users and decrease the complexity of combining multiple tools.

bilalcaliskan avatar Jan 14 '22 19:01 bilalcaliskan

/ok-to-test

vteratipally avatar Jan 14 '22 19:01 vteratipally

@vteratipally I am seeing logs like the ones below on the failing e2e tests; do you have any idea about these? I think these logs are not related to the feature added in this PR. Is there anything else that needs to be done before running the e2e tests? It looks like an environmental problem, but I can't be sure whether I am missing something :(

BeforeSuite on Node 1 failed. "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory". Reconnecting...

Ginkgo timed out waiting for all parallel procs to report back

Jan 14 20:02:05.626: Timed out after 60.000s.

STEP: Check node-problem-detector posted KernelDeadlock condition on node "e2e-ffa7072c00-429e8-minion-group-1933"
Jan 14 20:02:06.200: FAIL: Timed out after 60.000s.
Expected success, but got an error:
    <*errors.errorString | 0xc000856330>: { s: "node condition \"KernelDeadlock\" not found", }
node condition "KernelDeadlock" not found

bilalcaliskan avatar Jan 19 '22 21:01 bilalcaliskan

/retest-required

bilalcaliskan avatar Jan 19 '22 21:01 bilalcaliskan

This is actually a known issue: tests have been failing because of ongoing changes in the kubernetes repo related to the dockershim removal. I am working on it.

vteratipally avatar Jan 19 '22 21:01 vteratipally

/retest-required

bilalcaliskan avatar Jan 24 '22 21:01 bilalcaliskan

By the way, this PR looks like it is waiting for another PR on kubernetes/test-infra; just writing that down as a note. And here is the related issue.

bilalcaliskan avatar Feb 04 '22 07:02 bilalcaliskan

Since https://github.com/kubernetes/test-infra/pull/24914 is merged, please /retest

bilalcaliskan avatar Mar 29 '22 19:03 bilalcaliskan

Hello again! First of all, @vteratipally thank you for your help!

As reviewers, @andyxning @xueweiz WDYT about this change? As I can see from the PR history, lots of people will be very happy if this is merged :)

By the way, merging my own PR is less important than the feature itself. I would also be glad if anyone else developed this feature with a better implementation.

bilalcaliskan avatar Apr 01 '22 12:04 bilalcaliskan

/retest

vteratipally avatar Apr 22 '22 18:04 vteratipally

Any updates?

rhockenbury avatar May 05 '22 02:05 rhockenbury

Any Updates?

bryce-wilkinson-rkt avatar Jun 28 '22 03:06 bryce-wilkinson-rkt

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 26 '22 04:09 k8s-triage-robot

I've played with the taint feature locally for a while. The test was made with some custom plugins from the uswitch repo.

Under some script scenarios, a node condition can be set to Unknown (e.g. a script returning a status code > 1) and the taint is not applied.

This may be due to the code only handling conditions of True and False.
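If that is the case, the gap would look something like the following sketch. This is purely hypothetical, not the actual code from this PR; it only illustrates the behavior described above.

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
    )

    // decideTaintAction illustrates the suspected behavior: only explicit
    // True/False condition statuses map to a taint action, so Unknown
    // (e.g. a plugin script exiting with a code > 1) changes nothing.
    func decideTaintAction(status v1.ConditionStatus) string {
        switch status {
        case v1.ConditionTrue:
            return "apply taint"
        case v1.ConditionFalse:
            return "remove taint"
        default: // v1.ConditionUnknown never triggers a taint change
            return "no action"
        }
    }

    func main() {
        fmt.Println(decideTaintAction(v1.ConditionUnknown)) // prints "no action"
    }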

Is this intended behavior? If so, could it be documented or adapted to handle Unknown scenarios?

Awesome contribution @bilalcaliskan :+1:

tioxy avatar Oct 21 '22 17:10 tioxy