self-node-remediation Configurable minimum worker nodecount 2024 10 02

Why we need this PR

Add support for a custom amount of peers

Changes made

Which issue(s) this PR fixes

Test plan

Oct 02 '24 17:10 novasbc

/test 4.15-openshift-e2e

Oct 02 '24 17:10 novasbc

Hi @novasbc. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Oct 02 '24 17:10 openshift-ci[bot]

@novasbc: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test 4.15-openshift-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Oct 02 '24 17:10 openshift-ci[bot]

Hi, do you mind extending the description please? What's the issue, how do you fix it, how do you test the changes... Also, please check the failed test. Thanks

Oct 08 '24 09:10 slintes

Hi, do you mind extending the description please? What's the issue, how do you fix it, how do you test the changes... Also, please check the failed test. Thanks

@slintes I updated the description, included the issue # as well.

Also, fixed the build which was failing with 'make verify-bundle', because the bundle hadn't been updated.

Oct 16 '24 15:10 novasbc

Thanks!

/test 4.16-openshift-e2e

Oct 17 '24 09:10 slintes

fixed an issue which was causing a failure with make test, regarding rebooter being nil

Oct 17 '24 20:10 novasbc

/test 4.15-openshift-e2e

Oct 18 '24 14:10 novasbc

/test 4.16-openshift-e2e

Oct 18 '24 14:10 novasbc

/test 4.15-openshift-e2e

Oct 22 '24 21:10 novasbc

/test 4.13-openshift-e2e

Oct 22 '24 21:10 novasbc

/test 4.13-openshift-e2e

Oct 23 '24 13:10 novasbc

@razo7 @mshitrit

I looked into the e2e failures reported over the past few days and realized that it was due to temporary/environmental issues. When I re-ran they started passing better. We can't run the tests in an openshift environment, so weren't seeing the same things locally.

Anyhow, I believe this is ready for review.

Thanks!

Oct 23 '24 17:10 novasbc

/retest

Nov 19 '24 15:11 novasbc

@slintes

hm, a unit test for this would be nice. I'm wondering if it's possible to add a test case similar to this one, but without peer and minPeers configured?

https://github.com/medik8s/self-node-remediation/blob/main/controllers/tests/controller/selfnoderemediation_controller_test.go#L438-L464

We have been unable to properly run e2e tests like the one you referenced in our infrastructure (not openshift) - we spent some decent cycles trying to get it to work.

Are you actually suggesting a unit test, or e2e? We did update one of the config unit tests, but that was super trivial & minor.

Nov 19 '24 15:11 novasbc

If it's the case of an e2e test, we can try and write it and submit and run it through your infrastructure and some iterations - I just expect that to take quite some time to get right w/o the ability to run and debug locally.

I'd certainly be willing to have our team put some more effort getting it running - but would love to potentially get this set of changes pushed upstream so that we are no longer using our locally forked version - which makes build & test additionally difficult.

Nov 19 '24 15:11 novasbc

/test 4.12-openshift-e2e

Nov 19 '24 19:11 novasbc

@novasbc: The specified target(s) for /test were not found. The following commands are available to trigger required jobs:

/test 4.14-ci-bundle-self-node-remediation-bundle
/test 4.14-images
/test 4.14-openshift-e2e
/test 4.14-test
/test 4.15-ci-bundle-self-node-remediation-bundle
/test 4.15-images
/test 4.15-openshift-e2e
/test 4.15-test
/test 4.16-ci-bundle-self-node-remediation-bundle
/test 4.16-images
/test 4.16-openshift-e2e
/test 4.16-test
/test 4.17-ci-bundle-self-node-remediation-bundle
/test 4.17-images
/test 4.17-openshift-e2e
/test 4.17-test
/test 4.18-ci-bundle-self-node-remediation-bundle
/test 4.18-images
/test 4.18-openshift-e2e
/test 4.18-test

Use /test all to run all jobs.

In response to this:

/test 4.12-openshift-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Nov 19 '24 19:11 openshift-ci[bot]

/retest

Nov 19 '24 20:11 novasbc

Are you actually suggesting a unit test

I wrote unit test, and pointed to a unit test, so this is a yes :)

but would love to potentially get this set of changes pushed upstream

And we would like to have a test that verifies that at least setting minPeersForRemediation to 0 has the desired effect.

Nov 20 '24 06:11 slintes

Are you actually suggesting a unit test

I wrote unit test, and pointed to a unit test, so this is a yes :)

but would love to potentially get this set of changes pushed upstream

And we would like to have a test that verifies that at least setting minPeersForRemediation to 0 has the desired effect.

Apologies, when I looked at the linked code, I thought it was actually executing as an e2e test, will look more closely. Locally, they were not running with the unit tests with make test

Nov 20 '24 20:11 novasbc

Confirmed, I can get a unit test running in that context, it was misunderstanding of what was actually going on with the linked context.

Nov 20 '24 21:11 novasbc

/retest-required

Jan 23 '25 02:01 novasbc

@mshitrit cleaned up leftovers, and moved things around per requests. Looked at the latest updated commit with fresh eyes, and does seem pretty clean.

pretty sure the last time i looked at it my eyes were bleeding from maintaining a local version which still has the debug changes and some other stuff that can't go in yet 😔

Jan 23 '25 15:01 novasbc

/hold

I'm having a hard to spot the difference between the old existing test and the new one, and either I'm misunderstanding something, the old test is broken, or the new doesn't add any value? Atm I tend to the 2nd option (why should it trigger a reboot / stop the the watchdog?). Let me do some tests on my machine...

Jan 23 '25 15:01 slintes

/hold

I'm having a hard to spot the difference between the old existing test and the new one, and either I'm misunderstanding something, the old test is broken, or the new doesn't add any value? Atm I tend to the 2nd option (why should it trigger a reboot / stop the the watchdog?). Let me do some tests on my machine...

I believe the old test is broken. When I step through the tests I find that the watchdog has already been triggered and is not in a ready state.

Part of the reason that I had so many misses on debug code hitting the PR here is because I backed out a bunch of changes I did to fix the watchdog issue.

Basically in the original test, I find that it exits out very quickly and still passes.

Jan 23 '25 16:01 novasbc

My local branch has additions on the original test to validate that the watchdog isn't triggered.

In the new test it validated that it WAS triggered.

Jan 23 '25 16:01 novasbc

I think the old test succeeds, even though it shouldn't, because of this change: https://github.com/medik8s/self-node-remediation/pull/220/commits/cda2f3f9a5b93e592a88eb6b1fc7b0c85a250623#diff-9a251a7c17883073c468864af09499f7005e3dd80e4624c1c63c53868450b087R86-R92 When I remove the cancel() calls, the test fails (as I would expect). Those cancel calls stop the k8s manager, which stops its components, and the watchdog is one of them, so the watchdog disarms, and the test succeeds (no new feed found).

However, why did the test not fail before above change? And why was the test even added like this? Without peers, and with default config, nothing should happen. And obviously when running in a pod, it works like expected, that's the issue you actually have, right? No reboot ever without peers 🤷🏼‍♂️

I will discuss this with the team, and continue investigation tomorrow, and hopefully provide a fix asap...

Jan 23 '25 16:01 slintes

However, why did the test not fail before above change? And why was the test even added like this? Without peers, and with default config, nothing should happen. And obviously when running in a pod, it works like expected, that's the issue you actually have, right? No reboot ever without peers 🤷🏼‍♂️

I will discuss this with the team, and continue investigation tomorrow, and hopefully provide a fix asap...

I really struggled trying to make another test based on the original because I couldn't understand what it was intended to do precisely.

Spent quite a many hours walking through what it was doing, and tried to infer it - and like I said, updated the test to something that appeared to be working, and validating an actual useful behavior.

If it would help, I can post the branch to my private repo and you can take a look.

I was able to get it such that the watchdog was reset at the beginning of the tests each time - I had to add a reset method and a few other things to accomplish that, but I ended up with some other problems with the k8sclient and had to abandon ship at the moment due to time constraints, and really desiring to get the core change pushed someway somehow.

When I'm back to my corporate laptop I'll give you a bit more detail

Jan 23 '25 16:01 novasbc

When I remove the cancel() calls, the test fails

but only when it's focused... 🤔

Jan 23 '25 17:01 slintes

self-node-remediation self-node-remediation copied to clipboard

Configurable minimum worker nodecount 2024 10 02

Why we need this PR

Changes made

Which issue(s) this PR fixes

Test plan

self-node-remediation
self-node-remediation copied to clipboard