self-node-remediation
Configurable minimum worker node count (2024-10-02)
Why we need this PR
Add support for a configurable number of peers required for remediation.
Changes made
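In rough terms, the change exposes a minPeersForRemediation knob on the SelfNodeRemediationConfig and gates the remediation decision on it instead of a hard-coded peer requirement. A hypothetical sketch of the shape of the change (field comments and the gating function are illustrative, not the exact diff):

```go
// Illustrative sketch only; not the PR's actual diff.
package v1alpha1

// SelfNodeRemediationConfigSpec (fragment): the proposed knob.
type SelfNodeRemediationConfigSpec struct {
	// MinPeersForRemediation is the minimum number of peer workers a node
	// must have before it will attempt self-remediation. Setting it to 0
	// allows a node with no peers at all to remediate (reboot) itself.
	MinPeersForRemediation int `json:"minPeersForRemediation,omitempty"`
}

// hasEnoughPeers sketches the gate the new field controls: instead of a
// hard-coded "at least one peer", the threshold is configurable, so a
// cluster with a single remaining worker can still self-remediate.
func hasEnoughPeers(knownPeers, minPeers int) bool {
	return knownPeers >= minPeers
}
```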
Which issue(s) this PR fixes
Test plan
/test 4.15-openshift-e2e
Hi @novasbc. Thanks for your PR.
I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@novasbc: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.
In response to this:
/test 4.15-openshift-e2e
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi, do you mind extending the description please? What's the issue, how do you fix it, how do you test the changes... Also, please check the failed test. Thanks
@slintes I updated the description and included the issue # as well.
Also fixed the build, which was failing on 'make verify-bundle' because the bundle hadn't been updated.
Thanks!
/test 4.16-openshift-e2e
Fixed an issue that was causing a failure in make test, where the rebooter was nil.
/test 4.15-openshift-e2e
/test 4.16-openshift-e2e
/test 4.15-openshift-e2e
/test 4.13-openshift-e2e
/test 4.13-openshift-e2e
@razo7 @mshitrit
I looked into the e2e failures reported over the past few days and realized that they were due to temporary/environmental issues; when I re-ran the jobs they started passing. We can't run the tests in an OpenShift environment, so we weren't seeing the same failures locally.
Anyhow, I believe this is ready for review.
Thanks!
/retest
@slintes
Hm, a unit test for this would be nice. I'm wondering if it's possible to add a test case similar to this one, but without peers, and with minPeers configured?
https://github.com/medik8s/self-node-remediation/blob/main/controllers/tests/controller/selfnoderemediation_controller_test.go#L438-L464
We have been unable to properly run e2e tests like the one you referenced in our infrastructure (which is not OpenShift); we spent a fair number of cycles trying to get them to work.
Are you actually suggesting a unit test, or an e2e test? We did update one of the config unit tests, but that change was trivial and minor.
If it's an e2e test, we can try to write it, submit it, and iterate on it through your infrastructure; I just expect that to take quite some time to get right without the ability to run and debug locally.
I'd certainly be willing to have our team put more effort into getting it running, but would love to potentially get this set of changes pushed upstream so that we are no longer using our locally forked version, which makes build and test additionally difficult.
/test 4.12-openshift-e2e
@novasbc: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:
/test 4.14-ci-bundle-self-node-remediation-bundle
/test 4.14-images
/test 4.14-openshift-e2e
/test 4.14-test
/test 4.15-ci-bundle-self-node-remediation-bundle
/test 4.15-images
/test 4.15-openshift-e2e
/test 4.15-test
/test 4.16-ci-bundle-self-node-remediation-bundle
/test 4.16-images
/test 4.16-openshift-e2e
/test 4.16-test
/test 4.17-ci-bundle-self-node-remediation-bundle
/test 4.17-images
/test 4.17-openshift-e2e
/test 4.17-test
/test 4.18-ci-bundle-self-node-remediation-bundle
/test 4.18-images
/test 4.18-openshift-e2e
/test 4.18-test
Use /test all to run all jobs.
In response to this:
/test 4.12-openshift-e2e
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retest
Are you actually suggesting a unit test
I wrote "unit test", and pointed to a unit test, so this is a yes :)
but would love to potentially get this set of changes pushed upstream
And we would like to have a test that verifies that at least setting minPeersForRemediation to 0 has the desired effect.
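For reference, a minimal sketch of the kind of unit test being asked for here, assuming the existing Ginkgo suite; the helpers and the way the outcome is modeled below are illustrative stand-ins, not the suite's actual code:

```go
package controller_test

import (
	"time"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// The helpers below are illustrative placeholders for suite plumbing: in the
// real suite they would configure the controller's peer list and the
// minPeersForRemediation value, and inspect the fake watchdog.
var (
	suitePeers    []string
	suiteMinPeers int
)

func configurePeers(addresses []string) { suitePeers = addresses }

func configureMinPeersForRemediation(n int) { suiteMinPeers = n }

// watchdogTriggered stands in for reading the fake watchdog's state; the real
// suite would ask its dummy watchdog whether it fired.
func watchdogTriggered() bool { return len(suitePeers) >= suiteMinPeers }

var _ = Describe("self node remediation with minPeersForRemediation = 0", func() {
	BeforeEach(func() {
		configurePeers(nil)                // the node has no peers at all
		configureMinPeersForRemediation(0) // remediation allowed without peer confirmation
	})

	It("triggers the watchdog even though no peers can be contacted", func() {
		// Expected effect of the knob: with zero required peers, an unhealthy
		// node proceeds to self-remediation instead of doing nothing.
		Eventually(watchdogTriggered, 10*time.Second, 250*time.Millisecond).Should(BeTrue())
	})
})
```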
Apologies; when I looked at the linked code, I thought it was actually executing as an e2e test. I will look more closely. Locally, those tests were not running with the unit tests under make test.
Confirmed, I can get a unit test running in that context; it was a misunderstanding on my part of what was actually going on with the linked code.
/retest-required
@mshitrit I cleaned up the leftovers and moved things around per your requests. I looked at the latest commit with fresh eyes, and it does seem pretty clean.
Pretty sure the last time I looked at it my eyes were bleeding from maintaining a local version which still has the debug changes and some other stuff that can't go in yet 😔
/hold
I'm having a hard time spotting the difference between the old existing test and the new one, and either I'm misunderstanding something, the old test is broken, or the new one doesn't add any value. Atm I tend to the 2nd option (why should it trigger a reboot / stop the watchdog?). Let me do some tests on my machine...
I believe the old test is broken. When I step through the tests I find that the watchdog has already been triggered and is not in a ready state.
Part of the reason I had so many misses with debug code ending up in this PR is that I backed out a bunch of changes I had made to fix the watchdog issue.
Basically, in the original test I find that it exits very quickly and still passes.
My local branch has additions on the original test to validate that the watchdog isn't triggered.
The new test validates that it WAS triggered.
I think the old test succeeds, even though it shouldn't, because of this change: https://github.com/medik8s/self-node-remediation/pull/220/commits/cda2f3f9a5b93e592a88eb6b1fc7b0c85a250623#diff-9a251a7c17883073c468864af09499f7005e3dd80e4624c1c63c53868450b087R86-R92
When I remove the cancel() calls, the test fails (as I would expect). Those cancel calls stop the k8s manager, which stops its components, and the watchdog is one of them, so the watchdog disarms, and the test succeeds (no new feed found).
However, why did the test not fail before the above change? And why was the test even added like this? Without peers, and with default config, nothing should happen. And obviously when running in a pod, it works as expected, that's the issue you actually have, right? No reboot ever without peers 🤷🏼♂️
I will discuss this with the team, and continue investigation tomorrow, and hopefully provide a fix asap...
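To make the mechanics above concrete, here is a minimal, generic controller-runtime sketch (the Runnable is a stand-in, not the project's actual watchdog code) of why cancelling the manager context stops the watchdog feeds:

```go
package main

import (
	"context"
	"fmt"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// A watchdog-like Runnable: it keeps "feeding" until its context ends.
	if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		ticker := time.NewTicker(time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				// Manager shutdown (a test's cancel()) stops the feeds, so an
				// assertion like "no new feed found" passes even though no
				// remediation decision was ever made.
				fmt.Println("manager stopped: watchdog no longer fed")
				return nil
			case <-ticker.C:
				fmt.Println("feeding watchdog")
			}
		}
	})); err != nil {
		panic(err)
	}

	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(3 * time.Second)
		cancel() // stops the manager and all of its runnables, including the watchdog
	}()

	if err := mgr.Start(ctx); err != nil {
		panic(err)
	}
}
```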
I really struggled trying to make another test based on the original because I couldn't understand what it was intended to do precisely.
I spent quite a few hours walking through what it was doing and trying to infer the intent; like I said, I updated the test to something that appeared to be working and that validates an actually useful behavior.
If it would help, I can post the branch to my private repo and you can take a look.
I was able to get it to the point where the watchdog was reset at the beginning of each test. I had to add a reset method and a few other things to accomplish that, but I ran into some other problems with the k8s client and had to abandon ship for the moment due to time constraints and a real desire to get the core change pushed in some way.
When I'm back at my corporate laptop I'll give you a bit more detail.
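For what it's worth, the reset approach described above could look roughly like this; a hypothetical fake watchdog for test isolation, not the project's actual implementation:

```go
package testutil

import (
	"sync"
	"time"
)

// fakeWatchdog is a hypothetical test double. Reset puts it back into a
// known, armed state so each test starts from a clean baseline instead of
// inheriting an already-triggered watchdog from a previous test.
type fakeWatchdog struct {
	mu        sync.Mutex
	armed     bool
	triggered bool
	lastFeed  time.Time
}

func newFakeWatchdog() *fakeWatchdog {
	return &fakeWatchdog{armed: true, lastFeed: time.Now()}
}

// Feed records a keep-alive; tests can inspect lastFeed to detect disarming.
func (w *fakeWatchdog) Feed() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.lastFeed = time.Now()
}

// Trigger simulates the watchdog firing (the point at which a real node reboots).
func (w *fakeWatchdog) Trigger() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.armed = false
	w.triggered = true
}

// IsTriggered reports whether the watchdog has fired since the last Reset.
func (w *fakeWatchdog) IsTriggered() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.triggered
}

// Reset re-arms the fake and clears any triggered state between tests.
func (w *fakeWatchdog) Reset() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.armed = true
	w.triggered = false
	w.lastFeed = time.Now()
}
```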
When I remove the cancel() calls, the test fails
but only when it's focused... 🤔