cilium-cli icon indicating copy to clipboard operation
cilium-cli copied to clipboard

test: Add DNS continuity test for Cilium restarts

Open jrajahalme opened this issue 4 years ago • 3 comments

Add support for Cilium restart tests, where some test actions may be allowed to fail during restart when Cilium agent is rebooting. Use this framework for DNS continuity test. This test fails on all current Cilium releases and is hence quarantined. The test can be run with:

Best reviewed commit by commit. Some of the commits fix problems that may directly relate to this PR.

cilium connectivity test --run-quarantined

jrajahalme avatar May 22 '21 02:05 jrajahalme

Hey Jarno,

Saw you demo this yesterday, so had a feeling this was coming. :) (didn't see the draft PR)

Unfortunately, I don't believe this kind of test belongs in the connectivity test suite. I have a few specific reasons for this, but at its core, I don't feel like you've considered the architectural implications of this change. connectivity check should aim to answer the question 'Are Pods in this cluster able to talk to each other within the constraints of the given polic(y)(ies)?' with a high degree of confidence and within a reasonable amount of time.

I don't think this test fits those criteria, as it introduces side effects. This kind of test would be a better fit for a chaos or regression testing suite instead, and can (and should) be run in a much tighter environment that doesn't involve k8s at all. We should aim to implement tests using a minimal amount of external factors and dependencies, and this is something that could easily be run as an integration test in a simple VM. (not an end-to-end test like here)

I also feel like this PR is repeating some of the mistakes that led to the current state of k8sT/, and would love to review this as smaller, more focused PRs. For example, it might be time to add a proper Command type or interface to pass to ExecInPod() instead of having curl() and curlName() that just dumps out a bunch of strings. Now's the time to refactor instead of introducing technical debt.

Some additional thoughts:

  • The RunQuarantined stuff would be another great example use case for a common --ci flag that configures cilium-cli internally to exhibit all behaviours we want on CI. No more extra flags that are just used for CI, please. @nbusseneau
  • cilium-cli is distributed as a tool to the user, so we should consider all of it production code. Running connectivity check on a production cluster should be completely safe and should not be expected to kill anything that might be necessary for the functionality of the cluster at large. If we do decide go down this road, this should be clearly documented and the user should at least be prompted that running it might introduce connectivity interruptions. (!)
  • This approach also clashes with the concurrent testing strategy we have in mind. The cluster should remain functional.

Let me know if you're up for some brainstorming, might be a good opportunity to bring up in #testing or pulling some folks together for a call.

ti-mo avatar Jun 03 '21 13:06 ti-mo

Did a quick read: code looks good but I agree with all of Timo's points, and think this type of test would be more fit for fuzzy / chaos testing rather than be part of Cilium CLI coverage. Happy to join for a brainstorm, I think Maciej would be as well :)

nbusseneau avatar Jun 03 '21 15:06 nbusseneau

No problem having this kind of test in a different CI-only tool. I'll change this to draft until we have some tooling like that and separate the more general fixes included here into separate PRs. Needed this test to validate a Cilium fix, so keeping this here for now.

jrajahalme avatar Jun 03 '21 16:06 jrajahalme