cluster-api icon indicating copy to clipboard operation
cluster-api copied to clipboard

🌱 WIP: Add MP back to dualstack E2E test

Open willie-yao opened this issue 1 year ago • 20 comments

What this PR does / why we need it: This PR adds MachinePools back to the dualstack E2E test and fixes the issue found in #9477

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes part of #10028

/area clusterclass /area e2e-testing

willie-yao avatar Feb 09 '24 22:02 willie-yao

/test pull-cluster-api-e2e-full-dualstack-and-ipv6-main

willie-yao avatar Feb 09 '24 22:02 willie-yao

@willie-yao: The specified target(s) for /test were not found. The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-dualstack-and-ipv6-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-29-1-30-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

In response to this:

/test pull-cluster-api-e2e-full-dualstack-and-ipv6-main

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 09 '24 22:02 k8s-ci-robot

I'm a bit confused why /pull-cluster-api-e2e-dualstack-and-ipv6-main is running the entire E2E test suite, but the most recent failure seems to come from a non-dualstack test. @killianmuldoon is this the same error you observed in #9477? The previous two runs passed, but I will run this test a few more times.

willie-yao avatar Feb 13 '24 00:02 willie-yao

Okay I actually got an error from the ipv6 test on the fourth run with this error: Unable to run conformance tests: error container run failed with exit code 1. Not sure if this is the failure that was observed before.

willie-yao avatar Feb 13 '24 17:02 willie-yao

@killianmuldoon I've retested the dualstack test around 8 times and it only failed once on the dualstack spec (other failures were flakes on other specs). What was the original error that caused the dualstack tests to fail? This is the test run that failed: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api/10135/pull-cluster-api-e2e-dualstack-and-ipv6-main/1757196147123818496

willie-yao avatar Feb 23 '24 18:02 willie-yao

/test pull-cluster-api-e2e-conformance-ci-latest-main

chrischdi avatar Feb 28 '24 18:02 chrischdi

/test pull-cluster-api-e2e-conformance-main

chrischdi avatar Feb 28 '24 18:02 chrischdi

Sorry I've been missing the updated here!

For context the dualstack tests runs the full CAPI tests suite - the same as e2e full - and adds a couple of dualstack conformance tests on top. I can't remember the source of the failures in #9477 unfortunately, and I'm not sure why it's not written down anywhere 😅.

I will say looking at the history for this PR - https://prow.k8s.io/pr-history/?org=kubernetes-sigs&repo=cluster-api&pr=10135 - this looks way too flaky to merge.

If I'm right then there's something in how we're running these tests that means they only fail when we touch the non-dualstack network of the MachinePools in some way.

killianmuldoon avatar Feb 28 '24 18:02 killianmuldoon

@killianmuldoon No problem, thanks for the update! I saw that the dualstack test was a bit flaky on this PR, but the same error did not show up twice and it looks like the conformance tests also passed. I think @fabriziopandini mentioned in the office hours that they did a few fixes to the MPs that may have fixed the issue. I'll continue to run these tests and see if there's anything reoccuring.

willie-yao avatar Feb 28 '24 21:02 willie-yao

AFAIR it's not just the conformance tests that are the problem in this instance - it's also down to which CAPI components end up on which nodes.

One idea is to set up a repeating test - maybe five runs - and see if this change can pass that.

killianmuldoon avatar Feb 28 '24 23:02 killianmuldoon

/test pull-cluster-api-e2e-dualstack-and-ipv6-main /test pull-cluster-api-e2e-conformance-ci-latest-main /test pull-cluster-api-e2e-conformance-main

willie-yao avatar Mar 19 '24 22:03 willie-yao

@willie-yao - I think there are code changes needed on this one. If you'd like to organise something we can try to get to the bottom of it a bit by going through some of the logs etc.

killianmuldoon avatar Mar 20 '24 08:03 killianmuldoon

@killianmuldoon Sure thing, I'd be happy to pair on this. Since the same errors existed without MachinePools, I'm a bit unsure where to look to fix this as I'm not too familiar with the dualstack implementation. Is there somewhere I should look to get a better idea for this?

willie-yao avatar Mar 20 '24 20:03 willie-yao

I think the best place would be to debug through the logs of the failed jobs to understand exactly what's going on when they fial

killianmuldoon avatar Mar 21 '24 08:03 killianmuldoon

@willie-yao could you post some equivalent dualstack test failures that occur on clusters without MachinePools?

jackfrancis avatar Mar 21 '24 14:03 jackfrancis

Just a general comment. I think we probably should get the tests green on main first before introducing another potential source of errors.

Except if we feel confident the new MP part is stable and won't add more flakiness (Otherwise the test becomes harder to fix)

sbueringer avatar Mar 25 '24 06:03 sbueringer

Just a general comment. I think we probably should get the tests green on main first before introducing another potential source of errors.

I definitely agree. I'm seeing the same flakes on this PR as the ones that show up on main. It does seem like adding MPs makes it more flaky, but the reason is not super clear to me.

willie-yao avatar Mar 25 '24 22:03 willie-yao

Update: We decided in office hours that we should fix the flakes in the dualstack test for MachineDeployments before implementing them for MachinePools. Therefore, the dualstack test for MachinePools will be non-blocking for MachinePool graduation.

willie-yao avatar Mar 27 '24 20:03 willie-yao

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 09 '24 03:04 k8s-ci-robot

/test pull-cluster-api-e2e-dualstack-and-ipv6-main /test pull-cluster-api-e2e-conformance-ci-latest-main /test pull-cluster-api-e2e-conformance-main

willie-yao avatar Apr 15 '24 16:04 willie-yao

/test pull-cluster-api-e2e-dualstack-and-ipv6-main /test pull-cluster-api-e2e-conformance-ci-latest-main /test pull-cluster-api-e2e-conformance-main

willie-yao avatar Apr 15 '24 22:04 willie-yao

/test pull-cluster-api-e2e-dualstack-and-ipv6-main /test pull-cluster-api-e2e-conformance-ci-latest-main /test pull-cluster-api-e2e-conformance-main

sbueringer avatar Apr 16 '24 05:04 sbueringer

Different/unrelated flake: #10356

No Control Plane machines came into existence.

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api/10135/pull-cluster-api-e2e-dualstack-and-ipv6-main/1780103397202989056

/test pull-cluster-api-e2e-dualstack-and-ipv6-main /test pull-cluster-api-e2e-conformance-ci-latest-main /test pull-cluster-api-e2e-conformance-main

chrischdi avatar Apr 16 '24 07:04 chrischdi

I would suggest let's merge this PR. And once this was also stable for a bit (maybe a week?). Let's continue with merging dualstack & e2e-main?

(cc @killianmuldoon)

I think we should be mostly good, just want to do it incrementally and another week shouldn't hurt.

sbueringer avatar Apr 16 '24 08:04 sbueringer

sounds good to me

killianmuldoon avatar Apr 16 '24 08:04 killianmuldoon

(Or depending on how confident we are in this PR, maybe we'll do it the other way around)

sbueringer avatar Apr 16 '24 10:04 sbueringer

This might be a MachinePool-specific issue: When testing Cluster API working on self-hosted clusters using ClusterClass [ClusterClass] Should pivot the bootstrap cluster to a self-hosted cluster

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api/10135/pull-cluster-api-e2e-dualstack-and-ipv6-main/1780141833020510208

sbueringer avatar Apr 16 '24 10:04 sbueringer

@sbueringer is your comment about pivoting to a self-hosted cluster feedback for a potential chante to this PR?

jackfrancis avatar Apr 16 '24 19:04 jackfrancis

@sbueringer is your comment about pivoting to a self-hosted cluster feedback for a potential chante to this PR?

I was just quoting the name of the failed test 😃

sbueringer avatar Apr 17 '24 05:04 sbueringer

Ah wait, this PR doesn't influence that test. Or at least not directly

sbueringer avatar Apr 17 '24 05:04 sbueringer