
Cluster API quick-start with dualstack is flaky

Open killianmuldoon opened this issue 2 years ago • 14 comments

Which jobs are flaking?

capi-e2e-dualstack-and-ipv6-main

Which tests are flaking?

Failing test cases:

  • https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-dualstack-and-ipv6-main/1666298385843359744
  • https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-dualstack-and-ipv6-main/1666207033466032128

Since when has it been flaking?

Since the dualstack tests were merged.

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-dualstack-and-ipv6-main

Reason for failure (if possible)

Not clear at this point, but there are some facts we can use to begin debugging:

  • It occurs for both the IPv4 and IPv6 primary variants of the test
  • It fails during the should create a single stack service with cluster ip from primary service range test. The error message is: service dualstack-6332/defaultclusterip expected family IPv4 at index[0] got IPv6 or service dualstack-5879/defaultclusterip expected family IPv6 at index[0] got IPv4 depending on the test variant.
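The assertion behind that error comes from the upstream [sig-network] dual-stack conformance test: a single-stack Service created without explicit families must get the cluster's primary IP family at index 0 of spec.ipFamilies. As a minimal sketch (the names checkPrimaryFamily and IPFamily here are illustrative, not the actual test code):

```go
package main

import "fmt"

// IPFamily mirrors the values used in a Service's spec.ipFamilies.
type IPFamily string

const (
	IPv4 IPFamily = "IPv4"
	IPv6 IPFamily = "IPv6"
)

// checkPrimaryFamily models the conformance assertion: the first
// assigned family must match the cluster's primary service range.
// This is a simplified sketch of the check, not the real test code.
func checkPrimaryFamily(assigned []IPFamily, primary IPFamily) error {
	if len(assigned) == 0 {
		return fmt.Errorf("no ip families assigned")
	}
	if assigned[0] != primary {
		return fmt.Errorf("expected family %s at index[0] got %s", primary, assigned[0])
	}
	return nil
}

func main() {
	// Passing case: IPv4-primary cluster assigns IPv4 first.
	fmt.Println(checkPrimaryFamily([]IPFamily{IPv4, IPv6}, IPv4))
	// Flake seen in CI: IPv6 lands first on an IPv4-primary cluster.
	fmt.Println(checkPrimaryFamily([]IPFamily{IPv6}, IPv4))
	// prints: <nil>, then "expected family IPv4 at index[0] got IPv6"
}
```

The flake message in both variants matches this shape, which suggests the apiserver occasionally assigns the non-primary family first rather than the test itself being wrong.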

Anything else we need to know?

No response

Label(s) to be applied

/kind flake

One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

killianmuldoon avatar Jun 07 '23 14:06 killianmuldoon

/triage accepted

killianmuldoon avatar Jun 07 '23 14:06 killianmuldoon

/help

killianmuldoon avatar Jul 12 '23 11:07 killianmuldoon

@killianmuldoon: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jul 12 '23 11:07 k8s-ci-robot

/assign @nawazkh

nawazkh avatar Aug 14 '23 15:08 nawazkh

This has been fixed with https://github.com/kubernetes-sigs/cluster-api/pull/9252 and the CI signal for capi-e2e-dualstack-and-ipv6-main looks green. Thank you @chrischdi for opening the PR and fixing the issue so quickly!

Shall we close out this issue?

nawazkh avatar Aug 23 '23 16:08 nawazkh

#9252 fixed https://github.com/kubernetes-sigs/cluster-api/issues/9240 which was about the failing tests. This is a pre-existing issue about the flakiness of the dualstack tests.

They're still flaky in the same way as far as I can tell, though it's been masked a bit by the failure connected with the v1.28.0 conformance upgrade.

killianmuldoon avatar Aug 23 '23 16:08 killianmuldoon

Currently, in dual-stack I only see this flake: https://storage.googleapis.com/k8s-triage/index.html?job=.-cluster-api-e2e-dualstack-.&xjob=.-provider-.#3ad7d5d4d855de76fe3b

with this error:

Expected success, but got an error:
    <*errors.withStack | 0xc002954720>: 
    Unable to run conformance tests: error container run failed with exit code 1
    {
        error: <*errors.withMessage | 0xc0012ce180>{
            cause: <*errors.errorString | 0xc0012f23c0>{
                s: "error container run failed with exit code 1",
            },
            msg: "Unable to run conformance tests",
        },
        stack: [0x1f72fbc, 0x202fea5, 0x201fba6, 0x1f68eed, 0x1f67874, 0x201fb13, 0x84e3fb, 0x8629b8, 0x4725a1],
    }

This is a kubetest that is failing here: https://github.com/kubernetes-sigs/cluster-api/blob/8003f3ff6179ca1f26009e2d2b0754bcf14cb044/test/e2e/quick_start_test.go#L169C5-L169C5

Not really sure what the root cause could be; a few things I think we can check:

  • We have a pinned conformance image; the pinned image might contain a flaky test. We can check the image and try changing it to see if that resolves the issue.
  • We can review how we are configuring the test; we might need to change something in the configuration.

We need to either update this issue or create a new one for the "conformance tests" flake, because I believe this issue is tracking a different flake that is no longer happening. @killianmuldoon please confirm.

adilGhaffarDev avatar Nov 15 '23 21:11 adilGhaffarDev

Maybe we could collect the logs from the conformance container when hitting the issue to further diagnose this.

More persistent link: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-13&job=.-cluster-api-e2e-dualstack-.&xjob=.-provider-.#3ad7d5d4d855de76fe3b

chrischdi avatar Nov 20 '23 13:11 chrischdi

I would also like to add that there are two other flakes in dual stack; they happen very rarely:

  • Timed out waiting for 1 nodes to be created for MachineDeployment : https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-dualstack-and-ipv6-main/1725998755204829184
  • Timed out waiting for 1 ready replicas for MachinePool : https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-dualstack-and-ipv6-main/1726633695374217216

Conformance flake is the one that happens most of the time.

adilGhaffarDev avatar Nov 21 '23 10:11 adilGhaffarDev

Not working on this actively, so unassigning myself now. But please feel free to pull me in @adilGhaffarDev when debugging. /unassign

nawazkh avatar Jan 24 '24 17:01 nawazkh

/assign

hackeramitkumar avatar Feb 03 '24 21:02 hackeramitkumar

Maybe we could collect the logs from the conformance container when hitting the issue to further diagnose this.

We are already collecting them. This error is for ipv4 primary ("When following the Cluster API quick-start with dualstack and ipv4 primary [IPv6] Should create a workload cluster"):

   • [FAILED] [15.394 seconds]
  [sig-network] [Feature:IPv6DualStack] [It] should create a single stack service with cluster ip from primary service range
  test/e2e/network/dual_stack.go:204
  
    [FAILED] service dualstack-390/defaultclusterip expected family IPv4 at index[0] got IPv6
    In [It] at: test/e2e/network/dual_stack.go:704 @ 02/23/24 08:12:31.746

In the case of ipv6 primary ("When following the Cluster API quick-start with dualstack and ipv6 primary [IPv6] Should create a workload cluster"), we see this:

   • [FAILED] [18.334 seconds]
  [sig-network] [Feature:IPv6DualStack] [It] should create a single stack service with cluster ip from primary service range
  test/e2e/network/dual_stack.go:204
  
    [FAILED] service dualstack-5886/defaultclusterip expected family IPv6 at index[0] got IPv4
    In [It] at: test/e2e/network/dual_stack.go:704 @ 02/16/24 00:27:29.771

adilGhaffarDev avatar Feb 23 '24 09:02 adilGhaffarDev

@killianmuldoon it seems that the flaky dualstack tests are general and not related to MachinePools? Ref:

  • https://github.com/kubernetes-sigs/cluster-api/pull/9477

@willie-yao is tracking restoring those tests as part of graduating MachinePool from experimental.

What should the path forward be for MachinePools + dualstack tests given all of this context?

jackfrancis avatar Mar 19 '24 18:03 jackfrancis

What should the path forward be for MachinePools + dualstack tests given all of this context?

I think the MachinePool versions of these tests are much flakier than the current test. I think we should figure out the issues in the MachinePool PR while continuing to try to triage and fix this separate underlying flake.

killianmuldoon avatar Mar 20 '24 08:03 killianmuldoon

/priority important-soon

fabriziopandini avatar Apr 11 '24 18:04 fabriziopandini

This seems to be good since #10424 merged 🎉


https://storage.googleapis.com/k8s-triage/index.html?date=2024-04-15&job=.-cluster-api-.&test=dualstack&xjob=.-provider-.

/close

chrischdi avatar Apr 15 '24 12:04 chrischdi

@chrischdi: Closing this issue.

In response to this:

This seems to be good since #10424 merged 🎉


https://storage.googleapis.com/k8s-triage/index.html?date=2024-04-15&job=.-cluster-api-.&test=dualstack&xjob=.-provider-.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Apr 15 '24 12:04 k8s-ci-robot

Great work on fixing this!

killianmuldoon avatar Apr 15 '24 12:04 killianmuldoon

Yikes, fun one :)

@willie-yao you should be able to re-introduce MachinePools into tests and we'll be able to confirm right away that there's no MachinePool regression here.

jackfrancis avatar Apr 15 '24 15:04 jackfrancis

Good catch!

sbueringer avatar Apr 15 '24 16:04 sbueringer

Note: This got cherry-picked back to v1.5, v1.6 and v1.7.

chrischdi avatar Apr 30 '24 12:04 chrischdi