
No Control Plane machines came into existence.

adilGhaffarDev opened this issue 1 year ago • 10 comments

Which jobs are flaking?

  • periodic-cluster-api-e2e-main
  • periodic-cluster-api-e2e-mink8s-main
  • periodic-cluster-api-e2e-dualstack-and-ipv6-release-1-6
  • periodic-cluster-api-e2e-release-1-4

Which tests are flaking?

  • When following the Cluster API quick-start with Ignition Should create a workload cluster
  • When upgrading a workload cluster using ClusterClass with a HA control plane [ClusterClass] Should create and upgrade a workload cluster and eventually run kubetest
  • When testing MachineDeployment scale out/in Should successfully scale a MachineDeployment up and down upon changes to the MachineDeployment replica count
  • When testing clusterctl upgrades using ClusterClass (v1.5=>current) [ClusterClass] Should create a management cluster and then upgrade all the providers
  • When testing ClusterClass changes [ClusterClass] Should successfully rollout the managed topology upon changes to the ClusterClass

Since when has it been flaking?

Minor flakes with this error have been happening for a long time.

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake

adilGhaffarDev avatar Apr 02 '24 07:04 adilGhaffarDev

/triage accepted

Thx for reporting

sbueringer avatar Apr 03 '24 05:04 sbueringer

Would be good to add a link either to a specific failed job or to k8s-triage filtered down to this failure.

Just to make it easier to find a failed job

sbueringer avatar Apr 03 '24 05:04 sbueringer

/priority important-soon

fabriziopandini avatar Apr 11 '24 16:04 fabriziopandini

k8s-triage link

chrischdi avatar Apr 12 '24 10:04 chrischdi

I did hit this issue in a local setup. However, it's quite hard to triage, because the machine had already been replaced by a new one (I guess because of MHC doing its thing) and the cluster then started successfully.

I still have stuff around if there are ideas to filter information.

chrischdi avatar Apr 16 '24 07:04 chrischdi

I was able to hit it again and triage a bit.

It turns out that the node itself came up, except for the parts that try to reach the load-balanced control-plane endpoint.

TLDR: The haproxy load balancer did not forward traffic to the control plane node.

My current theory is:

  • CAPD wrote the new haproxy config file to the lb container, which includes the first control plane node as a backend
    • I can confirm that this one was correct in my setup by reading the file back from the container with `docker cp`
  • CAPD signaled haproxy to reload the config (by sending SIGHUP)
  • Afterwards, haproxy still did not route requests to the running node.

I was able to "fix" the issue in this case by sending SIGHUP to haproxy once more: `docker kill -s SIGHUP <container>`. Afterwards, haproxy did route requests to the running node.
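For anyone trying to reproduce this, a minimal shell sketch of the check and the workaround; the container name is a placeholder and the config path assumes the standard haproxy image layout:

```sh
# Placeholder; use the actual <cluster-name>-lb container name from `docker ps`.
LB="my-cluster-lb"

# Copy the active config out of the load balancer container to inspect its backends
# (assumes the config lives at the standard haproxy image path).
docker cp "${LB}:/usr/local/etc/haproxy/haproxy.cfg" ./haproxy.cfg.active
grep -A3 "backend" ./haproxy.cfg.active

# Ask haproxy to re-read its config, as described above.
docker kill -s SIGHUP "${LB}"
```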

I'm currently testing the following fix locally: reading back and comparing the config file in CAPD after writing it and before reloading haproxy (a rough sketch of the idea follows the PR link below):

  • #10453
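A rough sketch of that idea in shell terms (not the actual Go change in the PR; the container name and config path are placeholders):

```sh
LB="my-cluster-lb"
CFG="/usr/local/etc/haproxy/haproxy.cfg"

# Write the rendered config into the load balancer container.
docker cp ./haproxy.cfg "${LB}:${CFG}"

# Read it back and only signal haproxy to reload if the file on disk matches what was written.
docker cp "${LB}:${CFG}" ./haproxy.cfg.readback
if cmp -s ./haproxy.cfg ./haproxy.cfg.readback; then
  docker kill -s SIGHUP "${LB}"
else
  echo "config in the lb container does not match what was written" >&2
fi
```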

Test setup:

GINKGO_FOCUS="PR-Blocking"
GINKGO_SKIP="\[Conformance\]"

So it only runs a single test.
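Assuming the repository's usual e2e make target picks these variables up (check the Makefile for the exact names), the invocation looks roughly like:

```sh
GINKGO_FOCUS="PR-Blocking" GINKGO_SKIP="\[Conformance\]" make test-e2e
```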

I used prowjob on kind to create a kind cluster and pod yaml, which I then modified (adjusted timeouts + GINKGO_FOCUS + requests, probably other things too).

I then ran the loop using ./scripts/test-pj.sh.
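The loop itself could be as simple as the following hypothetical sketch (not the actual contents of scripts/test-pj.sh):

```sh
# Re-run the focused test until it fails, keeping the log of each run.
for i in $(seq 1 50); do
  echo "=== run ${i} ==="
  ./scripts/test-pj.sh > "run-${i}.log" 2>&1 || { echo "run ${i} failed"; break; }
done
```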

All code changes for my setup are here for reference: https://github.com/kubernetes-sigs/cluster-api/commit/dfe9d5e2478dbdc73b59be5533ea59234c8d2a9c

I did some optimisations, like packing all required images into scripts/images.tar and loading them from there instead of building them each time, plus probably some others, to make it faster and to rely less on the internet so as not to run into rate limiting.
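The image-caching part could look roughly like this (image names and the kind cluster name are placeholders; adjust to whatever the test setup actually pulls):

```sh
# Save the images needed by the run into a single archive...
docker save -o scripts/images.tar \
  kindest/node:v1.30.0 \
  gcr.io/k8s-staging-cluster-api/capd-manager-amd64:dev

# ...and load them into the kind cluster instead of pulling or building each time.
kind load image-archive scripts/images.tar --name "${KIND_CLUSTER_NAME}"
```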

chrischdi avatar Apr 17 '24 15:04 chrischdi

The fixes are merged; let's check in a week or so whether the error occurs again.

chrischdi avatar Apr 18 '24 09:04 chrischdi

The merged fix did not help.

chrischdi avatar Apr 23 '24 10:04 chrischdi

For reference, I did hit the same issue (CAPD load balancer config not active) as described in this comment, on a 0.4 => 1.6 => current upgrade test, but with a slightly different log:

  STEP: Initializing the workload cluster with older versions of providers @ 04/24/24 08:03:54.723
  INFO: clusterctl init --config /logs/artifacts/repository/clusterctl-config.v1.2.yaml --kubeconfig /tmp/e2e-kubeconfig2940028556 --wait-providers --core cluster-api:v0.4.8 --bootstrap kubeadm:v0.4.8 --control-plane kubeadm:v0.4.8 --infrastructure docker:v0.4.8
  INFO: Waiting for provider controllers to be running
  STEP: Waiting for deployment capd-system/capd-controller-manager to be available @ 04/24/24 08:04:31.768
  INFO: Creating log watcher for controller capd-system/capd-controller-manager, pod capd-controller-manager-7cb759f76b-whdwb, container manager
  STEP: Waiting for deployment capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager to be available @ 04/24/24 08:04:31.892
  INFO: Creating log watcher for controller capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager, pod capi-kubeadm-bootstrap-controller-manager-b67d5f4cb-8kppj, container manager
  STEP: Waiting for deployment capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager to be available @ 04/24/24 08:04:31.918
  INFO: Creating log watcher for controller capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager, pod capi-kubeadm-control-plane-controller-manager-69846d766d-k8ptj, container manager
  STEP: Waiting for deployment capi-system/capi-controller-manager to be available @ 04/24/24 08:04:31.95
  INFO: Creating log watcher for controller capi-system/capi-controller-manager, pod capi-controller-manager-7c9ccb586-5mkbx, container manager
  STEP: THE MANAGEMENT CLUSTER WITH THE OLDER VERSION OF PROVIDERS IS UP&RUNNING! @ 04/24/24 08:04:32.176
  STEP: Creating a namespace for hosting the clusterctl-upgrade test workload cluster @ 04/24/24 08:04:32.177
  INFO: Creating namespace clusterctl-upgrade
  INFO: Creating event watcher for namespace "clusterctl-upgrade"
  STEP: Creating a test workload cluster @ 04/24/24 08:04:32.193
  INFO: Creating the workload cluster with name "clusterctl-upgrade-o3zf09" using the "(default)" template (Kubernetes v1.23.17, 1 control-plane machines, 1 worker machines)
  INFO: Getting the cluster template yaml
  INFO: clusterctl config cluster clusterctl-upgrade-o3zf09 --infrastructure docker --kubernetes-version v1.23.17 --control-plane-machine-count 1 --worker-machine-count 1 --flavor (default)
  INFO: Applying the cluster template yaml to the cluster
  STEP: Waiting for the machines to exist @ 04/24/24 08:04:44.941
  [FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953
  STEP: Dumping logs from the "clusterctl-upgrade-hs16jr" workload cluster @ 04/24/24 08:09:44.953
  [FAILED] in [AfterEach] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_proxy.go:311 @ 04/24/24 08:12:44.955
  << Timeline

  [FAILED] Timed out after 300.001s.
  Timed out waiting for all Machines to exist
  Expected
      <int64>: 0
  to equal
      <int64>: 2
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953

  Full Stack Trace
    sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func2()
        /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 +0x28cd

chrischdi avatar Apr 24 '24 09:04 chrischdi

I'll investigate more on this issue.

/assign

pravarag avatar May 24 '24 05:05 pravarag

I just checked this flake again and it seems it now occurs only on 1.5, and most probably it is also related to CAPD.

Considering this, I'm going to close for now.

/close

fabriziopandini avatar Jul 18 '24 18:07 fabriziopandini

@fabriziopandini: Closing this issue.

In response to this:

I just checked this flake again and it seems it now occurs only on 1.5, and most probably it is also related to CAPD.

Considering this, I'm going to close for now.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 18 '24 18:07 k8s-ci-robot