No Control Plane machines came into existence.
Which jobs are flaking?
- periodic-cluster-api-e2e-main
- periodic-cluster-api-e2e-mink8s-main
- periodic-cluster-api-e2e-dualstack-and-ipv6-release-1-6
- periodic-cluster-api-e2e-release-1-4
Which tests are flaking?
- When following the Cluster API quick-start with Ignition Should create a workload cluster
- When upgrading a workload cluster using ClusterClass with a HA control plane [ClusterClass] Should create and upgrade a workload cluster and eventually run kubetest
- When testing MachineDeployment scale out/in Should successfully scale a MachineDeployment up and down upon changes to the MachineDeployment replica count
- When testing clusterctl upgrades using ClusterClass (v1.5=>current) [ClusterClass] Should create a management cluster and then upgrade all the providers
- When testing ClusterClass changes [ClusterClass] Should successfully rollout the managed topology upon changes to the ClusterClass
Since when has it been flaking?
Minor flakes with this error have been happening for a long time.
Testgrid link
https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main
Reason for failure (if possible)
No response
Anything else we need to know?
No response
Label(s) to be applied
/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.
/triage accepted
Thx for reporting!
Would be good to add a link either to a specific failed job or to k8s-triage filtered down on this failure, just to make it easier to find a failed job.
/priority important-soon
I did hit this issue in a local setup. However, it's quite hard to triage, because the machine already got replaced by a new one (I guess because of MHC doing its thing) and the cluster then started successfully.
I still have the setup around if there are ideas on what information to look for.
I was able to hit it again and triage a bit.
It turns out that the node itself came up fine, except for the parts which try to reach the load-balanced control-plane endpoint.
TLDR: The haproxy load balancer did not forward traffic to the control plane node.
My current theory is:
- CAPD wrote the new haproxy config file to the lb container, which includes the first control plane node as backend
  - I can confirm that this one was correct in my setup by reading the file back out of the container (`docker cp`)
- CAPD did signal haproxy to reload the config (by sending SIGHUP)
- Afterwards the CAPD load balancer (haproxy) still did not route requests to the running node.
I was able to "fix" the issue in this case by again sending SIGHUP to haproxy: docker kill -s SIGHUP <container>
Afterwards haproxy did route requests to the running node.
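For illustration, the same workaround could be scripted against the Docker API; a minimal sketch using the Docker Go SDK (the container name is a placeholder, and CAPD itself goes through its own container-runtime abstraction rather than this client):

```go
// Minimal sketch of the manual workaround above, done via the Docker Go SDK
// instead of the docker CLI. The container name is a hypothetical placeholder;
// CAPD itself uses its own container-runtime abstraction, not this client.
package main

import (
	"context"
	"log"

	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()

	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatalf("creating docker client: %v", err)
	}
	defer cli.Close()

	// Equivalent of `docker kill -s SIGHUP <container>`: ask haproxy inside
	// the load balancer container to re-read its configuration.
	const lbContainer = "my-cluster-lb" // placeholder name
	if err := cli.ContainerKill(ctx, lbContainer, "SIGHUP"); err != nil {
		log.Fatalf("sending SIGHUP to %s: %v", lbContainer, err)
	}
	log.Printf("sent SIGHUP to %s; haproxy should reload its config", lbContainer)
}
```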
I'm currently testing the following fix locally: reading back and comparing the config file in CAPD after writing it and before reloading haproxy:
- #10453
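Roughly, the idea looks like the sketch below. This is not the actual contents of the PR; the `configRuntime` interface and its methods are hypothetical stand-ins for CAPD's container-runtime calls:

```go
// Rough sketch of the idea (not the actual contents of #10453): after writing
// the haproxy config into the lb container, read it back and only signal a
// reload once the content on disk matches what was written.
package loadbalancer

import (
	"context"
	"fmt"
)

// configRuntime is a hypothetical stand-in for CAPD's container-runtime
// abstraction: write/read a file inside the lb container and signal haproxy.
type configRuntime interface {
	WriteConfigFile(ctx context.Context, lbContainer, content string) error
	ReadConfigFile(ctx context.Context, lbContainer string) (string, error)
	SignalReload(ctx context.Context, lbContainer string) error // e.g. SIGHUP
}

// reconcileConfig writes the desired haproxy config, verifies it landed on
// disk, and only then asks haproxy to reload.
func reconcileConfig(ctx context.Context, rt configRuntime, lbContainer, desired string) error {
	if err := rt.WriteConfigFile(ctx, lbContainer, desired); err != nil {
		return fmt.Errorf("writing haproxy config: %w", err)
	}

	// Never SIGHUP haproxy against a stale or partially written config.
	current, err := rt.ReadConfigFile(ctx, lbContainer)
	if err != nil {
		return fmt.Errorf("reading back haproxy config: %w", err)
	}
	if current != desired {
		return fmt.Errorf("haproxy config on disk does not match desired config; retry on next reconcile")
	}

	return rt.SignalReload(ctx, lbContainer)
}
```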
Test setup:
GINKGO_FOCUS="PR-Blocking"
GINKGO_SKIP="\[Conformance\]"
So it only runs a single test.
I used prowjob on kind to create a kind cluster and pod YAML, which I then modified (adjusted timeouts + GINKGO_FOCUS + requests, probably other things too).
I then ran the loop using ./scripts/test-pj.sh.
All code changes for my setup are here for reference: https://github.com/kubernetes-sigs/cluster-api/commit/dfe9d5e2478dbdc73b59be5533ea59234c8d2a9c
I did some optimisations, like packing all required images into scripts/images.tar and loading them from there instead of building them, plus probably some others, to make it faster and less dependent on the internet so as not to run into rate limiting.
Fixes are merged, let's check next week or so if the error occurs again.
The merged fix did not help.
For reference, I did hit the same issue (CAPD load balancer config not active) as described in this comment on a 0.4 => 1.6 => current upgrade test but with a slightly different log:
STEP: Initializing the workload cluster with older versions of providers @ 04/24/24 08:03:54.723
INFO: clusterctl init --config /logs/artifacts/repository/clusterctl-config.v1.2.yaml --kubeconfig /tmp/e2e-kubeconfig2940028556 --wait-providers --core cluster-api:v0.4.8 --bootstrap kubeadm:v0.4.8 --control-plane kubeadm:v0.4.8 --infrastructure docker:v0.4.8
INFO: Waiting for provider controllers to be running
STEP: Waiting for deployment capd-system/capd-controller-manager to be available @ 04/24/24 08:04:31.768
INFO: Creating log watcher for controller capd-system/capd-controller-manager, pod capd-controller-manager-7cb759f76b-whdwb, container manager
STEP: Waiting for deployment capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager to be available @ 04/24/24 08:04:31.892
INFO: Creating log watcher for controller capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager, pod capi-kubeadm-bootstrap-controller-manager-b67d5f4cb-8kppj, container manager
STEP: Waiting for deployment capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager to be available @ 04/24/24 08:04:31.918
INFO: Creating log watcher for controller capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager, pod capi-kubeadm-control-plane-controller-manager-69846d766d-k8ptj, container manager
STEP: Waiting for deployment capi-system/capi-controller-manager to be available @ 04/24/24 08:04:31.95
INFO: Creating log watcher for controller capi-system/capi-controller-manager, pod capi-controller-manager-7c9ccb586-5mkbx, container manager
STEP: THE MANAGEMENT CLUSTER WITH THE OLDER VERSION OF PROVIDERS IS UP&RUNNING! @ 04/24/24 08:04:32.176
STEP: Creating a namespace for hosting the clusterctl-upgrade test workload cluster @ 04/24/24 08:04:32.177
INFO: Creating namespace clusterctl-upgrade
INFO: Creating event watcher for namespace "clusterctl-upgrade"
STEP: Creating a test workload cluster @ 04/24/24 08:04:32.193
INFO: Creating the workload cluster with name "clusterctl-upgrade-o3zf09" using the "(default)" template (Kubernetes v1.23.17, 1 control-plane machines, 1 worker machines)
INFO: Getting the cluster template yaml
INFO: clusterctl config cluster clusterctl-upgrade-o3zf09 --infrastructure docker --kubernetes-version v1.23.17 --control-plane-machine-count 1 --worker-machine-count 1 --flavor (default)
INFO: Applying the cluster template yaml to the cluster
STEP: Waiting for the machines to exist @ 04/24/24 08:04:44.941
[FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953
STEP: Dumping logs from the "clusterctl-upgrade-hs16jr" workload cluster @ 04/24/24 08:09:44.953
[FAILED] in [AfterEach] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_proxy.go:311 @ 04/24/24 08:12:44.955
<< Timeline
[FAILED] Timed out after 300.001s.
Timed out waiting for all Machines to exist
Expected
<int64>: 0
to equal
<int64>: 2
In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 @ 04/24/24 08:09:44.953
Full Stack Trace
sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func2()
/home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:456 +0x28cd
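For context, the failing wait is essentially counting Machine objects for the workload cluster and expecting 2 (1 control plane + 1 worker). A rough, illustrative sketch of such a check, assuming a controller-runtime client and Gomega (not the test's actual helper code):

```go
// Illustrative only: roughly what "waiting for all Machines to exist" boils
// down to. The real test uses the e2e framework's helpers; the client,
// namespace and timeouts here are assumptions.
package sketch

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func waitForMachineCount(ctx context.Context, c client.Client, namespace string, expected int64) {
	Eventually(func() (int64, error) {
		machines := &clusterv1.MachineList{}
		if err := c.List(ctx, machines, client.InNamespace(namespace)); err != nil {
			return 0, err
		}
		return int64(len(machines.Items)), nil
	}, 5*time.Minute, 10*time.Second).Should(Equal(expected), "Timed out waiting for all Machines to exist")
}
```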
I'll investigate more on this issue. /assign
I just checked this flake again and it seems it now occurs only in 1.5, and most probably it is also related to CAPD.
Considering this, I'm going to close for now. /close
@fabriziopandini: Closing this issue.
In response to this:
I just checked this flake again and it seems it now occurs only in 1.5, and most probably it is also related to CAPD.
Considering this, I'm going to close for now. /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.