cluster-api
[E2E Framework] Improve E2E Framework to Collect Debug Artifacts on clusterctl init Failures
What would you like to be added (User Story)?
As a developer, I would like to access the debug info in artifacts on clusterctl init failures.
Detailed Description
Currently, when running E2E tests, if clusterctl init fails (e.g., because CAPI components do not reach a Ready status, as shown below), the test artifacts do not contain enough information to debug the failure. This lack of context makes it difficult to identify root causes, especially when we cannot access the infrastructure.
We need the E2E framework to collect the relevant diagnostic data when clusterctl init or early cluster bootstrap steps fail.
INFO: The kubeconfig file for the kind cluster is /tmp/e2e-kind3738542185
STEP: Initialize bootstrap cluster @ 05/07/25 06:16:33.625
INFO: clusterctl init --config /tmp/tmp.I8vcxeUaCN/repository/clusterctl-config.yaml --kubeconfig /tmp/e2e-kind3738542185 --wait-providers --core cluster-api --bootstrap kubeadm --control-plane kubeadm --infrastructure vsphere
[FAILED] in [SynchronizedBeforeSuite] - /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/clusterctl/client.go:90 @ 05/07/25 06:22:18.911
[SynchronizedBeforeSuite] [FAILED] [396.063 seconds]
[SynchronizedBeforeSuite]
/home/prow/go/src/k8s.io/cloud-provider-vsphere/test/e2e/e2e_suite_test.go:143
[FAILED] failed to run clusterctl init
Unexpected error:
<*errors.withStack | 0xc0000108e8>:
deployment "capi-controller-manager" is not ready after 5m0s: context deadline exceeded
{
error: <*errors.withMessage | 0xc0022e6680>{
cause: <context.deadlineExceededError>{},
msg: "deployment \"capi-controller-manager\" is not ready after 5m0s",
},
stack: [0x24b8070, 0x24b7da7, 0x24b74b4, 0x24fa1ce, 0x259ce8a, 0x259f64b, 0x264e3f3, 0x19aa3a2, 0x19bae16, 0x264d479, 0x5029c6, 0x501ad9, 0x199eede, 0x19af1ce, 0x19b29fb, 0x4841a1],
}
occurred
In [SynchronizedBeforeSuite] at: /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/clusterctl/client.go:90 @ 05/07/25 06:22:18.911
Anything else you would like to add?
No response
Label(s) to be applied
/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.
/triage accepted
/priority backlog
/help
/assign
Hi @chrischdi, I went through the E2E framework code. From what I understand, we print the logs to the console, capture that output (stdout or stderr), and write it to a log file: https://github.com/kubernetes-sigs/cluster-api/blob/main/test/framework/clusterctl/client.go#L94 Could you please provide more information on which additional debug logs we want to print to the console? Thanks
Sorry for the late reply.
Not 100% sure what the best fit would be.
We should double check if:
- in this case we already dump the pod.yaml
- in this case we already grab /var/log/pods from the nodes of the management cluster
Otherwise, I'd instead try to figure out, the next time this occurs, what information would have been necessary.
Hi @chrischdi
- in this case we already dump the pod.yaml
- in this case we already grab /var/log/pods from the nodes of the management cluster
I will first verify these two things in the code and get back to you.
Thank you