cluster-api [E2E Framework] Improve E2E Framework to Collect Debug Artifacts on clusterctl init Failures

What would you like to be added (User Story)?

As a developer, I would like debug information to be available in the test artifacts when clusterctl init fails.

Detailed Description

Currently, when clusterctl init fails during an E2E test run (e.g., because CAPI components do not reach a Ready status, as in the log below), the test artifacts do not contain enough information to debug the failure. This makes it difficult to identify the root cause, especially when we cannot access the infrastructure.

We need the E2E framework to collect the relevant diagnostic data when clusterctl init or early cluster bootstrap steps fail.

INFO: The kubeconfig file for the kind cluster is /tmp/e2e-kind3738542185
  STEP: Initialize bootstrap cluster @ 05/07/25 06:16:33.625
  INFO: clusterctl init --config /tmp/tmp.I8vcxeUaCN/repository/clusterctl-config.yaml --kubeconfig /tmp/e2e-kind3738542185 --wait-providers --core cluster-api --bootstrap kubeadm --control-plane kubeadm --infrastructure vsphere
  [FAILED] in [SynchronizedBeforeSuite] - /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/clusterctl/client.go:90 @ 05/07/25 06:22:18.911
[SynchronizedBeforeSuite] [FAILED] [396.063 seconds]
[SynchronizedBeforeSuite] 
/home/prow/go/src/k8s.io/cloud-provider-vsphere/test/e2e/e2e_suite_test.go:143
  [FAILED] failed to run clusterctl init
  Unexpected error:
      <*errors.withStack | 0xc0000108e8>: 
      deployment "capi-controller-manager" is not ready after 5m0s: context deadline exceeded
      {
          error: <*errors.withMessage | 0xc0022e6680>{
              cause: <context.deadlineExceededError>{},
              msg: "deployment \"capi-controller-manager\" is not ready after 5m0s",
          },
          stack: [0x24b8070, 0x24b7da7, 0x24b74b4, 0x24fa1ce, 0x259ce8a, 0x259f64b, 0x264e3f3, 0x19aa3a2, 0x19bae16, 0x264d479, 0x5029c6, 0x501ad9, 0x199eede, 0x19af1ce, 0x19b29fb, 0x4841a1],
      }
  occurred
  In [SynchronizedBeforeSuite] at: /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/clusterctl/client.go:90 @ 05/07/25 06:22:18.911

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

May 07 '25 08:05 zhanggbj

/triage accepted /priority backlog

May 14 '25 13:05 chrischdi

/help

May 14 '25 14:05 chrischdi

/assign

May 20 '25 13:05 arshadd-b

Hi @chrischdi, I went through the E2E framework code. As far as I understand, we print the logs to the console, capture them from stdout or stderr, and write them to a log file: https://github.com/kubernetes-sigs/cluster-api/blob/main/test/framework/clusterctl/client.go#L94 Could you please provide more detail on which additional debug logs we want to print to the console? Thanks

May 23 '25 13:05 arshadd-b

Sorry for the late reply.

Not 100% sure what the best fit would be.

We should double check if:

in this case we already dump the pod.yaml
in this case we already grab /var/log/pods from the nodes of the management cluster

Otherwise, the next time this occurs, I'd instead try to figure out what information would have been necessary.

Jun 17 '25 09:06 chrischdi

Hi @chrischdi

in this case we already dump the pod.yaml
in this case we already grab /var/log/pods from the nodes of the management cluster

I will verify these two points in the code first and get back to you.

Thank you

Jul 01 '25 07:07 arshadd-b

Hi @chrischdi
I can see we are collecting logs for /var/log/pods here

For the other check ("in this case we already dump the pod.yaml"), I can see we are collecting it inside the function DumpAllResourcesAndLogs here

So I think we are already collecting both.

Jul 11 '25 13:07 arshadd-b
