support-bundle-kit Bundle generated without `yamls` directory

Support bundle doesn't have the yamls directory. On https://github.com/harvester/harvester/issues/8224, 3 out of 6 bundles collected so far don't have the yamls directory. For all of these bundles, the contents of the bundleGenerationError.log file are identical:

Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch cluster resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
BUG: Support bundle: cannot get log for pod etcd-<node-name> container etcd: previous terminated container "etcd" in pod "etcd-<node-name>" not found
BUG: Support bundle: cannot get log for pod etcd-<node-name> container etcd: previous terminated container "etcd" in pod "etcd-<node-name>" not found
BUG: Support bundle: cannot get log for pod etcd-<node-name> container etcd: previous terminated container "etcd" in pod "etcd-<node-name>" not found
BUG: Support bundle: cannot get log for pod kube-apiserver-<node-name> container kube-apiserver: previous terminated container "kube-apiserver" in pod "kube-apiserver-<node-name>" not found
BUG: Support bundle: cannot get log for pod kube-apiserver-<node-name> container kube-apiserver: previous terminated container "kube-apiserver" in pod "kube-apiserver-<node-name>" not found
BUG: Support bundle: cannot get log for pod kube-apiserver-<node-name> container kube-apiserver: previous terminated container "kube-apiserver" in pod "kube-apiserver-<node-name>" not found
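
Since most of these lines are repeats, collapsing them makes the distinct failures easier to spot. A minimal sketch using standard tools (`summarize_errors` is a hypothetical helper name, not something shipped with support-bundle-kit):

```shell
# Collapse repeated lines of an error log into "count message" rows,
# most frequent first. Reads the file given as $1, or stdin when no
# argument is passed. summarize_errors is a hypothetical helper name.
summarize_errors() {
  if [ "$#" -gt 0 ]; then
    sort -- "$1"
  else
    sort
  fi | uniq -c | sort -rn
}

# Example: summarize_errors bundleGenerationError.log
```

For the (redacted) log above this would leave four rows: the namespaced-resources error, the cluster-resources error, and the two "previous terminated container" messages. With real node names in place of `<node-name>` there may be more rows.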

Looking at the bundle's logs/harvester-system/supportbundle-manager-bundle-xxxxx-xxxxxxxx-xxxxx/manager.log isn't very helpful either, as it contains the same messages as above in a different format. For example, the following line in bundleGenerationError.log:

Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1

shows up in the pod log as:

2025-05-16T06:15:17.543676322Z time="2025-05-16T06:15:17Z" level=error msg="Unable to fetch namespaced resources" error="unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1"
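
To confirm that the two files carry the same information, the pod-log line can be folded back into the bundleGenerationError.log form. A rough sketch, assuming every line has `msg="..."` immediately followed by `error="..."` as in the sample above (`logrus_to_plain` is a hypothetical name, and values containing escaped quotes, such as the etcd messages, are not handled):

```shell
# Rewrite lines of the form  ... msg="M" error="E"  as  "M: E".
# Lines that do not match the pattern are dropped (sed -n ... p).
# Reads the files given as arguments, or stdin when none are passed.
# logrus_to_plain is a hypothetical helper name.
logrus_to_plain() {
  sed -n 's/.*msg="\([^"]*\)" error="\([^"]*\)".*/\1: \2/p' "$@"
}

# Example: logrus_to_plain manager.log | sort | uniq -c
```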

I looked at #99 before opening this issue, but that seems to be about logging errors into the bundleGenerationError.log file, while this issue is about the cause of the missing yamls directory being unclear from the logs.

May 19 '25 08:05 dharmit

I think the stale GroupVersion discovery error normally happens when the associated aggregated api server is down. On Harvester, the v1.ext.cattle.io api group version is associated with the cattle-system/imperative-api-extension service, which is backed by the rancher deployment:

$ k get apiservices v1.ext.cattle.io
NAME               SERVICE                                  AVAILABLE   AGE
v1.ext.cattle.io   cattle-system/imperative-api-extension   True        13d

$ k -n cattle-system get svc/imperative-api-extension -oyaml | yq .spec.selector
app: rancher
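
To run the same check on another cluster without reading the table output by eye, something like the following could work. It is only a sketch: `apiservice_status` is a hypothetical helper, it assumes `kubectl` is on the PATH and pointed at the cluster in question, and it treats any `kubectl` failure as "missing or unreachable" rather than telling NotFound apart from connection errors.

```shell
# Print whether an APIService exists and, if so, the status of its
# "Available" condition. apiservice_status is a hypothetical helper name.
apiservice_status() {
  name=$1
  if ! status=$(kubectl get apiservice "$name" \
      -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null); then
    # kubectl exited non-zero: the object is absent or the API is unreachable.
    echo "$name: missing or unreachable"
    return 1
  fi
  # An empty result means the object exists but has no Available condition yet.
  echo "$name: Available=${status:-Unknown}"
}

# Example: apiservice_status v1.ext.cattle.io
```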

FWIW, the SB from #8224 indicated that the cattle-system/rancher service has been down for a while:

$ less prometheus-alerts.json
# <snip>
{
  "activeAt": "2025-05-01T18:46:04Z",
  "Annotations": {
    "description": "100% of the rancher/rancher targets in cattle-system namespace are down.",
    "runbook_url": "https://runbooks.prometheus-operator.dev/runbooks/general/targetdown",
    "summary": "One or more targets are unreachable."
  },
  "Labels": {
    "alertname": "TargetDown",
    "job": "rancher",
    "namespace": "cattle-system",
    "service": "rancher",
    "severity": "warning"
  },
  "State": "firing",
  "Value": "1e+02"
},

May 21 '25 23:05 ihcsim

> aggregated api server

TIL. I hadn't heard of this before. Thanks!

> FWIW, the SB from #8224 indicated that the cattle-system/rancher service has been down for a while

But one of the SBs does have the yamls directory, in spite of the same alert about 100% of rancher/rancher targets being down. How could that SB have the yamls directory if the Rancher outage is the reason the YAMLs aren't getting collected? 🤔

And, weirdly, in the latest SB that I was able to load, I don't see the APIService for that aggregated api server at all, even though the service and the rancher pods are there:

$ k get apiservices --no-headers| wc -l
69

$ k get apiservices v1.ext.cattle.io
Error from server (NotFound): apiservices.apiregistration.k8s.io "v1.ext.cattle.io" not found

$ k get svc imperative-api-extension
NAME                       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
imperative-api-extension   ClusterIP   10.53.5.168   <none>        6666/TCP   21d

$ k get pods -l app=rancher
NAME                       READY   STATUS    RESTARTS   AGE
rancher-6f888f8789-27jw9   1/1     Running   0          15d
rancher-6f888f8789-lkp5t   1/1     Running   0          14d
rancher-6f888f8789-q9jqq   1/1     Running   0          15d

# in spite of things looking OK, the prometheus alert is firing
$ cat prometheus-alerts.json | jq ' .[] | select(.Value=="1e+02")'
{
  "activeAt": "2025-05-01T18:46:04Z",
  "Annotations": {
    "description": "100% of the rancher/rancher targets in cattle-system namespace are down.",
    "runbook_url": "https://runbooks.prometheus-operator.dev/runbooks/general/targetdown",
    "summary": "One or more targets are unreachable."
  },
  "Labels": {
    "alertname": "TargetDown",
    "job": "rancher",
    "namespace": "cattle-system",
    "service": "rancher",
    "severity": "warning"
  },
  "State": "firing",
  "Value": "1e+02"
}

Is this behaviour expected?
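
As a side note, matching on `.Value == "1e+02"` depends on how the value happens to be serialized. A slightly more robust sketch filters on the alert's state and name instead, assuming the file is a JSON array whose entries have the `State`, `Labels` and `activeAt` fields shown above (`firing_targetdown` is a hypothetical helper name):

```shell
# List firing TargetDown alerts as "namespace/job firing since <time>".
# Reads the files given as arguments, or stdin when none are passed.
# firing_targetdown is a hypothetical helper name.
firing_targetdown() {
  jq -r '.[]
    | select(.State == "firing" and .Labels.alertname == "TargetDown")
    | "\(.Labels.namespace)/\(.Labels.job) firing since \(.activeAt)"' "$@"
}

# Example: firing_targetdown prometheus-alerts.json
```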

May 23 '25 09:05 dharmit

I think the missing apiservices/v1.ext.cattle.io would explain the api discovery error:

Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1

The next step is to determine why it's missing. Did this API exist in 1.4.2? Did the Rancher version change between 1.4.2 and 1.5.0, causing this APIService to be deleted and then not recreated during an upgrade? It may also be worth checking for similar known Rancher issues.

I don't know if this caused the upgrade to fail, or the other way round, where the upgrade failure caused this to happen.

May 23 '25 16:05 ihcsim