Bundle generated without `yamls` directory
The support bundle doesn't have the `yamls` directory. On https://github.com/harvester/harvester/issues/8224, 3 of the 6 bundles collected so far are missing the `yamls` directory. For all of these bundles, the contents of the bundleGenerationError.log file are identical:
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch cluster resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
BUG: Support bundle: cannot get log for pod etcd-<node-name> container etcd: previous terminated container "etcd" in pod "etcd-<node-name>" not found
BUG: Support bundle: cannot get log for pod etcd-<node-name> container etcd: previous terminated container "etcd" in pod "etcd-<node-name>" not found
BUG: Support bundle: cannot get log for pod etcd-<node-name> container etcd: previous terminated container "etcd" in pod "etcd-<node-name>" not found
BUG: Support bundle: cannot get log for pod kube-apiserver-<node-name> container kube-apiserver: previous terminated container "kube-apiserver" in pod "kube-apiserver-<node-name>" not found
BUG: Support bundle: cannot get log for pod kube-apiserver-<node-name> container kube-apiserver: previous terminated container "kube-apiserver" in pod "kube-apiserver-<node-name>" not found
BUG: Support bundle: cannot get log for pod kube-apiserver-<node-name> container kube-apiserver: previous terminated container "kube-apiserver" in pod "kube-apiserver-<node-name>" not found
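Since the file is mostly repeats, collapsing identical lines makes it easier to see how few distinct errors there are. The snippet below is only a rough sketch: `summarize_errors` is a hypothetical helper name, and it assumes the file holds one message per line.

```python
from collections import Counter


def summarize_errors(lines):
    """Return (message, count) pairs, most frequent first.

    Hypothetical helper: assumes one error message per line.
    Blank lines are ignored.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common()


def format_summary(lines):
    """Render the summary as 'count  message' text, one entry per line."""
    return "\n".join(f"{count:>4}  {msg}" for msg, count in summarize_errors(lines))
```

For example, `print(format_summary(open("bundleGenerationError.log")))` would print each distinct message once, prefixed by how often it occurred.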
Looking at the bundle's logs/harvester-system/supportbundle-manager-bundle-xxxxx-xxxxxxxx-xxxxx/manager.log isn't very helpful either, as it contains the same messages as above, just in a different format. E.g., the following line in bundleGenerationError.log:
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
shows up in the pod log as:
2025-05-16T06:15:17.543676322Z time="2025-05-16T06:15:17Z" level=error msg="Unable to fetch namespaced resources" error="unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1"
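To illustrate that the two formats carry the same information, here is a small sketch that turns a key=value pod log line back into the plain form. `logrus_to_plain` is a hypothetical name, and the parsing assumes the fields look like the line above (`msg="..."` and an optional `error="..."`); it is not how the support bundle manager itself produces either file.

```python
import re

# key=value pairs where the value is either a double-quoted string
# (with backslash escapes) or a bare token without whitespace.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def logrus_to_plain(line):
    """Rebuild '<msg>: <error>' from a key=value formatted log line.

    Hypothetical helper: assumes 'msg' and (optionally) 'error' fields.
    """
    fields = {}
    for key, raw in _PAIR.findall(line):
        if raw.startswith('"'):
            # Drop the surrounding quotes and unescape embedded quotes.
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    msg = fields.get("msg", "")
    err = fields.get("error")
    return f"{msg}: {err}" if err else msg
```

Running every line of manager.log through this and comparing against bundleGenerationError.log would be one way to confirm the pod log adds nothing new.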
I looked at #99 before opening this issue, but that seems to be about logging errors to the bundleGenerationError.log file, while this one is about the cause of the missing yamls being unclear from the logs.
I think the stale GroupVersion discovery error normally happens when the associated aggregated api server is down. On Harvester, the v1.ext.cattle.io api group version is associated with the cattle-system/imperative-api-extension service, which is backed by the rancher deployment:
$ k get apiservices v1.ext.cattle.io
NAME SERVICE AVAILABLE AGE
v1.ext.cattle.io cattle-system/imperative-api-extension True 13d
$ k -n cattle-system get svc/imperative-api-extension -oyaml | yq .spec.selector
app: rancher
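The same check can be done across all aggregated APIs at once. The sketch below assumes the input is the parsed output of `kubectl get apiservices -o json`; `aggregated_api_status` is a hypothetical helper name. It relies on the APIService fields `spec.service` and the `Available` condition in `status.conditions`.

```python
def aggregated_api_status(apiservice_list):
    """Map each aggregated APIService name to (backing service, available).

    Hypothetical helper. `apiservice_list` is assumed to be the dict parsed
    from `kubectl get apiservices -o json`. APIServices without
    spec.service are served by kube-apiserver itself and are skipped.
    """
    result = {}
    for item in apiservice_list.get("items", []):
        service = item.get("spec", {}).get("service")
        if not service:
            continue  # local API group, not an aggregated API server
        conditions = item.get("status", {}).get("conditions", [])
        available = any(
            c.get("type") == "Available" and c.get("status") == "True"
            for c in conditions
        )
        backend = f'{service.get("namespace")}/{service.get("name")}'
        result[item["metadata"]["name"]] = (backend, available)
    return result
```

An entry with `available` set to `False`, or an expected name that is absent from the result altogether, would both be worth a closer look.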
FWIW, the SB from #8224 indicated that the cattle-system/rancher service has been down for a while:
$ less prometheus-alerts.json
# <snip>
  {
    "activeAt": "2025-05-01T18:46:04Z",
    "Annotations": {
      "description": "100% of the rancher/rancher targets in cattle-system namespace are down.",
      "runbook_url": "https://runbooks.prometheus-operator.dev/runbooks/general/targetdown",
      "summary": "One or more targets are unreachable."
    },
    "Labels": {
      "alertname": "TargetDown",
      "job": "rancher",
      "namespace": "cattle-system",
      "service": "rancher",
      "severity": "warning"
    },
    "State": "firing",
    "Value": "1e+02"
  },
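For picking such alerts out of the file without paging through it, something like the sketch below could work. It assumes prometheus-alerts.json is a JSON array of objects shaped like the one above (`State`, `Labels`, and so on); `firing_target_down` is a hypothetical helper name.

```python
def firing_target_down(alerts, namespace="cattle-system"):
    """Return firing TargetDown alerts for the given namespace.

    Hypothetical helper: `alerts` is assumed to be the list parsed from
    prometheus-alerts.json, with the same keys as the excerpt above.
    """
    matches = []
    for alert in alerts:
        labels = alert.get("Labels", {})
        if (
            alert.get("State") == "firing"
            and labels.get("alertname") == "TargetDown"
            and labels.get("namespace") == namespace
        ):
            matches.append(alert)
    return matches
```

Usage would be along the lines of `firing_target_down(json.load(open("prometheus-alerts.json")))`. Matching on the labels rather than on the `Value` string avoids picking up unrelated alerts that happen to have the same value.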
> aggregated api server
TIL. I hadn't heard of this before. Thanks!
> FWIW, the SB from #8224 indicated that the cattle-system/rancher service has been down for a while
But in one of the SBs, I do see the yamls directory despite the same alert about 100% of rancher/rancher targets being down. If the rancher outage is the reason the yamls weren't collected, how could that SB have the yamls directory? 🤔
And, weirdly, in the latest SB that I was able to load, I don't see the APIService for that aggregated API server at all, even though the service and the rancher pods are there:
$ k get apiservices --no-headers| wc -l
69
$ k get apiservices v1.ext.cattle.io
Error from server (NotFound): apiservices.apiregistration.k8s.io "v1.ext.cattle.io" not found
$ k get svc imperative-api-extension
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
imperative-api-extension ClusterIP 10.53.5.168 <none> 6666/TCP 21d
$ k get pods -l app=rancher
NAME READY STATUS RESTARTS AGE
rancher-6f888f8789-27jw9 1/1 Running 0 15d
rancher-6f888f8789-lkp5t 1/1 Running 0 14d
rancher-6f888f8789-q9jqq 1/1 Running 0 15d
# in spite of things looking OK, the prometheus alert is firing
$ cat prometheus-alerts.json | jq ' .[] | select(.Value=="1e+02")'
{
  "activeAt": "2025-05-01T18:46:04Z",
  "Annotations": {
    "description": "100% of the rancher/rancher targets in cattle-system namespace are down.",
    "runbook_url": "https://runbooks.prometheus-operator.dev/runbooks/general/targetdown",
    "summary": "One or more targets are unreachable."
  },
  "Labels": {
    "alertname": "TargetDown",
    "job": "rancher",
    "namespace": "cattle-system",
    "service": "rancher",
    "severity": "warning"
  },
  "State": "firing",
  "Value": "1e+02"
}
Is this behaviour expected?
I think the missing apiservices/v1.ext.cattle.io would explain the api discovery error:
Unable to fetch namespaced resources: unable to retrieve the complete list of server APIs: ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1
The next step is to determine why it's missing. Did this API exist in 1.4.2? Did the Rancher version change between 1.4.2 and 1.5.0, causing this APIService to be deleted (but somehow not recreated) during an upgrade? It may also be worth checking whether there are similar known Rancher issues.
I don't know whether this caused the upgrade to fail, or the other way round, with the upgrade failure causing this.