Troubleshoot stops collecting logs when it encounters a pod in `Shutdown` state
Problem to solve
Starting with Kubernetes 1.21, the GracefulNodeShutdown feature gate defaults to true:
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
This feature causes Pods to be retained in a `Shutdown` state after a single-node cluster is rebooted: the Pod object is still present, but no containers are running. When a Troubleshoot Pod Logs collector matches one of these Shutdown Pods, the entire collector aborts without collecting any data.
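For context, these leftover Pods can be spotted directly from the API. The sketch below is illustrative only: it assumes a kubeconfig at ~/.kube/config and that GracefulNodeShutdown leaves Pods in phase Failed, with kubectl rendering the status.reason (e.g. Shutdown) in its STATUS column.

package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a client from the local kubeconfig (path is an assumption; adjust as needed).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Pods left behind by GracefulNodeShutdown are in phase Failed; list them
	// across all namespaces and print the phase and reason the kubelet recorded.
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "status.phase=Failed",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s/%s\tphase=%s\treason=%s\n",
			p.Namespace, p.Name, p.Status.Phase, p.Status.Reason)
	}
}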
Proposal
Troubleshoot should collect logs from all Pods that match configured selectors without halting if a single pod is present in the Shutdown state.
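One possible shape for that behavior is sketched below: check whether a matched Pod can still serve a log stream and skip it (recording the skip) instead of aborting the whole collector. This is a minimal illustration under that assumption, not the actual Troubleshoot code and not necessarily how #643 implements the fix.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// canStreamLogs reports whether requesting a log stream for the pod is likely
// to succeed. Pods retained by GracefulNodeShutdown sit in phase Failed with
// no running containers, so a log request for them returns an error.
func canStreamLogs(pod corev1.Pod) bool {
	switch pod.Status.Phase {
	case corev1.PodRunning, corev1.PodSucceeded:
		return true
	default:
		// Failed (e.g. Shutdown after a node reboot), Pending, Unknown:
		// skip this pod rather than abort the whole collector.
		return false
	}
}

func main() {
	pods := []corev1.Pod{
		{ObjectMeta: metav1.ObjectMeta{Name: "kotsadm-running"},
			Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{ObjectMeta: metav1.ObjectMeta{Name: "kotsadm-shutdown"},
			Status: corev1.PodStatus{Phase: corev1.PodFailed, Reason: "Shutdown"}},
	}
	for _, p := range pods {
		if !canStreamLogs(p) {
			fmt.Printf("skipping %s: no containers available for logs\n", p.Name)
			continue
		}
		fmt.Printf("collecting logs from %s\n", p.Name)
	}
}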
Reproduction Case
- Acquire an Ubuntu 20.04 VM with 8 CPUs and 32 GB of RAM.
- Generate a kURL installer using Kubernetes 1.21 and KOTS 1.76.0 and install it in airgap mode:
cat <<'EOF' > installer.yaml
apiVersion: "cluster.kurl.sh/v1beta1"
kind: "Installer"
metadata:
  name: "fe352e0"
spec:
  kubernetes:
    version: "1.21.x"
  weave:
    version: "2.6.x"
  contour:
    version: "1.21.x"
  containerd:
    version: "1.5.x"
  velero:
    version: "1.9.0"
  kotsadm:
    version: "1.76.0"
  ekco:
    version: "latest"
  minio:
    version: "2022-07-06T20-29-49Z"
  openebs:
    version: "latest"
    isLocalPVEnabled: true
    localPVStorageClassName: "default"
EOF
installer_hash=$(curl -s -X POST -H "Content-Type: text/yaml" --data-binary "@installer.yaml" https://kurl.sh/installer |grep -o "[^/]*$")
curl -LO https://kurl.sh/bundle/"${installer_hash}".tar.gz
tar xzf "${installer_hash}.tar.gz"
cat install.sh | bash -s airgap
- Reboot the node after installation finishes:
systemctl reboot
- Post reboot, there will be `kotsadm` pods in the `Shutdown` state:
# kubectl get pods
NAME                                 READY   STATUS     RESTARTS   AGE
kotsadm-5fdff6554-gqg96              0/1     Shutdown   0          7m52s
kotsadm-5fdff6554-nntjj              1/1     Running    0          5m18s
kotsadm-postgres-0                   1/1     Running    0          5m18s
kurl-proxy-kotsadm-88448b447-865c5   0/1     Shutdown   0          7m50s
kurl-proxy-kotsadm-88448b447-p9drk   1/1     Running    0          5m16s
- Configure a support bundle to collect logs from all `kotsadm` pods:
cat <<'EOF' > kotsadm-logs.yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  collectors:
    - logs:
        selector:
          - "kots.io/kotsadm=true"
        name: "log-test"
EOF
- Run the collector and inspect the output:
/usr/local/bin/kubectl-support_bundle -n default ./kotsadm-logs.yaml
Outcome
The collector reports that the container in the Shutdown pod is not available and produces an archive with no data in the log-test folder:
# /usr/local/bin/kubectl-support_bundle version
Replicated Troubleshoot 0.37.1
# /usr/local/bin/kubectl-support_bundle -n default ./kotsadm-logs.yaml
* failed to run collector "logs/kots.io/kotsadm=true": failed to get log stream: container "kotsadm" in pod "kotsadm-5fdff6554-gqg96" is not available
Collecting support bundle ⠼ cluster-resources
support-bundle-2022-07-14T16_30_01.tar.gz
# tar tzf support-bundle-2022-07-14T16_30_01.tar.gz | grep --count log-test
0
Expected Outcome
The support bundle includes logs from all Pods matching the selector, as is the case when the Shutdown pods are removed:
# for ns in $(kubectl get ns | awk 'FNR > 1 {print $1}'); do
    printf 'Cleaning namespace: %s\n' "${ns}"
    kubectl get pods -n "${ns}" | awk '$3 ~ /Shutdown/ { print $1 }' | xargs --no-run-if-empty kubectl delete pods -n "${ns}"
done
Cleaning namespace: default
pod "kotsadm-5fdff6554-gqg96" deleted
pod "kurl-proxy-kotsadm-88448b447-865c5" deleted
Cleaning namespace: kube-node-lease
Cleaning namespace: kube-public
Cleaning namespace: kube-system
Cleaning namespace: kurl
pod "ekc-operator-5b5cffd645-s2hzk" deleted
Cleaning namespace: minio
pod "minio-598fdcc66d-nxhfd" deleted
Cleaning namespace: openebs
Cleaning namespace: projectcontour
pod "contour-6d57f96f7b-dfjbn" deleted
pod "contour-6d57f96f7b-xb59z" deleted
Cleaning namespace: velero
pod "velero-547b94768f-z78lj" deleted
# /usr/local/bin/kubectl-support_bundle -n default ./kotsadm-logs.yaml
Collecting support bundle ⠹ cluster-resources
support-bundle-2022-07-14T16_45_43.tar.gz
# tar tzf support-bundle-2022-07-14T16_45_43.tar.gz | grep log-test
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-5fdff6554-nntjj/schemahero-apply.log
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-5fdff6554-nntjj/restore-db.log
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-postgres-0.log
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-5fdff6554-nntjj/schemahero-plan.log
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-5fdff6554-nntjj/kotsadm.log
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-5fdff6554-nntjj/restore-s3.log
support-bundle-2022-07-14T16_45_43/log-test/kurl-proxy-kotsadm-88448b447-p9drk.log
@adamancini are you confirming that #643 resolves this?
@chris-sanders yeah I was able to build & test it today and #643 resolves the problem nicely.