
Troubleshoot stops collecting logs when it encounters a pod in `Shutdown` state

xavpaice opened this issue 3 years ago

Problem to solve

Starting with Kubernetes 1.21, the GracefulNodeShutdown feature gate defaults to true:

https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/

This feature causes Pods to be retained in a Shutdown state after a single-node cluster is rebooted: the pod object is still present, but none of its containers are running. When a Troubleshoot Pod Logs collector matches one of these Shutdown pods, the entire collector aborts without collecting any data.
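
For reference, the state that kubectl renders as Shutdown appears to be a pod whose phase is Failed with status.reason set to Shutdown (inferred from the kubectl output in the reproduction case below, which is also where the pod name comes from). It can be confirmed with:

kubectl get pod kotsadm-5fdff6554-gqg96 -o jsonpath='{.status.phase} {.status.reason}{"\n"}'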

Proposal

Troubleshoot should collect logs from all Pods that match the configured selectors, without halting when one of the matched pods is in the Shutdown state.
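
Expressed with plain kubectl, a minimal sketch of the desired behaviour (not Troubleshoot's actual implementation) is to iterate over the pods that match the selector, skip those whose phase is Failed (which is how the Shutdown pods are classified), and collect logs from the rest instead of aborting the whole collector:

# List matching pods, excluding Failed (Shutdown) ones, and fetch logs per pod
for pod in $(kubectl get pods -n default -l kots.io/kotsadm=true \
    --field-selector=status.phase!=Failed -o name); do
  kubectl logs -n default "${pod}" --all-containers --prefix
done

Alternatively, the collector could still attempt each matched pod and record the "container is not available" error for that pod only, rather than failing the whole logs collector.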

Reproduction Case

  • Acquire an Ubuntu 20.04 VM with 8 CPUs and 32 GB of RAM.

  • Generate a kURL installer using Kubernetes 1.21 and KOTS 1.76.0, and install it in airgap mode:

cat <<'EOF' > installer.yaml
apiVersion: "cluster.kurl.sh/v1beta1"
kind: "Installer"
metadata: 
  name: "fe352e0"
spec: 
  kubernetes: 
    version: "1.21.x"
  weave: 
    version: "2.6.x"
  contour: 
    version: "1.21.x"
  containerd: 
    version: "1.5.x"
  velero: 
    version: "1.9.0"
  kotsadm: 
    version: "1.76.0"
  ekco: 
    version: "latest"
  minio: 
    version: "2022-07-06T20-29-49Z"
  openebs: 
    version: "latest"
    isLocalPVEnabled: true
    localPVStorageClassName: "default"
EOF

installer_hash=$(curl -s -X POST -H "Content-Type: text/yaml" --data-binary "@installer.yaml" https://kurl.sh/installer | grep -o "[^/]*$")

curl -LO https://kurl.sh/bundle/"${installer_hash}".tar.gz
tar xzf "${installer_hash}.tar.gz"
cat install.sh | bash -s airgap
  • Reboot the node after installation finishes: systemctl reboot

  • After the reboot, there will be kotsadm pods in the Shutdown state:

# kubectl get pods
NAME                                 READY   STATUS     RESTARTS   AGE
kotsadm-5fdff6554-gqg96              0/1     Shutdown   0          7m52s
kotsadm-5fdff6554-nntjj              1/1     Running    0          5m18s
kotsadm-postgres-0                   1/1     Running    0          5m18s
kurl-proxy-kotsadm-88448b447-865c5   0/1     Shutdown   0          7m50s
kurl-proxy-kotsadm-88448b447-p9drk   1/1     Running    0          5m16s
  • Configure a support bundle to collect logs from all kotsadm pods:
cat <<'EOF' > kotsadm-logs.yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  collectors:
    - logs:
        selector:
          - "kots.io/kotsadm=true"
        name: "log-test"
EOF
  • Run the collector and inspect the output:
/usr/local/bin/kubectl-support_bundle -n default ./kotsadm-logs.yaml

Outcome

The collector reports that it failed to run because the container is not available, and the resulting archive contains no data in the log-test folder:

# /usr/local/bin/kubectl-support_bundle version
Replicated Troubleshoot 0.37.1

# /usr/local/bin/kubectl-support_bundle -n default ./kotsadm-logs.yaml 
 * failed to run collector "logs/kots.io/kotsadm=true": failed to get log stream: container "kotsadm" in pod "kotsadm-5fdff6554-gqg96" is not available
 Collecting support bundle ⠼ cluster-resources
support-bundle-2022-07-14T16_30_01.tar.gz

# tar tzf support-bundle-2022-07-14T16_30_01.tar.gz | grep --count log-test
0

Expected Outcome

The support bundle includes logs from all Pods matching the selector, as is the case when the Shutdown pods are removed:

# for ns in $(kubectl get ns| awk 'FNR > 1 {print $1}'); do
  printf 'Cleaning namespace: %s\n' "${ns}"
  kubectl get pods -n "${ns}" | awk '$3 ~ /Shutdown/ { print $1 }' | xargs --no-run-if-empty kubectl delete pods -n "${ns}"
done
Cleaning namespace: default
pod "kotsadm-5fdff6554-gqg96" deleted
pod "kurl-proxy-kotsadm-88448b447-865c5" deleted
Cleaning namespace: kube-node-lease
Cleaning namespace: kube-public
Cleaning namespace: kube-system
Cleaning namespace: kurl
pod "ekc-operator-5b5cffd645-s2hzk" deleted
Cleaning namespace: minio
pod "minio-598fdcc66d-nxhfd" deleted
Cleaning namespace: openebs
Cleaning namespace: projectcontour
pod "contour-6d57f96f7b-dfjbn" deleted
pod "contour-6d57f96f7b-xb59z" deleted
Cleaning namespace: velero
pod "velero-547b94768f-z78lj" deleted

# /usr/local/bin/kubectl-support_bundle -n default ./kotsadm-logs.yaml

 Collecting support bundle ⠹ cluster-resources
support-bundle-2022-07-14T16_45_43.tar.gz

# tar tzf support-bundle-2022-07-14T16_45_43.tar.gz | grep log-test
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-5fdff6554-nntjj/schemahero-apply.log
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-5fdff6554-nntjj/restore-db.log
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-postgres-0.log
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-5fdff6554-nntjj/schemahero-plan.log
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-5fdff6554-nntjj/kotsadm.log
support-bundle-2022-07-14T16_45_43/log-test/kotsadm-5fdff6554-nntjj/restore-s3.log
support-bundle-2022-07-14T16_45_43/log-test/kurl-proxy-kotsadm-88448b447-p9drk.log

xavpaice · Aug 11 '22 03:08

@adamancini are you confirming that #643 resolves this?

chris-sanders · Aug 11 '22 21:08

@chris-sanders yeah I was able to build & test it today and #643 resolves the problem nicely.

adamancini · Sep 01 '22 17:09