RunPod collector improvements
Describe suggested improvements
- Once a pod reaches the PodRunning state, the collector stops waiting and immediately starts its cleanup process (deleting the pod). This can cut the pod's task short before it finishes; see the sketch right after this list for one way to wait for completion instead.
- When the collector deletes the pod, it should probably ignore the grace period and wait (with a timeout) for the pod to actually be deleted. A new pod created with the same name in the same namespace will fail while the old one is still being deleted. Below is an example of a pod left behind that caused another pod to fail to schedule. The same situation arises when two collector specs share the same name, which produces the same pod name. Whether that counts as an error is debatable, because the collectors should probably be named differently; perhaps we should raise a warning/error when specs share a name (a small sketch of such a check follows the first run's output). Either way, here are the results.
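For the first point, here is a minimal sketch of waiting for the pod to reach a terminal phase (Succeeded or Failed) before cleanup. It assumes a client-go clientset and a recent apimachinery that provides wait.PollUntilContextTimeout; the helper name and signature are hypothetical, not the collector's actual code.

package collect

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodCompletion polls until the pod reaches a terminal phase
// (Succeeded or Failed) or the timeout expires (hypothetical helper).
func waitForPodCompletion(ctx context.Context, client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		p, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		// Keep polling while the pod is Pending or Running; stop once it is done.
		return p.Status.Phase == corev1.PodSucceeded || p.Status.Phase == corev1.PodFailed, nil
	})
}

The spec used to reproduce the name collision from the second point: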
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: example
spec:
  collectors:
    - runPod:
        collectorName: "my-counter"
        timeout: 5s
        podSpec:
          containers:
            - name: my-counter
              image: ubuntu
              command: ["/bin/bash", "-c", 'for i in {1..2}; do echo "count $i"; sleep 1; done']
    - runPod:
        collectorName: "my-counter"
        timeout: 5s
        podSpec:
          containers:
            - name: my-counter
              image: ubuntu
              command: ["/bin/bash", "-c", 'for i in {1..2}; do echo "count again $i"; sleep 1; done']
[evans] $ ./bin/support-bundle spec.yaml -v0 && kubectl get pods
* failed to run collector: run-pod/my-counter: failed to run pod: failed to create pod: object is being deleted: pods "my-counter" already exists
Collecting support bundle ⠏ cluster-info
support-bundle-2023-05-23T17_48_39.tar.gz
============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
run-pod/my-counter (S) : 4,373ms
cluster-resources (S) : 2,128ms
run-pod/my-counter (F) : 3ms
cluster-info (S) : 2ms
============ Redactors summary =============
In-cluster collectors : 598ms
============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed
Duration: 7,133ms
NAME READY STATUS RESTARTS AGE
my-counter 1/1 Terminating 1 (1s ago) 5s
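One way to surface the name collision earlier would be to validate the spec before any pods are created and warn (or error) when two runPod collectors resolve to the same pod name. A minimal sketch of such a check, operating only on the collected names (the real spec types are omitted and the helper is hypothetical):

package collect

import "fmt"

// checkDuplicateRunPodNames returns an error when two runPod collectors would
// produce pods with the same name in the same namespace (hypothetical helper).
func checkDuplicateRunPodNames(names []string) error {
	seen := map[string]bool{}
	for _, name := range names {
		if seen[name] {
			return fmt.Errorf("duplicate runPod collector name %q: the generated pods would collide", name)
		}
		seen[name] = true
	}
	return nil
}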
Possible improvement to add
Modifying the grace period to "force" delete the pod immediately might be the trick here; I made that change (snippet below). We'd need to consider what side effects this has on the CRI, because a "force" deletion removes the pod object and instructs the CRI to stop the containers. That should be normal operation, but it's worth calling out for when this gets addressed.
defer func() {
	// Force the deletion: a zero grace period removes the pod object immediately
	// instead of waiting out the default termination grace period.
	g := int64(0)
	if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{
		GracePeriodSeconds: &g,
	}); err != nil {
		klog.Errorf("Failed to delete pod %s: %v", pod.Name, err)
	}
}()
Results
[evans] $ ./bin/support-bundle spec.yaml -v0 && kubectl get pods
* failed to run collector: run-pod/my-counter: timeout
Collecting support bundle ⠴ my-counter
support-bundle-2023-05-23T17_40_09.tar.gz
============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
run-pod/my-counter (F) : 5,063ms
cluster-resources (S) : 2,131ms
cluster-info (S) : 2ms
============ Redactors summary =============
In-cluster collectors : 511ms
============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed
Duration: 7,735ms
No resources found in default namespace.
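Even with the forced delete above, the collector only requests deletion; it never confirms the pod is gone before a later collector tries to reuse the name. Below is a minimal sketch of the "delete and wait" behaviour suggested in the second bullet, assuming a client-go clientset (the helper is hypothetical, not the collector's actual code):

package collect

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteAndWait force-deletes the pod (grace period 0) and then polls until
// the API server no longer returns it, so a follow-up pod with the same name
// can be created safely (hypothetical helper).
func deleteAndWait(ctx context.Context, client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	gracePeriod := int64(0)
	if err := client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriod,
	}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		_, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // the pod is really gone
		}
		return false, err
	})
}

Polling for NotFound (rather than trusting the delete call alone) is what guarantees a follow-up pod with the same name can actually be created.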
Additional context
This issue spawned from the conversation at https://github.com/replicatedhq/troubleshoot/pull/1172#issuecomment-1559272764.