
RunPod collector improvements

Open banjoh opened this issue 2 years ago • 5 comments

Describe suggested improvements

  • Once a pod reaches the PodRunning state, the collector stops waiting and starts its cleanup process (deleting the pod). This can stop the pod's task prematurely.
  • When the collector deletes the pod, it should probably ignore the grace period and wait (with a timeout) for the pod to be deleted. A new pod created with the same name in the same namespace will fail if the old one is still being deleted. Below is an example where a pod left behind caused another pod to fail to schedule: two runPod specs use the same collectorName, which leads to the same pod name. The error itself is debatable, because the collectors should probably be named differently; perhaps we should raise a warning/error when specs have the same name. Either way, here are the results.
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: example
spec:
  collectors:
    - runPod:
        collectorName: "my-counter"
        timeout: 5s
        podSpec:
          containers:
          - name: my-counter
            image: ubuntu
            command: ["/bin/bash", "-c", 'for i in {1..2}; do echo "count $i"; sleep 1; done']
    - runPod:
        collectorName: "my-counter"
        timeout: 5s
        podSpec:
          containers:
          - name: my-counter
            image: ubuntu
            command: ["/bin/bash", "-c", 'for i in {1..2}; do echo "count again $i"; sleep 1; done']
[evans] $ ./bin/support-bundle spec.yaml -v0 && kubectl get pods

 * failed to run collector: run-pod/my-counter: failed to run pod: failed to create pod: object is being deleted: pods "my-counter" already exists
 Collecting support bundle ⠏ cluster-info
support-bundle-2023-05-23T17_48_39.tar.gz

============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
run-pod/my-counter (S) : 4,373ms
cluster-resources (S)  : 2,128ms
run-pod/my-counter (F) : 3ms
cluster-info (S)       : 2ms

============ Redactors summary =============
In-cluster collectors : 598ms

============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed

Duration: 7,133ms
NAME         READY   STATUS        RESTARTS     AGE
my-counter   1/1     Terminating   1 (1s ago)   5s
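For the first bullet (tearing the pod down as soon as it is Running), the fix would be to wait for a terminal phase instead. Below is a minimal stdlib-only sketch of that wait loop; `getPhase` is a hypothetical probe standing in for a client-go `Pods().Get` call, not the actual troubleshoot code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// isTerminalPhase reports whether a pod phase means the pod's task is done.
// A collector that stops waiting at "Running" can tear the pod down mid-task,
// so the wait should continue until one of these phases is seen.
func isTerminalPhase(phase string) bool {
	return phase == "Succeeded" || phase == "Failed"
}

// waitForTerminalPhase polls getPhase until the pod reaches a terminal phase
// or the context times out.
func waitForTerminalPhase(ctx context.Context, getPhase func() (string, error), interval time.Duration) (string, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		phase, err := getPhase()
		if err != nil {
			return "", err
		}
		if isTerminalPhase(phase) {
			return phase, nil
		}
		select {
		case <-ctx.Done():
			return phase, errors.New("timed out waiting for pod to finish")
		case <-ticker.C:
		}
	}
}

func main() {
	// Simulated pod that runs for a few polls before succeeding.
	phases := []string{"Pending", "Running", "Running", "Succeeded"}
	i := 0
	getPhase := func() (string, error) {
		p := phases[i]
		if i < len(phases)-1 {
			i++
		}
		return p, nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	phase, err := waitForTerminalPhase(ctx, getPhase, 10*time.Millisecond)
	fmt.Println(phase, err)
}
```

The collector's existing timeout field would map naturally onto the context deadline here.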

Possible improvement to add

Modifying the grace period to "force" delete the pod immediately might be the trick here. I made this change here. We'd need to consider what side effects this has on the CRI, because a "force" deletion deletes the pod object and instructs the CRI to stop the containers. These should be normal operations, but worth calling out for when this gets addressed.

    defer func() {
        g := int64(0)
        if err := client.CoreV1().Pods(pod.Namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{
            GracePeriodSeconds: &g,
        }); err != nil {
            klog.Errorf("Failed to delete pod %s: %v", pod.Name, err)
        }
    }()

Results

[evans] $ ./bin/support-bundle spec.yaml -v0 && kubectl get pods

 * failed to run collector: run-pod/my-counter: timeout
 Collecting support bundle ⠴ my-counter
support-bundle-2023-05-23T17_40_09.tar.gz

============ Collectors summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
run-pod/my-counter (F) : 5,063ms
cluster-resources (S)  : 2,131ms
cluster-info (S)       : 2ms

============ Redactors summary =============
In-cluster collectors : 511ms

============= Analyzers summary =============
Suceeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed

Duration: 7,735ms
No resources found in default namespace.
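Separately, the duplicate collectorName case in the spec above could be flagged at load time instead of failing at pod creation. A minimal sketch of that uniqueness check; the function and []string input are hypothetical stand-ins for the parsed collector names, not the actual troubleshoot API:

```go
package main

import "fmt"

// duplicateNames returns the collector names that appear more than once in a
// spec, so the loader could raise a warning (or error) before any pods are
// created with colliding names.
func duplicateNames(names []string) []string {
	seen := map[string]int{}
	var dups []string
	for _, n := range names {
		seen[n]++
		if seen[n] == 2 { // report each duplicate once
			dups = append(dups, n)
		}
	}
	return dups
}

func main() {
	// The spec above declares two runPod collectors named "my-counter".
	fmt.Println(duplicateNames([]string{"my-counter", "my-counter"}))
}
```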

Additional context

This issue spawned from the conversation in https://github.com/replicatedhq/troubleshoot/pull/1172#issuecomment-1559272764

banjoh avatar May 25 '23 16:05 banjoh