
[Bug] Bundle is stuck permanently if collection agent fails on one node

Open ejweber opened this issue 2 years ago • 4 comments

For reasons outlined in #72, the support bundle collection process could not complete on one node in a cluster. It looks like we wait here indefinitely to receive all expected bundles before proceeding. Since the collection agent on one node failed before checking in, we did not proceed to finish creating the bundle, and the user had nothing to send to support.

Some suggested resolutions:

  • A timeout mechanism could automatically send on m.ch after some time, even if all bundles had not been received. This would ensure we got something, though we would have to determine what a reasonable timeout should be (see the sketch after this list).
  • Watch for DaemonSet Pod restarts. After some threshold (or maybe just one), stop expecting the corresponding collection agent to send a bundle.
  • The collection agent could survive errors like the one the user experienced and send at least something to the manager. This probably doesn't help us in a network partition, etc.
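As a rough illustration of the first bullet only: a minimal sketch, assuming the manager currently blocks on a plain receive from m.ch. The manager type, the waitForBundles helper, and the timeout value are hypothetical; only m.ch comes from the discussion above.

```go
package main

import (
	"log"
	"time"
)

// manager is a stand-in for the real support-bundle manager; ch plays the
// role of m.ch mentioned above, signaled once all node bundles are received.
type manager struct {
	ch chan struct{}
}

// waitForBundles is a hypothetical helper: instead of blocking on m.ch
// forever, give up after a deadline and continue with whatever bundles
// have already arrived.
func (m *manager) waitForBundles(timeout time.Duration) {
	select {
	case <-m.ch:
		log.Print("all expected node bundles received")
	case <-time.After(timeout):
		log.Printf("timed out after %s waiting for node bundles; packaging partial data", timeout)
	}
}

func main() {
	m := &manager{ch: make(chan struct{})}
	// Nothing ever sends on m.ch here, simulating an agent that never checks in.
	m.waitForBundles(2 * time.Second)
}
```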

ejweber · May 25 '23 18:05

@Yu-Jack can you help with this one? thanks.

innobead · Mar 27 '24 02:03

@innobead sure, no problem, I'll look into it.

Yu-Jack · Mar 27 '24 02:03

Although adding a timeout would let the manager finish the process eventually, it's hard to decide on a reasonable timeout. One reason is that different nodes have different environments, so we can't predict how long each node will take.

Another reason is file size: we can't predict how large the bundle will be. For example, two very different bundle sizes are mentioned in https://github.com/rancher/support-bundle-kit/issues/72, and the agent timeout also affects uploading.

So, I think we could monitor the progress of all nodes. To achieve this, we need to rewrite our shell script and combine it with our Golang code; that way we would also know which step is stuck. After that, we could show the progress of each node in the GUI, so we know which node is stuck or failed and which node succeeded, and we could even show a Stop button to terminate them.
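As a rough sketch of what per-node progress tracking in the manager could look like (all names here are made up for illustration, not from the actual codebase):

```go
package main

import (
	"fmt"
	"sync"
)

// NodeProgress is a hypothetical record of how far each node's agent has
// gotten, so the manager (and eventually the GUI) can show which node is
// stuck, failed, or done.
type NodeProgress struct {
	CurrentStep string // e.g. "collect-logs", "package", "upload"
	Done        bool
	Err         string
}

// progressBoard keeps the latest progress reported by every node.
type progressBoard struct {
	mu    sync.Mutex
	nodes map[string]NodeProgress
}

func (b *progressBoard) Update(node string, p NodeProgress) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.nodes[node] = p
}

func (b *progressBoard) Report() {
	b.mu.Lock()
	defer b.mu.Unlock()
	for node, p := range b.nodes {
		fmt.Printf("%s: step=%s done=%v err=%q\n", node, p.CurrentStep, p.Done, p.Err)
	}
}

func main() {
	b := &progressBoard{nodes: map[string]NodeProgress{}}
	b.Update("node-1", NodeProgress{CurrentStep: "upload", Done: true})
	b.Update("node-2", NodeProgress{CurrentStep: "collect-logs", Err: "disk full"})
	b.Report()
}
```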

Here is my idea:

  • Use smaller bundles instead of one big bundle during collection: in the original flow, after we collect A & B, we bundle them together as one bundle.tar.gz and send it to the manager.

    In the new flow, we could send separate bundles, A.tar.gz and B.tar.gz, to the manager. That way we would at least know what has already been collected; something is better than nothing. However, we would need to unpack those tarballs in the manager pod and place the files locally.

    That also lets us know what the current step is during collection.

  • Manager pulls progress from the agent and sets a timeout for each step: once we get the progress, we also know the full list of steps and which steps haven't run yet.

  • Tail the pod log when a timeout occurs: if a timeout occurs, tail the agent's pod logs and save them into the bundle file for further investigation (a rough sketch follows this list).
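A rough sketch of the log-tailing step with client-go, assuming the manager already holds a Kubernetes clientset; the package name, helper name, namespace/pod parameters, and tail length are all placeholders:

```go
package collector

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// tailAgentLog is a hypothetical helper: when an agent times out, fetch the
// last lines of its pod log so they can be written into the bundle for
// later investigation.
func tailAgentLog(ctx context.Context, client kubernetes.Interface, namespace, podName string, dst io.Writer) error {
	tail := int64(200) // placeholder: how many lines to keep
	req := client.CoreV1().Pods(namespace).GetLogs(podName, &corev1.PodLogOptions{
		TailLines: &tail,
	})
	stream, err := req.Stream(ctx)
	if err != nil {
		return err
	}
	defer stream.Close()
	_, err = io.Copy(dst, stream)
	return err
}
```

The manager could call something like this right before giving up on a node and write the output to a file inside the bundle.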

If we choose to do the simple thing, I think we could:

  • Add a generous timeout, just to make sure the agent ends eventually.
  • Have the manager pod tail the agent pod log when the timeout occurs and save it into the bundle file for further investigation.

The disadvantage of this approach is that the timeout might not be very useful, as I mentioned before. @bk201 WDYT?

Yu-Jack · Mar 28 '24 09:03

My two cents is to make it simple:

  • Set a reasonable timeout (we can measure it in our development VMs and add some buffer) and have the sb manager skip any agent that can't report back in time. A node might be dead or in trouble; the worst case is asking the user/support to log in to the node to retrieve information. No need for the fancy tailing, but we can include the agent pod log (in fact it's already included).
  • (Optional) Make the timeout values configurable (a sketch follows below).
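For the optional configurability point, one common pattern is an environment variable (or CLI flag) with a sane default; the variable name and default duration below are placeholders, not existing settings:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// bundleTimeout reads a timeout from the environment, falling back to a
// default measured on development VMs plus some buffer.
// SUPPORT_BUNDLE_NODE_TIMEOUT is a made-up name for illustration only.
func bundleTimeout() time.Duration {
	const fallback = 30 * time.Minute // placeholder default
	if v := os.Getenv("SUPPORT_BUNDLE_NODE_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return fallback
}

func main() {
	fmt.Println("waiting up to", bundleTimeout(), "for node bundles")
}
```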

Can you check whether this issue duplicates https://github.com/harvester/harvester/issues/1646?

bk201 · Mar 28 '24 15:03