support-bundle-kit

feat: add node timeout

Open Yu-Jack opened this issue 1 year ago • 0 comments

Problem

Sometimes a node takes too long to collect logs, or never finishes at all. As a result, users cannot download the support bundle.

Solution

Add a timeout for node collection. When collection on a node reaches the timeout, the manager skips that node instead of getting stuck, so users can download the support bundle without waiting indefinitely. In that situation, however, the support bundle file contains no logs from the node that timed out.

For example, if node A finishes before the timeout but node B does not, the support bundle file contains only node A's logs. That still gives us something to check.

Related Issue

https://github.com/harvester/harvester/issues/1646

Test

This is the test case where node collection reaches the timeout.

level=debug msg="Creating daemonset supportbundle-agent-bundle-1wrog with image jk82421/support-bundle-kit:v0.0.36.4"  
level=debug msg="Waiting for the creation of agent DaemonSet Pods for scheduled node names collection"       
level=debug msg="Expecting bundles from nodes: map[jacklnode:]"      
level=info msg="Some node bundles are received." 
level=warning msg="Collection timed out for node: jacklnode"         
level=info msg="Succeed to run phase node bundle. Progress (60)."

It's hard to simulate a node being stuck for 30 minutes, so I suggest the following steps to test:

  1. Create a support bundle.
  2. Change the env of deployment/support-bundle-manager-xxxx and set it like this:
     - name: SUPPORT_BUNDLE_NODE_TIMEOUT
       value: "1s"
  3. Wait for the deployment to restart.
  4. Download the support bundle. It shouldn't contain any node logs.

TODO

If we're okay with this default timeout, the following items can be postponed.

  • [ ] Node timeout setting/documentation for harvester/harvester
  • [ ] Node timeout setting/documentation for longhorn

Yu-Jack, Apr 17 '24 08:04