
test commit

Open · drohnow opened this issue 3 years ago · 3 comments

testing.

drohnow · Sep 27 '22 20:09

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
10 out of 11 committers have signed the CLA.

:white_check_mark: diamonwiggins
:white_check_mark: danj-replicated
:white_check_mark: xavpaice
:white_check_mark: adamancini
:white_check_mark: banjoh
:white_check_mark: edgarlanting
:white_check_mark: ahmedElqutb
:white_check_mark: crdant
:white_check_mark: e3b0c442
:white_check_mark: stefanrepl
:x: David Rohnow


David Rohnow does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Have you already signed the CLA but the status is still pending? Let us recheck it.

CLAassistant · Sep 27 '22 20:09

Hi @xavpaice! Please check out the progress so far. I added a context.WithTimeout to Logs() and configured it to cancel when the timeout expires. I included mock-context code for the POC; it will not be merged into main.

I'd like to discuss whether we should also add a context timeout for the "sub-processes" within Logs().
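
A minimal sketch of what wrapping a log-collection call in context.WithTimeout might look like. The collectLogs helper and its signature are illustrative only, not the actual Troubleshoot API:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// collectLogs stands in for the real pod-log collection; it honours ctx
// cancellation so a hung fetch returns as soon as the deadline passes.
func collectLogs(ctx context.Context, pod string) (string, error) {
	select {
	case <-time.After(2 * time.Second): // simulate a slow log fetch
		return "logs for " + pod, nil
	case <-ctx.Done():
		return "", ctx.Err() // context.DeadlineExceeded on timeout
	}
}

func main() {
	// Cancel the collection if it takes longer than one second.
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	logs, err := collectLogs(ctx, "my-pod")
	if err != nil {
		fmt.Println("log collection failed:", err)
		return
	}
	fmt.Println(logs)
}
```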

drohnow · Sep 28 '22 17:09

Re how to structure the context/timeout: I recall we talked about whether each log collection gets an individual timeout (i.e. 100 pods would add up to 100x the timeout) or whether there is one overall timeout for the entire collection. I think this becomes less of an issue if/when we change log collection to run concurrently; run sequentially, though, it's a problem. For this particular change, I think it would be preferable to have a 'per pod' timeout, set to a value that is a fairly safe "if it is still going after that long, we should call it failed and move on". That makes the total time more predictable.
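
A rough sketch of that 'per pod' timeout idea, reusing the hypothetical collectLogs helper from the earlier sketch (names and signatures are illustrative). Each iteration gets its own fresh deadline, so a single hung pod costs at most perPodTimeout before the loop moves on:

```go
// collectAllLogs fetches logs sequentially, giving each pod its own timeout.
// Requires "context" and "time"; builds on the collectLogs stub sketched above.
func collectAllLogs(parent context.Context, pods []string, perPodTimeout time.Duration) map[string]string {
	results := make(map[string]string)
	for _, pod := range pods {
		ctx, cancel := context.WithTimeout(parent, perPodTimeout)
		logs, err := collectLogs(ctx, pod)
		cancel() // release the timer for this pod before starting the next
		if err != nil {
			// Record the failure and continue instead of hanging indefinitely.
			results[pod] = "error: " + err.Error()
			continue
		}
		results[pod] = logs
	}
	return results
}
```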

The intent of the change here is to prevent Troubleshoot from hanging indefinitely because one pod log isn't returning.

Future changes should improve timeout handling for large clusters with lots of pods by triggering log collection concurrently, as the overall time for multiple pods would then be similar to that for one pod. Given that this is on the roadmap, I think we can live with the tradeoff above until then.
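
For reference, a sketch of what that future concurrent collection might look like, again using the hypothetical collectLogs helper; total wall-clock time becomes roughly that of the slowest pod rather than the sum of all pods:

```go
// collectAllLogsConcurrent fetches each pod's logs in its own goroutine,
// each with its own per-pod timeout. Requires "context", "sync", and "time".
func collectAllLogsConcurrent(parent context.Context, pods []string, perPodTimeout time.Duration) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string)
	)
	for _, pod := range pods {
		wg.Add(1)
		go func(pod string) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(parent, perPodTimeout)
			defer cancel()
			logs, err := collectLogs(ctx, pod)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				results[pod] = "error: " + err.Error()
				return
			}
			results[pod] = logs
		}(pod)
	}
	wg.Wait()
	return results
}
```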

xavpaice · Oct 09 '22 21:10