Analyzer result shows pods created by Troubleshoot as terminated
Problem to solve
The support bundle screen in KOTS has looked very busy since the metadata/errors about failing pods were added.
Furthermore, analyzers are shown for pods that were run by the support-bundle process itself!
Proposal
Any or all of:
- hide failed pods that are descendants of K8s Jobs
- hide failed pods that were created by troubleshoot (e.g. via the run collector)
- hide automatic failed-pod analyzers entirely until we have something that looks good
- if a deployment or job is in a bad state, just show that and hide analyzers about its descendant replicasets and pods (or, hide analyzers about parent objects and just show the pod analyzers)
Further details
A big contributor here is that the Sentry-Provision-User Job usually has 3-5 pods fail while the database comes up, before the job eventually succeeds. This is a very common pattern in many apps, and is kind of what Jobs were designed for.
There have been months of iteration on this feature since SEs raised concerns about the original change, and I feel like there's still not a lot of empathy for how this ends up looking when CE runs demos for prospects. This still looks really rough. I am totally in favor of iteration here, but it's becoming frustrating how many SCs have been written and how long this has been generally degraded.
An alternative would be to change/rearchitect the application all SEs demo with, but that doesn't help the second impact point -- if a prospect uses a Job that has similar behavior (and they are quite common in my experience) then they will see the same busy UI.
Customer Impact (importance/urgency)
This impacts our ability to demo well, and our ability to show newer features. This affects our demo->PoV conversion rate.
This impacts the customer's experience with the product -- I fear it will hurt our PoV->close rate as well.
What does business success look like, and how can we measure that?
- SEs are demoing with the latest version of kots instead of intentionally downgrading to avoid this busy view.
- Higher PoV close rates.
- Less frequent feedback on lost deals that troubleshoot is "not that good" or that "we can build this ourselves easily".
Links / references
Relates to:
- https://app.shortcut.com/replicated/story/38792/some-insights-have-null-as-the-level
- https://app.shortcut.com/replicated/story/38540/support-bundle-analyzer-errors
- https://app.shortcut.com/replicated/story/39739/analysis-overview-shows-errors-about-pods-in-other-namespaces
Shortcut story for this issue: https://app.shortcut.com/replicated/story/40888/support-bundle-screen-is-too-noisy-it-includes-pods-run-by-the-support-bundle-itself-starting-in-kots-1-59-1
TODO: get reproduction steps, and a demonstration from the CLI
The lab example does not include the clusterResources collector. That is fine on its own, since Troubleshoot adds it automatically -- but it adds it at the end of the run rather than at the beginning. As a result, any jobs and pods started by Troubleshoot itself appear among the pods collected by clusterResources, and because they may still be in a failed or terminating state at that point, the analyzer reports them. What we should do is run clusterResources first, before any other collectors.
@dexhorthy I am reasonably confident that if we added a clusterResources: {} entry at the beginning of the collectors list, this would resolve the cosmetic issue.
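For reference, a minimal sketch of what that spec change might look like (the metadata name is a placeholder, and the lab's real collectors would follow the first entry):

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: example          # placeholder name
spec:
  collectors:
    # Placed first so its snapshot of pods is taken before any later
    # collectors (e.g. run collectors) create pods of their own.
    - clusterResources: {}
    # ...the existing collectors from the lab spec would follow here
```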
A longer term solution is to ensure that https://github.com/replicatedhq/troubleshoot/blob/e02074941e9b8d1f84ee8a3ada1ee573dedd14ab/pkg/supportbundle/collect.go#L178 doesn't append clusterResources to the end of the list, but inserts it at position 0. A simple way to do this would be to return a slice that is the clusterResources collector followed by the 'existing' collectors, in that order, rather than using append() to add clusterResources at the end.
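A minimal sketch of that idea, assuming the collector list is a slice of *troubleshootv1beta2.Collect as in the v1beta2 API (the function name here is illustrative, not the actual one in collect.go):

```go
package supportbundle

import (
	troubleshootv1beta2 "github.com/replicatedhq/troubleshoot/pkg/apis/troubleshoot/v1beta2"
)

// prependClusterResources builds a new collector list with clusterResources at
// position 0, followed by the caller's collectors, instead of append()-ing it
// to the end. (The function name is hypothetical, for illustration only.)
func prependClusterResources(existing []*troubleshootv1beta2.Collect) []*troubleshootv1beta2.Collect {
	clusterResources := &troubleshootv1beta2.Collect{
		ClusterResources: &troubleshootv1beta2.ClusterResources{},
	}
	// Allocate a fresh slice so the caller's slice is not mutated.
	out := make([]*troubleshootv1beta2.Collect, 0, len(existing)+1)
	out = append(out, clusterResources)
	return append(out, existing...)
}
```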