[Epic] Instrumentation and observability for Troubleshoot
Desired outcome
This epic aims to give people who write or maintain Troubleshoot code and specs, as well as people who run it, insight into what the process is doing during a run. The end result should be logging that shows where a run fails or hangs, or that simply documents the flow of the code through a run. It should also make it possible to measure timing and to tell whether a change has improved or degraded performance.
How to measure the outcome
- Time to debug issues with the Troubleshoot code is reduced
- Overall performance of the Troubleshoot codebase is improved
Neither of these measures has a tangible baseline right now.
Specific problem
When Troubleshoot has performance issues, or some part of it hangs, it is impossible to tell what the code is doing in that environment without adding print statements (or similar) to the code, recompiling, and deploying the updated binary to that environment.
At this stage there is minimal logging, split across two separate logging libraries (klog and log). This setup does not surface extra information about execution progress, even when the debug switch is enabled.
Additionally, when performance issues arise, there is very little that can be done to determine where the time is spent.
The test suite for Troubleshoot does not include any form of baseline performance measures, and so there is no means to determine if a change introduces a performance problem.
Design Proposal
Private: working backwards doc
- [x] #936
- [x] #947
Definition of done
Describe the specific goals that determine whether this overall task is complete. Things to consider are documentation, a high-level description of the working feature, and tests.
- [ ] Changes that introduce performance degradation to collection, analysis and redaction are identifiable during automated testing
- [ ] Methods to determine the resources (CPU, time, memory) used by functions are available (e.g. pprof)
Subtasks
- [x] #926
- [x] Consolidate debug libraries in the Troubleshoot code (https://github.com/replicatedhq/troubleshoot/pull/1008)
- [x] Add timing information to tests with a failure if the test takes more than a certain time to complete
- [ ] https://github.com/replicatedhq/troubleshoot/issues/1028
Started
- [ ]
Planned
- [x] #922
PR relevant to this epic: https://github.com/replicatedhq/troubleshoot/pull/926
Only #1028 remains.