
[Epic] Instrumentation and observability for Troubleshoot

Open xavpaice opened this issue 2 years ago • 2 comments

Desired outcome

This epic is intended to give people who write or maintain Troubleshoot code and specs, as well as people who use the code, insight into what the process is doing when it runs. The final result should provide logging that lets people see at which point things fail or hang, or simply follow the flow of the code through a run. It should also make it possible to measure timing and to determine whether a change has improved or degraded performance.

How to measure the outcome

  • Time to debug issues with the Troubleshoot code is reduced
  • Overall performance of the Troubleshoot codebase is improved

Neither of these measures has a tangible baseline right now.

Specific problem

When running Troubleshoot, if there are performance issues or some part hangs, it is impossible to tell what the code is doing in that environment without adding print statements (or similar) to the code, recompiling, and deploying the updated binary to that environment.

At this stage there is minimal logging, split across two separate log libraries (klog and log). This pattern does not allow extra information about execution progress to be displayed, even when the debug switch is enabled.

Additionally, when performance issues arise, there is very little that can be done to determine where the time is spent.
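
As a rough illustration of the kind of timing information that is missing, here is a minimal sketch using only the Go standard library. The names `stepTimer`, `track` and `simulateWork` are hypothetical and do not exist in Troubleshoot; the step names "collect" and "redact" are placeholders for whatever phases a real run would have.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// stepTimer accumulates wall-clock time per named step.
// Hypothetical helper for illustration, not part of Troubleshoot.
type stepTimer struct {
	durations map[string]time.Duration
}

func newStepTimer() *stepTimer {
	return &stepTimer{durations: make(map[string]time.Duration)}
}

// track runs fn and adds its elapsed time to the total for name.
func (s *stepTimer) track(name string, fn func()) {
	start := time.Now()
	fn()
	s.durations[name] += time.Since(start)
}

// simulateWork stands in for a real collector, analyzer or redactor.
func simulateWork(ms int) {
	time.Sleep(time.Duration(ms) * time.Millisecond)
}

func main() {
	timer := newStepTimer()
	timer.track("collect", func() { simulateWork(20) })
	timer.track("redact", func() { simulateWork(5) })

	// Print steps in a stable order; map iteration order is random.
	names := make([]string, 0, len(timer.durations))
	for name := range timer.durations {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s took %s\n", name, timer.durations[name])
	}
}
```

Wrapping each phase this way would show where the time goes without recompiling with ad hoc print statements, though a real implementation might prefer a tracing library over a hand-rolled timer.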

The test suite for Troubleshoot does not include any form of baseline performance measures, and so there is no means to determine if a change introduces a performance problem.

Design Proposal

Private: working backwards doc

  • [x] #936
  • [x] #947

Definition of done

Describe the specific goals that determine whether this overall task can be considered complete. Things to consider are documentation, a high-level description of the working feature, and tests.

  • [ ] Changes that introduce performance degradation to collection, analysis and redaction are identifiable during automated testing
  • [ ] Methods to determine the resources (CPU, time, memory) used by functions are available (e.g. pprof)

Subtasks

  • [x] #926
  • [x] Consolidate debug libraries in the Troubleshoot code (https://github.com/replicatedhq/troubleshoot/pull/1008)
  • [x] Add timing information to tests with a failure if the test takes more than a certain time to complete
  • [ ] https://github.com/replicatedhq/troubleshoot/issues/1028
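
The timing subtask above (fail a test that takes longer than a set time) could look roughly like the following. This is a sketch under assumptions: `assertWithin`, `fataler`, `recorder` and `runDemo` are hypothetical names, and the budgets are arbitrary. The `fataler` interface is the subset of `*testing.T` the helper needs, so in a real `_test.go` file a `*testing.T` could be passed directly.

```go
package main

import (
	"fmt"
	"time"
)

// fataler is the part of *testing.T used by assertWithin, so the helper
// can also be exercised outside a test binary.
type fataler interface {
	Fatalf(format string, args ...any)
}

// assertWithin runs fn and reports a failure if it exceeds budget.
// Hypothetical helper for illustration.
func assertWithin(t fataler, name string, budget time.Duration, fn func()) {
	start := time.Now()
	fn()
	if elapsed := time.Since(start); elapsed > budget {
		t.Fatalf("%s took %s, budget is %s", name, elapsed, budget)
	}
}

// recorder is a stand-in for *testing.T that just collects failures.
type recorder struct {
	failures []string
}

func (r *recorder) Fatalf(format string, args ...any) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}

// runDemo checks one step that fits its budget and one that does not.
func runDemo() *recorder {
	r := &recorder{}
	assertWithin(r, "fast step", 500*time.Millisecond, func() {
		time.Sleep(1 * time.Millisecond)
	})
	assertWithin(r, "slow step", 5*time.Millisecond, func() {
		time.Sleep(30 * time.Millisecond)
	})
	return r
}

func main() {
	r := runDemo()
	fmt.Printf("%d failure(s)\n", len(r.failures))
	for _, f := range r.failures {
		fmt.Println(f)
	}
}
```

Wall-clock budgets like these can be noisy on shared CI runners, so generous margins, or Go benchmarks compared across runs, may be a better fit for catching regressions.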

Started

  • [ ]

Planned

  • [x] #922

xavpaice avatar Dec 21 '22 00:12 xavpaice

https://github.com/replicatedhq/troubleshoot/pull/926 pr relevant to this epic

banjoh avatar Dec 22 '22 17:12 banjoh

just #1028 remaining

xavpaice avatar May 28 '23 23:05 xavpaice