Use SUCCESS RATE as the criterion for health assessment

Open fogfish opened this issue 5 months ago • 1 comments

As a user, I want to reduce the number of false-positive reports so that my workflow is not interrupted by noise.

For example, the rule engine uses only absolute values to decide success or failure.

Should(rules.OsCpuUtil.Below(40.0, 60.0))

As a consequence, even if a single sampled value is above the threshold, the utility reports an error. This causes false positives. Using the percentage of successful samples as the criterion would be helpful. In the example below, it would be nice to claim failure if success rate is over 60%.

STATUS       %            MIN            AVG            MAX	 ID CHECK
FAILED  32.14%           0.03          13.33         250.61	 D3: storage i/o latency
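To make the idea concrete, here is a minimal sketch of a rate-based check. It is not the rds-health rule engine: `successRate` and `assess` are hypothetical helpers, the single threshold and the 60% cut-off are assumptions taken from the description above, and the sample values are made up.

```go
package main

import "fmt"

// successRate returns the percentage of samples strictly below the threshold.
// Hypothetical helper for illustration; it is not part of rds-health.
func successRate(samples []float64, threshold float64) float64 {
	if len(samples) == 0 {
		return 0 // no data: treated as 0% success (an assumption)
	}
	ok := 0
	for _, v := range samples {
		if v < threshold {
			ok++
		}
	}
	return 100 * float64(ok) / float64(len(samples))
}

// assess returns "PASSED" when at least minRate percent of the samples
// are below the threshold, and "FAILED" otherwise.
func assess(samples []float64, threshold, minRate float64) string {
	if successRate(samples, threshold) >= minRate {
		return "PASSED"
	}
	return "FAILED"
}

func main() {
	// Made-up CPU utilisation samples; one spike exceeds the threshold.
	cpu := []float64{12.5, 18.0, 35.2, 71.4, 22.1}
	fmt.Printf("%s %.2f%%\n", assess(cpu, 60.0, 60.0), successRate(cpu, 60.0))
}
```

With a check like this, a single spike no longer fails the rule on its own; the status flips only when the share of samples under the threshold drops below the configured rate.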

fogfish, Mar 08 '24 13:03

The success rate is calculated as a percentile of the tAvg value, which is what actually controls the status. Instead of adding an extra config parameter, we should find better ways of educating users about configuration. Visualising raw metrics would be better.

fogfish, Apr 25 '24 13:04