pyrra
Proposal for Saturation SLO
For a few days now I've been wondering what the implementation of a Saturation SLO based on Prometheus metrics would look like. I've come up with a design idea, so I'm opening this issue to discuss it with the community.
The main idea here is to reuse the BoolGauge SLO as much as possible.
API:
type SaturationIndicator struct {
	// Utilization is the metric that represents the current utilization of the monitored resource.
	Utilization Query `json:"utilization"`
	// Capacity is the metric that represents the capacity of the monitored resource.
	Capacity Query `json:"capacity"`
	// Threshold is the maximum allowed utilization of the monitored resource,
	// expressed as the ratio of Utilization to Capacity.
	// It should be a number between 0 and 1.
	Threshold float64 `json:"threshold"`
	// +optional
	// Grouping allows an SLO to be defined for many SLIs at once, like HTTP handlers for example.
	Grouping []string `json:"grouping"`
}
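To make the constraints in the comments concrete, here is a minimal sketch of how the struct could be validated. The Validate method and the shape of Query (a single Metric field) are assumptions made for illustration and are not part of the proposal; the metric names in main are hypothetical as well.

```go
package main

import (
	"errors"
	"fmt"
)

// Query is an assumed stand-in for the Query type used by the proposal:
// here it only carries a PromQL metric selector.
type Query struct {
	Metric string `json:"metric"`
}

// SaturationIndicator repeats the proposed API so the sketch is self-contained.
type SaturationIndicator struct {
	Utilization Query    `json:"utilization"`
	Capacity    Query    `json:"capacity"`
	Threshold   float64  `json:"threshold"`
	Grouping    []string `json:"grouping"`
}

// Validate is a hypothetical helper that checks the constraints stated in
// the field comments: both metrics are set and Threshold lies in [0, 1].
func (s SaturationIndicator) Validate() error {
	if s.Utilization.Metric == "" {
		return errors.New("utilization metric must be set")
	}
	if s.Capacity.Metric == "" {
		return errors.New("capacity metric must be set")
	}
	// The negated range check also rejects NaN, which fails every comparison.
	if !(s.Threshold >= 0 && s.Threshold <= 1) {
		return fmt.Errorf("threshold must be between 0 and 1, got %v", s.Threshold)
	}
	return nil
}

func main() {
	// Hypothetical metric names, used only to exercise the check.
	indicator := SaturationIndicator{
		Utilization: Query{Metric: "disk_used_bytes"},
		Capacity:    Query{Metric: "disk_size_bytes"},
		Threshold:   0.8,
	}
	fmt.Println("validation error:", indicator.Validate())
}
```

Rejecting an out-of-range threshold early would keep an impossible ratio such as 1.5 from ever reaching the generated rules.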
For the additional Prometheus rules, all we need to do is generate vector(1) if (Utilization / Capacity) > Threshold and vector(0) if (Utilization / Capacity) <= Threshold. From this, we can reuse the same Prometheus rules used for BoolGauge:
- record: example_saturation_bool
  expr: |
    (vector(1) AND (Utilization / Capacity) > Threshold)
    OR
    vector(0)
## Same from BoolGauge below
- record: example_saturation_bool:count1w
  expr: sum (count_over_time(example_saturation_bool[1w]))
- record: example_saturation_bool:sum1w
  expr: sum (sum_over_time(example_saturation_bool[1w]))
- record: example_saturation_bool:burnrate1m
  expr: (sum (count_over_time(example_saturation_bool[1m])) - sum (sum_over_time(example_saturation_bool[1m]))) / sum (count_over_time(example_saturation_bool[1m]))
.
.
.
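As a rough sketch of how the rule generation could work, the Go snippet below builds a 0/1 expression and reproduces the burn rate arithmetic on plain numbers. It swaps in PromQL's bool comparison modifier, which returns 1 where the comparison holds and 0 otherwise while keeping the series labels, as one possible alternative to the vector(1)/vector(0) form. The function names boolExpr and burnRate and the metric names are hypothetical; this is not how pyrra generates its rules.

```go
package main

import (
	"fmt"
	"strconv"
)

// boolExpr returns a PromQL expression that evaluates to 1 for every series
// where utilization/capacity exceeds threshold and to 0 otherwise.
// It relies on the "bool" modifier of PromQL comparison operators.
func boolExpr(utilization, capacity string, threshold float64) string {
	t := strconv.FormatFloat(threshold, 'g', -1, 64)
	return fmt.Sprintf("(%s / %s) > bool %s", utilization, capacity, t)
}

// burnRate mirrors the arithmetic of the burnrate rule,
// (count - sum) / count, over a slice of 0/1 samples.
// An empty window is reported as 0 here, which is an arbitrary choice.
func burnRate(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, v := range samples {
		sum += v
	}
	count := float64(len(samples))
	return (count - sum) / count
}

func main() {
	// Hypothetical metric names and threshold.
	fmt.Println(boolExpr("disk_used_bytes", "disk_size_bytes", 0.8))
	// Four samples of the 0/1 series within one window.
	fmt.Println(burnRate([]float64{1, 1, 0, 1}))
}
```

Note that (count - sum) / count is the fraction of samples equal to 0, so the generated rules would need to agree on whether 1 marks a saturated sample or a healthy one.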
@metalmatze, friendly ping! Would love to open a PR myself once we agree on a design :)
Sorry for the late reply. I was busy organizing PromCon, speaking at SRECon and afterward moving house.
The overall proposal looks good to me. I want to make sure to try this. If we can figure out the PromQL, the rest should fall into place.