
Proposal for Saturation SLO

Open ArthurSens opened this issue 1 year ago • 2 comments

For a few days now I've been wondering what the implementation of a Saturation SLO based on Prometheus metrics would look like. I've come up with a design idea, so I'm opening this issue to discuss it further with the community.

The main idea here is to reuse the BoolGauge SLO as much as possible.

API:

type SaturationIndicator struct {
	// Utilization is the metric that represents the current utilization of the monitored resource.
	Utilization Query `json:"utilization"`

	// Capacity is the metric that represents the capacity of the monitored resource.
	Capacity Query `json:"capacity"`

	// Threshold is the maximum allowed utilization of the monitored resource,
	// expressed as the ratio of Utilization to Capacity.
	// It should be a number between 0 and 1.
	Threshold float64 `json:"threshold"`

	// +optional
	// Grouping allows an SLO to be defined for many SLI at once, like HTTP handlers for example.
	Grouping []string `json:"grouping"`
}
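To make the intended semantics concrete, here is a minimal sketch of how the indicator could be validated and evaluated for a single pair of samples. The `Query` shape, the `Validate` and `Saturated` helpers, and the metric names are assumptions made for illustration; none of them are part of the proposal or of Pyrra's existing API.

```go
package main

import (
	"errors"
	"fmt"
)

// Query is assumed here to wrap a single metric selector (illustrative only).
type Query struct {
	Metric string `json:"metric"`
}

// SaturationIndicator repeats the fields from the proposal above.
type SaturationIndicator struct {
	Utilization Query    `json:"utilization"`
	Capacity    Query    `json:"capacity"`
	Threshold   float64  `json:"threshold"`
	Grouping    []string `json:"grouping"`
}

// Validate is a hypothetical helper that rejects indicators with missing
// queries or a threshold outside the range 0 to 1.
func (s SaturationIndicator) Validate() error {
	if s.Utilization.Metric == "" || s.Capacity.Metric == "" {
		return errors.New("utilization and capacity queries are required")
	}
	if s.Threshold < 0 || s.Threshold > 1 {
		return fmt.Errorf("threshold must be between 0 and 1, got %v", s.Threshold)
	}
	return nil
}

// Saturated is a hypothetical helper that returns 1 when
// utilization/capacity exceeds the threshold and 0 otherwise.
func (s SaturationIndicator) Saturated(utilization, capacity float64) float64 {
	if capacity > 0 && utilization/capacity > s.Threshold {
		return 1
	}
	return 0
}

func main() {
	s := SaturationIndicator{
		Utilization: Query{Metric: "disk_used_bytes"}, // hypothetical metric name
		Capacity:    Query{Metric: "disk_size_bytes"}, // hypothetical metric name
		Threshold:   0.8,
	}
	if err := s.Validate(); err != nil {
		fmt.Println("invalid indicator:", err)
		return
	}
	fmt.Println(s.Saturated(85, 100)) // 85% used, above the 0.8 threshold
	fmt.Println(s.Saturated(50, 100)) // 50% used, below the 0.8 threshold
}
```
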

For the additional Prometheus rules, all we need to do is generate vector(1) if (Utilization / Capacity) > Threshold and vector(0) if (Utilization / Capacity) <= Threshold. With that series in place, we can reuse the same Prometheus rules used for BoolGauge:

- record: example-saturation-bool
  expr: |
    (vector(1) AND (Utilization / Capacity) > Threshold)
    OR
    vector(0)

## Same as BoolGauge below
- record: example-saturation-bool:count1w
  expr: sum (count_over_time(example-saturation-bool[1w]))

- record: example-saturation-bool:sum1w
  expr: sum (sum_over_time(example-saturation-bool[1w]))
  
- record: example-saturation-bool:burnrate1m
  expr: (sum (count_over_time(example-saturation-bool[1m])) - sum (sum_over_time(example-saturation-bool[1m]))) / sum (count_over_time(example-saturation-bool[1m]))
.
.
.
    

ArthurSens avatar Oct 27 '23 21:10 ArthurSens

@metalmatze, friendly ping! Would love to open a PR myself once we agree on a design :)

ArthurSens avatar Nov 13 '23 18:11 ArthurSens

Sorry for the late reply. I was busy organizing PromCon, speaking at SRECon and afterward moving house.

The overall proposal looks good to me. I want to make sure to try this. If we can figure out the PromQL the rest should fall into place.

metalmatze avatar Dec 01 '23 13:12 metalmatze