sloth
Best strategy to manage > 400 SLOs
Hi there! We will have to create a lot of SLOs (loading time, error rate, etc. for > 200 endpoints). Looking at what sloth generated as rules, for example:
- record: slo:sli_error:ratio_rate5m
  expr: |
    (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m])))
    /
    (sum(rate(http_request_duration_seconds_count{job="myservice"}[5m])))
- record: slo:sli_error:ratio_rate30d
  expr: |
    sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
    / ignoring (sloth_window)
    count_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
I have a question about how to manage hundreds of records when we need to specify the service name and handler properties in each one. Would it be correct, and worth it, to create a single record without specifying the job/service or any other property, like the following ones:
- record: slo:sli_error:ratio_rate5m
  expr: |
    (sum(rate(http_request_duration_seconds_count{code=~"(5..|429)"}[5m])))
    /
    (sum(rate(http_request_duration_seconds_count{}[5m])))
- record: slo:sli_error:ratio_rate30d
  expr: |
    sum_over_time(slo:sli_error:ratio_rate5m{}[30d])
    / ignoring (sloth_window)
    count_over_time(slo:sli_error:ratio_rate5m{}[30d])
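For comparison only (this is just a sketch, not something sloth generates), a single rule could also keep the per-service/handler breakdown by grouping on those labels instead of dropping them entirely; the "handler" label name is an assumption about the metric:
# Sketch only, not sloth output; "handler" is an assumed label on the metric.
- record: slo:sli_error:ratio_rate5m
  expr: |
    (sum by (job, handler) (rate(http_request_duration_seconds_count{code=~"(5..|429)"}[5m])))
    /
    (sum by (job, handler) (rate(http_request_duration_seconds_count[5m])))
One rule definition then yields one series per job/handler, but it would not carry the sloth_id/sloth_service/sloth_slo labels that the generated 30d and burn-rate rules filter on, so it can't be dropped into the sloth output as-is.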
Thanks a lot for your help!
I asked myself the same question a few days ago. I'm in a similar spot (20 microservices, at least 10 SLOs per service). Currently I'm doing a PoC with Helm, where I wrap around sloth so that for a service deployment I only have to specify a single variable (or two) and all of its SLOs get deployed. From what I've seen with the few services I ran the PoC with, it makes them quite easy to manage if they are grouped.
Thanks @klubi for your answer! Which records did you create then? A single one to handle all the microservices, like above? Thanks for sharing.
In the end I'll most likely end up with both, but currently the main focus is on SLOs per microservice. The reasoning behind it is that different microservices can have different objectives for the same metric.
snippet from my chart:
sli:
  events:
    {{- if $maybeLogsErrorLevel.totalQuery }}
    totalQuery: {{ $maybeLogsErrorLevel.totalQuery }}
    {{- else if eq $tech "spring" }}
    totalQuery: sum(rate(logback_events_total{application="{{ .name }}", namespace="{{ $namespace }}"}[{{ "{{.window}}" }}]))
    {{- else if eq $tech "go" }}
    totalQuery: sum(rate(log_statements_total{application="{{ .name }}", namespace="{{ $namespace }}"}[{{ "{{.window}}" }}]))
    {{- end }}
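To illustrate the deployment side, the per-service input ends up being roughly this shape (the key names and values below are made up for the example, not the chart's real interface):
# Hypothetical values fragment for one service deployment; key names are assumptions.
slos:
  tech: spring          # drives the $tech branch in the template above
  logsErrorLevel:
    objective: 99.5     # per-service objective, since these differ between services
Everything else comes from the chart defaults, which is what keeps it at a variable or two per service.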
I know what you mean, and it works for the SLO configs, but I was talking more about the Prometheus records, so that we don't have to create one for each application/handler combination. Example:
- record: slo:sli_error:ratio_rate5m
  expr: |
    (sum(rate(http_request_duration_seconds_count{code=~"(5..|429)"}[5m])))
    /
    (sum(rate(http_request_duration_seconds_count{}[5m])))
Oh, sorry, I guess I misunderstood your question.
Where do you deploy your rules? Into Kubernetes with an operator?
Either way, I think it's better to keep records per service, mostly for maintainability reasons. If you deprecate a service, you just delete the whole rule group (or a single PrometheusRule manifest) and all of that service's records are gone. Sure, the definitions are duplicated and could be simplified, but is it worth it?
We are moving to Argo CD and storing these things in files. The thing is that we can reach > 300 SLOs. When I look at the records that get generated for just one SLO, I wonder what it will look like with hundreds of SLOs...
# Code generated by Sloth (v0.11.0): https://github.com/slok/sloth.
# DO NOT EDIT.
groups:
- name: sloth-slo-sli-recordings-myservice-requests-availability
  rules:
  - record: slo:sli_error:ratio_rate5m
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[5m])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[5m])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 5m
      tier: "2"
  - record: slo:sli_error:ratio_rate30m
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[30m])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[30m])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 30m
      tier: "2"
  - record: slo:sli_error:ratio_rate1h
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[1h])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[1h])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 1h
      tier: "2"
  - record: slo:sli_error:ratio_rate2h
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[2h])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[2h])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 2h
      tier: "2"
  - record: slo:sli_error:ratio_rate6h
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[6h])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[6h])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 6h
      tier: "2"
  - record: slo:sli_error:ratio_rate1d
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[1d])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[1d])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 1d
      tier: "2"
  - record: slo:sli_error:ratio_rate3d
    expr: |
      (sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[3d])))
      /
      (sum(rate(http_request_duration_seconds_count{job="myservice"}[3d])))
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 3d
      tier: "2"
  - record: slo:sli_error:ratio_rate30d
    expr: |
      sum_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
      / ignoring (sloth_window)
      count_over_time(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}[30d])
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_window: 30d
      tier: "2"
- name: sloth-slo-meta-recordings-myservice-requests-availability
  rules:
  - record: slo:objective:ratio
    expr: vector(0.9990000000000001)
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: slo:error_budget:ratio
    expr: vector(1-0.9990000000000001)
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: slo:time_period:days
    expr: vector(30)
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: slo:current_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}
      / on(sloth_id, sloth_slo, sloth_service) group_left
      slo:error_budget:ratio{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: slo:period_burn_rate:ratio
    expr: |
      slo:sli_error:ratio_rate30d{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}
      / on(sloth_id, sloth_slo, sloth_service) group_left
      slo:error_budget:ratio{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"}
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: slo:period_error_budget_remaining:ratio
    expr: 1 - slo:period_burn_rate:ratio{sloth_id="myservice-requests-availability",
      sloth_service="myservice", sloth_slo="requests-availability"}
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_service: myservice
      sloth_slo: requests-availability
      tier: "2"
  - record: sloth_slo_info
    expr: vector(1)
    labels:
      owner: myteam
      repo: myorg/myservice
      sloth_id: myservice-requests-availability
      sloth_mode: cli-gen-prom
      sloth_objective: "99.9"
      sloth_service: myservice
      sloth_slo: requests-availability
      sloth_spec: prometheus/v1
      sloth_version: v0.11.0
      tier: "2"
- name: sloth-slo-alerts-myservice-requests-availability
  rules:
  - alert: MyServiceHighErrorRate
    expr: |
      (
          max(slo:sli_error:ratio_rate5m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (14.4 * 0.0009999999999999432)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate1h{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (14.4 * 0.0009999999999999432)) without (sloth_window)
      )
      or
      (
          max(slo:sli_error:ratio_rate30m{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (6 * 0.0009999999999999432)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate6h{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (6 * 0.0009999999999999432)) without (sloth_window)
      )
    labels:
      category: availability
      routing_key: myteam
      severity: pageteam
      sloth_severity: page
    annotations:
      summary: High error rate on 'myservice' requests responses
      title: (page) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget
        burn rate is too fast.
  - alert: MyServiceHighErrorRate
    expr: |
      (
          max(slo:sli_error:ratio_rate2h{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (3 * 0.0009999999999999432)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate1d{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (3 * 0.0009999999999999432)) without (sloth_window)
      )
      or
      (
          max(slo:sli_error:ratio_rate6h{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (1 * 0.0009999999999999432)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate3d{sloth_id="myservice-requests-availability", sloth_service="myservice", sloth_slo="requests-availability"} > (1 * 0.0009999999999999432)) without (sloth_window)
      )
    labels:
      category: availability
      severity: slack
      slack_channel: '#alerts-myteam'
      sloth_severity: ticket
    annotations:
      summary: High error rate on 'myservice' requests responses
      title: (ticket) {{$labels.sloth_service}} {{$labels.sloth_slo}} SLO error budget
        burn rate is too fast.
So... if you can, I'd suggest using operators, two of them: sloth and the Prometheus operator.
Then you won't have to store long Prometheus rule files, just the SLO definitions, and both operators make them easier to manage and deploy.
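For reference, the per-service definition stored with the sloth operator is roughly a PrometheusServiceLevel manifest like the one below (values reused from the example above, trimmed to one SLO; the namespace is an assumption about where your operators watch):
# Roughly the per-service definition with the sloth operator; values reused from
# the example above, namespace is an assumption.
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: myservice
  namespace: monitoring
spec:
  service: myservice
  labels:
    owner: myteam
    repo: myorg/myservice
    tier: "2"
  slos:
    - name: requests-availability
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_request_duration_seconds_count{job="myservice",code=~"(5..|429)"}[{{.window}}]))
          totalQuery: sum(rate(http_request_duration_seconds_count{job="myservice"}[{{.window}}]))
      alerting:
        name: MyServiceHighErrorRate
        labels:
          category: availability
        annotations:
          summary: High error rate on 'myservice' requests responses
        pageAlert:
          labels:
            routing_key: myteam
            severity: pageteam
        ticketAlert:
          labels:
            severity: slack
            slack_channel: "#alerts-myteam"
The sloth operator expands that into the PrometheusRule content shown earlier and the Prometheus operator loads it, so only this short spec has to live in git.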
As a PoC I created 3 metrics and used them with 24 services.
The file with the PrometheusServiceLevel manifests is 1.3k lines long.
If I were to generate PrometheusRules from that, it would be 11.5k lines...
That's why I don't store either of them... I just configure the Helm chart at deployment time, which requires ~30 lines.
I'm honestly just concerned about the number of alert rule groups/namespaces this will generate. I'm on an automated path where file maintenance isn't an issue, but the "damage" in the UI is a bit nasty. Is there a way to have Sloth generate all rules into a single group for a given file?
Ideally I'd have a "sloth-slo" namespace with a group for each Sloth file I generate. Is this possible without manipulating the generated Prometheus file?