tokheim

Results 10 comments of tokheim

I believe you are describing the timeslices budgeting method in https://github.com/OpenSLO/OpenSLO.

There have been some similar questions in the past. What I suspect happens is that in some 5-minute periods you have no incoming requests at all, so your divisor...

MSK seems to have a similar issue with a mismatching ARN ID. As seen in https://docs.aws.amazon.com/msk/latest/developerguide/msk-create-cluster.html, a cluster is given an ARN like `"arn:aws:kafka:us-east-1:123456789012:cluster/CustomConfigExampleCluster/abcd1234-abcd-dcba-4321-a1b2abcd9f9f-2"`. The random characters postfixed in the ARN...

So I hope to experiment a bit with how we could make use of this. But we would likely want to run it in Kubernetes and want some way to...

Any update on this?

Just want to offer you a partial solution that might help. Metric names are actually treated as labels in Prometheus. So you can do an expression like `{__name__=~"slo:sli_error:ratio_rate.*",sloth_service="my-service", sloth_slo="my-slo"}`...
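To illustrate why matching on `__name__` works, here is a minimal Python sketch that models each series as a plain label set with the metric name stored under `__name__`. The sample series and the `select` helper are hypothetical and only mimic the idea; Prometheus regex matchers are fully anchored, which is why `re.fullmatch` is used.

```python
import re

# Hypothetical sample series: each one is just a label set, with the
# metric name stored under the reserved __name__ label.
series = [
    {"__name__": "slo:sli_error:ratio_rate5m", "sloth_service": "my-service", "sloth_slo": "my-slo"},
    {"__name__": "slo:sli_error:ratio_rate1h", "sloth_service": "my-service", "sloth_slo": "my-slo"},
    {"__name__": "slo:sli_error:ratio_rate5m", "sloth_service": "other-service", "sloth_slo": "my-slo"},
    {"__name__": "slo:error_budget:ratio", "sloth_service": "my-service", "sloth_slo": "my-slo"},
]


def select(series, regex_matchers, equal_matchers):
    """Return the series whose labels satisfy every matcher.

    regex_matchers behave like `label=~"pattern"` (fully anchored),
    equal_matchers behave like `label="value"`.
    """
    matched = []
    for labels in series:
        regex_ok = all(
            re.fullmatch(pattern, labels.get(name, "")) is not None
            for name, pattern in regex_matchers.items()
        )
        equal_ok = all(
            labels.get(name, "") == value
            for name, value in equal_matchers.items()
        )
        if regex_ok and equal_ok:
            matched.append(labels)
    return matched


matched = select(
    series,
    regex_matchers={"__name__": r"slo:sli_error:ratio_rate.*"},
    equal_matchers={"sloth_service": "my-service", "sloth_slo": "my-slo"},
)
```

With this sample data, `matched` holds the two `ratio_rate` series for `my-service`, since the name is filtered exactly like any other label.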

The normal approach is to consider any request that doesn't meet the latency target as a straight-up error. This would mean you should use the `le` approach, and tailor `le` to...
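As a rough sketch of that idea, the Python snippet below counts every request slower than the chosen `le` bucket as an error. It assumes cumulative histogram buckets (each bucket counts requests at or below its bound, and `+Inf` counts all requests); the bucket bounds, the counts, and the `latency_sli` helper are made up for illustration.

```python
def latency_sli(buckets, target_le):
    """Return (error_count, total_count) from cumulative histogram buckets.

    Requests that land above the target_le bucket are treated as errors.
    """
    total = buckets["+Inf"]      # every observed request
    good = buckets[target_le]    # requests at or below the latency target
    return total - good, total


# Hypothetical cumulative bucket counts, keyed by the `le` label value.
buckets = {"0.1": 900, "0.5": 980, "1": 995, "+Inf": 1000}

errors, total = latency_sli(buckets, "0.5")
error_ratio = errors / total
```

Here 20 of the 1000 requests took longer than 0.5 seconds, so the error ratio for a 0.5-second target is 0.02. Picking a different `le` value changes which requests count as errors.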

So the root issue is probably that with zero traffic `errorQuery/totalQuery` evaluates to `NaN`, especially for the short 5-minute window. Unless you use the feature in #241, `slo:period_error_budget_remaining:ratio` is just...
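The zero-traffic case can be reproduced outside Prometheus. The sketch below mimics IEEE 754 float division, which is what yields `NaN` for `0/0`, using hypothetical helper names and non-negative counts. The `ratio_or_zero` variant shows one possible policy (treat a window without traffic as having no errors); that policy is an assumption for illustration and is not meant to describe what #241 does.

```python
import math


def ratio(errors, total):
    """Divide like IEEE 754 floats: 0/0 is NaN, x/0 is +Inf for x > 0."""
    if total == 0:
        return math.nan if errors == 0 else math.inf
    return errors / total


def ratio_or_zero(errors, total):
    """Assumed policy: a window with no traffic counts as zero errors."""
    value = ratio(errors, total)
    return 0.0 if math.isnan(value) else value


no_traffic = ratio(0, 0)            # NaN, as in an idle 5-minute window
normal = ratio(5, 100)              # 0.05
guarded = ratio_or_zero(0, 0)       # 0.0 under the assumed policy
```

Any arithmetic built on top of the `NaN` result stays `NaN`, which is why a single idle window can blank out values derived from it.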

Your updated Sloth spec looks correct, though I don't have access to a Prometheus server to test the query at the moment. The dashboard might still show `NaN` until window period...

At the very least, you first need Prometheus to record maintenance windows. Either some system reports this as a metric, or if it's a fixed time, you could build a recording rule...