NaN in SLO dashboard
Hello guys,
We just integrated the Sloth SLO spec into our systems. We used the two SLO dashboards from https://sloth.dev/introduction/dashboards/
After running some traffic and watching the SLO dashboard, we see this "NaN" issue, which is why I am here. It is 2024, and from reading some dev threads I wonder: is there a better solution than adding >0 or vector(1) to avoid NaNs in the first place? And how should this PromQL query be fixed? The SLI window is 5 min.
Thanks, Jennie
Actually, I just edited the PromQL to this:
slo:period_error_budget_remaining:ratio{sloth_service="${service}", sloth_slo="${slo}"} >0 or on() vector(1)
It seems to have fixed the issue. Am I correct here?
So the root issue is probably that with zero traffic errorQuery/totalQuery evaluates to NaN, especially for the short 5 min window size. Unless you use the feature in #241, slo:period_error_budget_remaining:ratio is just the average of the 5 min windows over 30 days, so as soon as the underlying metric records a NaN value, the error budget will evaluate to NaN for the next 30 days, even if the underlying query is fixed.
I'd recommend you try to fix the underlying total query so that it never evaluates to 0, possibly with >0 or on() vector(1) logic. These are the metrics that matter for alerting, and I suspect the dashboard will report 100% if the value was NaN at any point in time, no matter what budget was actually spent.
The reason this needs to be handled in the SLO query is that how these edge cases should be handled depends on the rest of the query. Still, you could argue that derived metrics like slo:period_error_budget_remaining:ratio should always ignore NaN values to make the problem less visible.
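To make the "one NaN poisons the whole period" point concrete, here is a minimal Python sketch. It is not Sloth's code: it only assumes the period value behaves like a plain average of per-window ratios, and the window numbers are made up for illustration.

```python
import math

def window_ratio(errors, total):
    """errors/total with PromQL-like float semantics: 0/0 gives NaN."""
    if total == 0:
        return math.nan  # Python would raise on 0.0/0.0, so emulate NaN
    return errors / total

def guarded_window_ratio(errors, total):
    """Same ratio, but the denominator falls back to 1 when there is no
    traffic, mimicking `(total > 0) or on() vector(1)`."""
    denom = total if total > 0 else 1.0
    return errors / denom

def period_error_ratio(window_ratios):
    """Plain average over all windows in the period."""
    return sum(window_ratios) / len(window_ratios)

# Hypothetical (errors, total) rates for three 5m windows; the middle
# window has zero traffic.
windows = [(1.0, 100.0), (0.0, 0.0), (2.0, 100.0)]

poisoned = period_error_ratio([window_ratio(e, t) for e, t in windows])
guarded = period_error_ratio([guarded_window_ratio(e, t) for e, t in windows])

print(poisoned)  # NaN: a single NaN window makes the whole average NaN
print(guarded)   # a finite average, the idle window counts as 0 errors
```

The guard has to sit in the per-window query: once a NaN sample has been recorded, the average stays NaN until that sample leaves the period.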
Thank you! I gave it a try, but it did not work. I logged in to the Prometheus UI to check the recording rule. One thing I am not clear about: which of the following two expressions is correct? I ask because I read the earlier issue https://github.com/slok/sloth/issues/231
>0 or on() vector(1)
>0 or vector(1)
Here is my updated Sloth spec. I updated "total_query" to add >0 or vector(1):
sli:
  events:
    error_query: sum(rate(nginx_requests{path="/auth",service="myservice",status=~"(5..|499|409)"}[{{.window}}]))
    total_query: (sum(rate(nginx_requests{path="/auth",service="myservice"}[{{.window}}])) >0) or vector(1)
Your updated Sloth spec looks correct, though I don't have access to a Prometheus server to test the query at the moment. The dashboard might still show NaN until the window period is over. You could use the Prometheus admin endpoints to clear the metric history, rename or relabel the SLO, or wait it out.
You can read about on() here. An expression like (up{instance="x"} > 0) or vector(1) would normally produce two time series due to the label mismatch, while (up{instance="x"} > 0) or on() vector(1) produces a single time series, as you'd want in an SLO. Since you do a sum(...), your expression doesn't have any labels, so either should work, but using on() is maybe the safer choice.
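The difference between the two forms can be sketched with a toy Python model of `or`. This is a simplification, not Prometheus code: the helpers `vec` and `promql_or` are hypothetical, and an instant vector is modelled as a dict keyed by its label set.

```python
def vec(labels, value):
    """One-sample instant vector: {frozen label set: value}."""
    return {frozenset(labels.items()): value}

def promql_or(left, right, on=None):
    """Simplified model of PromQL `or`.

    Keeps every left sample, then adds right samples whose matching key
    is not present on the left. `on=None` matches on all labels;
    `on=()` mimics `on()`, i.e. matching on no labels at all.
    """
    def key(labelset):
        if on is None:
            return labelset
        return frozenset((k, v) for k, v in labelset if k in on)

    left_keys = {key(ls) for ls in left}
    result = dict(left)
    for ls, value in right.items():
        if key(ls) not in left_keys:
            result[ls] = value
    return result

up = vec({"instance": "x"}, 1.0)  # stands in for up{instance="x"} > 0
one = vec({}, 1.0)                # stands in for vector(1)
empty = {}                        # left side fully filtered out by `> 0`

without_on = promql_or(up, one)          # label sets differ, both kept
with_on = promql_or(up, one, on=())      # left present, vector(1) dropped
fallback = promql_or(empty, one, on=())  # left empty, vector(1) is used

print(len(without_on), len(with_on), len(fallback))
```

With a label-free left side such as `sum(...)`, both forms match on the empty label set, which is why either works in the spec above.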
Thank you! After updating the total_query and renaming the SLO, I got a negative value, i.e. -448%. Can you explain why, and whether it makes sense?
Here is my SLO query:
total_query: (sum(rate(nginx_requests{path="/auth",service="myservice"}[{{.window}}])) >0) or on() vector(1)
attached screenshot
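One possible reading of a negative value, sketched in Python under stated assumptions: if the remaining budget is computed as 1 minus the burn rate (error ratio divided by the allowed error budget), it goes negative as soon as more than the whole budget has been spent. The objective and error ratio below are hypothetical, picked only because they reproduce a figure like -448%. The thread does not state the real objective.

```python
def budget_remaining(error_ratio, objective):
    """Remaining error budget as a ratio, assuming
    remaining = 1 - error_ratio / (1 - objective)."""
    error_budget = 1.0 - objective      # allowed error ratio
    burn = error_ratio / error_budget   # 1.0 means the budget is exactly spent
    return 1.0 - burn

# Hypothetical numbers: 99.9% objective, 0.548% of requests failing.
remaining = budget_remaining(error_ratio=0.00548, objective=0.999)
print(f"{remaining:.0%}")  # about -448%: the budget was spent 5.48 times over
```

Under that reading, a negative number is not a NaN-style artifact. It would mean the measured error ratio over the period is several times the allowed one, so it is worth checking which statuses the error_query counts.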