sloth icon indicating copy to clipboard operation
sloth copied to clipboard

NaN in SLO dashboard

Open jenniejia opened this issue 1 year ago • 5 comments

Hello Guys,

We just integrated Sloth SLO Spec to our systems. We used two SLO dashboard from here https://sloth.dev/introduction/dashboards/

Run some traffic and Observed the SLO dashboard, see this "NaN" issue, this is why I am here.. it is Year 2024.. By reading some dev thread. Do we have better solution instead of doing >0 or vector(1) things to avoid NaNs in the first place. But how to fix this PromQL query? SLI window is 5 min

NaN_issue

Thanks, Jennie

jenniejia avatar Jan 23 '24 19:01 jenniejia

Actually I just edit the PromQL to like this slo:period_error_budget_remaining:ratio{sloth_service="${service}", sloth_slo="${slo}"} >0 or on() vector(1)

It seems fixed the issue.. am I correct here?

NaN_fix

jenniejia avatar Jan 23 '24 19:01 jenniejia

So root issue is probably that with zero traffic errorQuery/totalQuery evaluates to NaN especially for the short 5min window size. Unless you use the feature in #241 slo:period_error_budget_remaining:ratio is just the average of the 5 min windows over 30 days, so as soon as the underlying metric records a NaN value the error budget will evaluate to NaN for the next 30 days even if underlying query is fixed.

I'd recommend you try to fix the underlying total query to never evaluate to 0, possibly with >0 or on() vector(1)-logic. These are the metrics that matter for alerting, and I suspect the dashboard will report 100% if value was NaN at any point in time, no matter what budget was actually spent.

Reason why this needs to be handled in the slo query is that how these edge cases should be handled depends on the rest of the query. Still you could argue for derived metrics like slo:period_error_budget_remaining:ratio always should ignore NaN values to make the problem less visible.

tokheim avatar Jan 23 '24 21:01 tokheim

Thank you! But I did a try, it did not work.. I do login to the Prometheus UI to check the recording rule.. One thing I am not clear the following two statement which is correct? since I read previous issue https://github.com/slok/sloth/issues/231

>0 or on() vector(1)

>0 or vector(1)

Here is my updated Sloth Spec, I update the "total_query" to add >0 or vector(1)

sli: events: error_query: sum(rate(nginx_requests{path="/auth",service="myservice",status=~"(5..|499|409)"}[{{.window}}])) total_query: (sum(rate(nginx_requests{path="/auth",service="myservice"}[{{.window}}])) >0) or vector(1)

jenniejia avatar Jan 24 '24 18:01 jenniejia

Your updated Sloth spec looks correct, though I don't have access to a prometheus server to test the query at the moment. Dashboard might still show NaN until window period is over. You could use prometheus admin endpoints to clear metric history, rename or relabel the slo, or wait it out.

You can read about on() here. An expression like (up{instance="x"} > 0) or vector(1) would normally produce two timeseries due to the label mismatch, while (up{instance="x"} > 0) or on() vector(1) produces a single time series like you'd want in a SLO. Since you do a sum(...) your expression doesn't have a label so either should work, but using on() is maybe the safer choice.

tokheim avatar Jan 24 '24 21:01 tokheim

Thank you! after update the total_query and rename the SLO. I got the negative value . i.e. -448%, can you explain to me why and if it make senses? Here is my slo query total_query: (sum(rate(nginx_requests{path="/auth",service="myservice"}[{{.window}}])) >0) or on() vector(1) attached screenshot negative_slo

jenniejia avatar Jan 26 '24 14:01 jenniejia