Improving Sloth SLOs dashboard

Open w-reichert opened this issue 4 years ago • 8 comments

Hi Xabier, first of all, many thanks for the Sloth SLOs sample dashboard (https://grafana.com/grafana/dashboards/14348)! We have been using it for a while. :-)

I noticed that the color coding and ranges for the Remaining error budget (month) panel are not correct. The panel starts in red if there are no values, is yellow if there are no errors, and is green if the budget is below 40%. Furthermore, I suppose negative values should be cut off, since empty is empty.

My suggestions:

          "description": "This month remaining error budget, starts the 1st of the month and ends  28th-31st (not rolling window)",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "mappings": [],
              "max": 1,
              "min": 0,
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "grey",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 0
                  },
                  {
                    "color": "orange",
                    "value": 0.01
                  },
                  {
                    "color": "light-yellow",
                    "value": 0.2
                  },
                  {
                    "color": "green",
                    "value": 0.8
                  }
                ]
              },
              "unit": "percentunit"
            },
            "overrides": []
          },

and likewise for the panel "A rolling window of the total period (30d) error budget remaining".
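To make the intended color ranges easier to check, here is a minimal Python sketch of how absolute threshold steps like the ones suggested above would resolve a value to a color. It assumes that the base step (`"value": null`) behaves like minus infinity and that a missing or NaN value falls back to that base color; `STEPS` and `color_for` are hypothetical names used only for illustration, not part of Grafana or Sloth.

```python
import math

# Threshold steps copied from the suggested panel JSON.
# None marks the base step ("value": null).
STEPS = [
    (None, "grey"),
    (0.0, "red"),
    (0.01, "orange"),
    (0.2, "light-yellow"),
    (0.8, "green"),
]


def color_for(value, steps=STEPS):
    """Return the color of the last step whose lower bound is <= value.

    Assumption: the base step acts as minus infinity, and a missing
    or NaN value falls back to the base color.
    """
    active = steps[0][1]
    if value is None or math.isnan(value):
        return active
    for bound, color in steps[1:]:
        if value >= bound:
            active = color
        else:
            break
    return active


if __name__ == "__main__":
    for sample in (float("nan"), 0.0, 0.005, 0.1, 0.5, 1.0):
        print(sample, color_for(sample))
```

Under these assumptions, a panel without data shows grey, an exhausted budget shows red, and only a budget of 80% or more shows green.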

Furthermore, I suggest cutting off the negative budget (this expression also occurs twice):

        "expr": "1-clamp_max(sum_over_time( ... ) , 1)",
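The effect of the clamp can be sketched in a few lines of Python. This assumes that the elided `sum_over_time( ... )` term yields the fraction of the error budget already consumed (1 meaning fully consumed); `remaining_budget` is a hypothetical helper for illustration only.

```python
def remaining_budget(consumed_ratio: float) -> float:
    """Mirror 1 - clamp_max(consumed, 1).

    Consumption is capped at 100%, so the remaining budget
    never drops below 0.
    """
    return 1 - min(consumed_ratio, 1.0)


if __name__ == "__main__":
    # 25% consumed leaves 75%; 140% consumed is shown as 0% rather than -40%.
    for consumed in (0.0, 0.25, 1.0, 1.4):
        print(consumed, remaining_budget(consumed))
```

The trade-off is that an overspent budget is displayed as empty, so the panel no longer shows by how much the budget was exceeded.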

Thanks and regards Wolfgang

w-reichert avatar Nov 05 '21 15:11 w-reichert

Hey @w-reichert!

Thanks for bringing this up!

I'm planning some changes in Sloth that may affect the dashboards, so when I tackle those it will be a good time to revisit this.

Best,

slok avatar Nov 06 '21 11:11 slok

I'm using Sloth SLOs and the Grafana dashboard too. It is pretty easy to use and has been working great so far! I also have a feature request for the dashboard. I usually watch the Month error budget burn chart panel for monitoring, but I can't tell at a glance whether the current burn rate is good. I would suggest showing the graph in different colors or drawing an additional line at a burn rate of 1. I'm trying the latter solution, which looks like this: [image]

Anyway, thanks for providing this product!

itkq avatar Nov 07 '21 09:11 itkq

Xabier, thanks for the quick response. When you have a new version of Sloth and/or the dashboard we would love to test it and provide feedback.

Regards, Wolfgang

w-reichert avatar Nov 09 '21 08:11 w-reichert

@slok Thank you for your great contributions to the SRE world. I see v0.9.0 has been released; did you include the above request in this release?

rellupuru avatar Nov 15 '21 19:11 rellupuru

Not yet, I'll need a bit more time.

slok avatar Nov 15 '21 19:11 slok

Hi @w-reichert!

I've reviewed what you said about the colors, and I did that on purpose. Mainly, the error budget you have is meant to be consumed, so the perfect error budget left would be 0%. Below that means you didn't achieve the reliability you were supposed to have, and above that means you didn't consume enough (too few experiments, shipping features too slowly...).

Anyhow, I would happily change that if people prefer that kind of traffic-light coloring as you approach 0% error budget left. Regarding the negative part, you are right: I left it unclamped on purpose, so that people are aware of how much they fail.

slok avatar Dec 06 '21 11:12 slok

@itkq Check #216

slok avatar Dec 06 '21 12:12 slok

Hi Xabier @slok, thanks for looking into my recommendations.

Actually, the issue we saw started with a red NaN value. Obviously this happens if a service has not been running long enough to collect 30 days of metrics. Hence my suggestion to begin with "color": "grey" for "value": null. Then "red" may follow for a high negative value.

w-reichert avatar Dec 07 '21 13:12 w-reichert