
Feedback: 'Monitoring NServiceBus endpoints with Prometheus and Grafana'


@andreasohlund @lailabougria

Using this article as a guide, I am attempting to calculate an SLI per microservice, per message type, for the message processing failure rate. For some measured intervals, I have noticed that the failure count can occasionally exceed the fetch count for certain message types. As a result, the failure ratio can exceed 100% at those moments, which is a little weird.

Is this caused by immediate retries, where a single fetch can result in zero or more failures as well as zero or one success? If so, what would you recommend to more accurately represent the "number of attempted executions of a handler", which would be a more appropriate denominator for this calculation than "fetches"?
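
One option I have been considering is to approximate attempted executions as successes plus failures. This is only a sketch: the `nservicebus_core_nservicebus_messaging_successes` metric name below is an assumption that mirrors the naming of the fetches/failures counters in my queries and may not match what the exporter actually emits, and series with zero successes or zero failures over the window would drop out of the one-to-one vector matching.

```promql
# Failure rate with "attempted executions" (successes + failures) as the denominator.
# NOTE: the successes metric name is assumed, mirroring the fetches/failures naming.
sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type)
  (rate(nservicebus_core_nservicebus_messaging_failures[$__rate_interval]))
/
(
  sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type)
    (rate(nservicebus_core_nservicebus_messaging_successes[$__rate_interval]))
  +
  sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type)
    (rate(nservicebus_core_nservicebus_messaging_failures[$__rate_interval]))
)
```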

Example visualization where the SLI exceeds 1: (screenshot)

Example visualization of the two separate time series, fetches (yellow) and failures (green), overlaid: (screenshot). You can see a few places where failures leap above fetches.

To do this calculation across all my services and message types I would use this PromQL:

```promql
sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_core_nservicebus_messaging_failures[$__rate_interval]))
/
sum by(kubernetes_namespace, app_kubernetes_io_name, nservicebus_message_type) (rate(nservicebus_core_nservicebus_messaging_fetches[$__rate_interval]))
```

To filter to a specific namespace and service the PromQL would be:

```promql
sum (rate(nservicebus_core_nservicebus_messaging_failures{kubernetes_namespace="mynamespace", app_kubernetes_io_name="my-service-name", nservicebus_message_type=~"some.file.type.name.."}[$__rate_interval]))
/
sum (rate(nservicebus_core_nservicebus_messaging_fetches{kubernetes_namespace="mynamespace", app_kubernetes_io_name="my-service-name", nservicebus_message_type=~"some.file.type.name.."}[$__rate_interval]))
```
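
As a purely cosmetic stop-gap, the ratio could be capped at 1 with `clamp_max` so dashboards do not show values above 100%; this hides the anomaly rather than explaining it. The message-type filter is omitted here for brevity:

```promql
# Cap the failure ratio at 1 for display purposes only.
clamp_max(
  sum(rate(nservicebus_core_nservicebus_messaging_failures{kubernetes_namespace="mynamespace", app_kubernetes_io_name="my-service-name"}[$__rate_interval]))
  /
  sum(rate(nservicebus_core_nservicebus_messaging_fetches{kubernetes_namespace="mynamespace", app_kubernetes_io_name="my-service-name"}[$__rate_interval])),
  1
)
```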

Note: in production we are still using prometheus-net rather than the OpenTelemetry Prometheus exporter, so the metric names shown here may not exactly match what you see.

Calculation from the article (slightly different because I am interested in "unavailability"/failure rate):

> **SLO Calculation**
>
> Another common use case for the rate() function is calculating SLIs and checking that you do not violate your SLO/SLA. Google has released a popular book for site-reliability engineers. Here is how they calculate the availability of their services:
>
> *(SLI formula figure)*
>
> As you can see, they calculate the rate of change of the count of all requests that were not 5xx, and divide it by the rate of change of the total count of requests. If there are any 5xx responses, the resulting value will be less than one. You can, again, use this formula in your alerting rules with a specified threshold, so that you get an alert when it is violated, or you can predict the near future with predict_linear and avoid SLA/SLO problems.
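
For reference, the availability SLI described in that excerpt is conventionally written in PromQL along these lines; the `http_requests_total` metric and `status` label are the usual convention and may differ depending on the instrumentation:

```promql
# Availability: fraction of requests that did not return a 5xx status over the window.
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```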

Feedback for 'Monitoring NServiceBus endpoints with Prometheus and Grafana' https://docs.particular.net/samples/open-telemetry/prometheus-grafana/

Location in GitHub: https://github.com/Particular/docs.particular.net/blob/master/samples/open-telemetry/prometheus-grafana/sample.md

bbrandt, Apr 11 2024