listmonk icon indicating copy to clipboard operation
listmonk copied to clipboard

Expose metrics for external consumption

Open avanier opened this issue 2 years ago • 16 comments

As a listmonk administrator, I want to have metrics available externally So that I can consume them and implement things like alerting with outboard systems.


Context

In our particular deployment of listmonk we're processing bounces from SES and the relevant subscribers are not being blocklisted automatically. Exporting metrics that we could hook into our monitoring system would help bring issues to our attention as they arise.

Misc

I'm specifically thinking of opening a PR to expose Prometheus metrics. I've seen that some of the stuff is already instrumented with go-metrics for internal dashboards, so building on that should be no biggie.

Do note, I'm suggesting exposing Prometheus metrics since I'm used to the tooling but we could just as well do something else like Influx, or Statsd. Whichever the community feels is the best common denominator.

avanier avatar Dec 16 '21 21:12 avanier

Hi @avanier, what kind of metrics are you proposing? Can you give some examples?

Prometheus format should be fine.

knadh avatar Dec 17 '21 05:12 knadh

For my particular use-case I know could use exporting something like listmonk_bounces_processed or something along those lines. Otherwise I'm guessing messenger / mailer errors? So generally I'm thinking of things representative of "things that you can do something about and that waiting to discover wouldn't be great".

The Go Prometheus client also has a bunch of "free" ones like build_info that's pretty handy in Kubernetes to watch your deployments roll out.

I'm open to suggestions if there anything that you feel should be part of a first pass.

avanier avatar Dec 17 '21 13:12 avanier

Makes sense. Off the top of my my head:

  • Bounce counts
  • Campaign send rates for running campaigns
  • Campaign counts grouped by status?
  • Subscriber counts grouped by status? (can be an expensive query).

Many of these queries already exist here and here.

knadh avatar Dec 18 '21 05:12 knadh

Hey, status update with good new and bad news.

Good news: I got started and it's looking good. You can check the branch I've got going on here.

Bad news: I likely won't be able to use very many of the queries due to the way Prometheus groks about counters. I mean, I probably could, but making a blocking metrics request is usually an anti-pattern for Prometheus metrics. (It can lead to significant problems at scrape-time which can screw up the metrics themselves once aggregated.)

Silver lining: There's some really nice hooks in Listmonk to latch on, so it shouldn't be too hard to get the metrics we want exposed provided no one objects to me strewing around a bunch of counters all over the place.

Context: I'm a sysadmin by trade, so the engineering part of software engineering is really not my forte. I could absolutely use feedback on how to structure this so it makes sense to maintain in the long run.

avanier avatar Jan 27 '22 08:01 avanier

Was checking out the WIP. The bounce counters increment, but on changing a setting somewhere, listmonk will restart resetting it to 0. Are the metrics supposed to be ephemeral like that?

To return accurate states, eg: sending stats of running campaigns or total bounce counts per campaign etc., a separate goroutine could query the DB for metrics periodically and keep it in a global state somewhere and return on GET /metrics. I am not sure if this is the Prometheus way of doing things though. @mr-karan do you have any thoughts?

knadh avatar Jan 27 '22 10:01 knadh

a separate goroutine could query the DB for metrics periodically and keep it in a global state somewhere and return on GET /metrics. I am not sure if this is the Prometheus way of doing things though

@knadh Okay, so as I understand going through the thread, these metrics are different from instrumentation metrics that you would typically write. Instrumentation of code via metrics means that, when a thing happens, you notify the metrics collector to update a certain value that it keeps in memory. A very simple example:

query, err := fetchFromDB()
if err!=nil {
  errCounter.increment()
}

The key thing to note here is that, we add metrics in the code path as and when real world state changes and not poll for it (like an external Goroutine). This is because here, source of truth is the app itself and not a DB.

However, in case of Listmonk, things like bounces and some other metrics aren't instrumentation metrics but more like analytical. And since they can't be pre-determined (inside the code path), we'll need to do what @knadh suggested:

a separate goroutine could query the DB for metrics periodically and keep it in a global state somewher

This is not a pattern I've observed usually however. I think this also warrants the usage of https://github.com/mr-karan/store-exporter (self plug :smile:), if these metrics can be pulled from DB.

However, if there are certain metrics that aren't recorded on DB but only held within the app, then I think we can bundle the metrics exposition inside listmonk as well.

Sidenote

I mean, I probably could, but making a blocking metrics request is usually an anti-pattern for Prometheus metrics. (It can lead to significant problems at scrape-time which can screw up the metrics themselves once aggregated.)

@avanier Yeah, that's correct. But in the above proposed solution, we can have a separate Goroutine to update the metrics state. When the HTTP request for /metrics comes, it doesn't compute on the fly but it simply flushes the content of metrics map object to the response buffer (hence not blocking).

mr-karan avatar Jan 27 '22 10:01 mr-karan

The whole idea is tickling my spider-senses, but I can't find any issue with @mr-karan 's suggestion. I've never done it this way and that store-exporter seems to make a lot of sense. I'll give it a try. :muscle:

A few random thoughts:

  • @knadh Yes, Prometheus application metrics are meant to be ephemeral. Prometheus is designed to handle counter resets, so that's not an issue either. It's actually a desirable feature since it allows you to see when your application restarts. It's especially useful for clustered applications. Ultimately, for alerting purposes, all we're interested in is the rate, so that disarms a lot edge cases too.
  • If we do this with metrics pulled from the database, we'll be locking ourselves out of the ability of getting granular metrics about which Listmonk instance has done which work if we were to run more than one Listmonk instance in a deployment. I could see this being particularly useful when looking at campaign send rates. (Right now I'm not sure we can do that, can we?)

avanier avatar Jan 27 '22 14:01 avanier

However, in case of Listmonk, things like bounces and some other metrics aren't instrumentation metrics but more like analytical.

If we're just counting bounces as they come, campaign messages as they're sent, subscribers as they're inserted etc., that's all instrumentation I guess? If that's the case, then we can just add counters instead of depending on the DB states.

I could see this being particularly useful when looking at campaign send rates. (Right now I'm not sure we can do that, can we?)

The --passive mode runs a listmonk instance that does not process campaigns. While this can be used to run multiple instances, only one instance connected to the DB should be processing campaigns and sending messages.

knadh avatar Jan 27 '22 17:01 knadh

If we're just counting bounces as they come, campaign messages as they're sent, subscribers as they're inserted etc., that's all instrumentation I guess? If that's the case, then we can just add counters instead of depending on the DB states.

That's usually how I've seen it used. I think if we mean to do "transactional-level accuracy analytics", I think #660 makes more sense for that use-case. Not that you can't use Prometheus for business intelligence, but it wasn't designed for that.

avanier avatar Jan 27 '22 21:01 avanier

the *.Inc() approach makes sense then. Just that it doesn't feel right to bring in Prometheus packages and semantics all across listmonk. A simpler internal package abstraction, something like internal/metrics that wraps the Prometheus metrics package, that can be injected into listmonk's internal packages will be ideal.

knadh avatar Jan 29 '22 11:01 knadh

Yep, that should be possible ^

I'd also add that maybe using https://github.com/VictoriaMetrics/metrics is a better idea than https://github.com/prometheus/client_golang/ because of a much simpler API and a lot less dependencies.

mr-karan avatar Jan 30 '22 03:01 mr-karan

I'd also add that maybe using https://github.com/VictoriaMetrics/metrics is a better idea

+1

knadh avatar Jan 30 '22 07:01 knadh

Ooh, yeah, I'm liking where this is going.

  1. Putting it in a internal/metrics internal package is an idea my sysadmin lizard brain can wrap itself around. I should be able to pull that off and it makes sense. :muscle:
  2. I had completely forgotten about the VictoriaMetrics libraries! We use their product at work and it's just stellar. Also, the guys there are super nice! :grin:

It will take me a while to scrounge up time to put this together, but I'll likely report back within a week or two with something.

avanier avatar Jan 30 '22 19:01 avanier

Here's an update as promised.

Diff Linky

General thoughts:

  • I need a review on how I'm using koanf and the configuration system in general. I have no idea of what I'm doing, but I'm pretty sure I'm doing it wrong.
  • I hadn't registered how barebones the VictoriaMetrics metrics library is and I can't help but feel I'm reinventing the wheel for what the Echo Prometheus middleware provides by default for observing http traffic. I guess I'm taking that kind of stuff for granted, hence why I didn't explicitly state I was expecting to get that out of the metrics too. :confused: This is not an actionnable comment, just a feeling I've yet to process. (I just finished writing this)
  • The way I implemented the function to record bounces is very crude. I do plan to refactor that so it's a bit more generic and better named. The whole thing needs a second pass.
  • Most of my time was spent grokking this rather than getting actual work done. Apparently 🧠 == hard.
  • It might be time for me to actually press the button and make a PR

@knadh && @mr-karan , I could use your feedback before I proceed butchering this thing some more. :smiling_imp: :grin:

avanier avatar Feb 08 '22 22:02 avanier

Nice work @avanier and thanks for this!

Can you create a PR (Draft is fine too). I am unable to leave comments on the diff link.

mr-karan avatar Feb 09 '22 03:02 mr-karan

@mr-karan I've collected this into a PR and gave it a once-over.

It should be a bit more readable now.

avanier avatar Feb 09 '22 08:02 avanier