Crow
Feature Request: Add support for Prometheus metrics
This issue is for collecting ideas and opinions on exposing Prometheus metrics via Crow. If it gets the thumbs up, I'm happy to implement it and file a PR, or otherwise join in on efforts towards it.
I'm far from an expert in Crow, Prometheus, or metrics in general, so please excuse any inaccuracies in what follows. I'm peripherally aware of the existence of OpenMetrics and OpenTelemetry, and maybe these are better routes to go down.
Background
For those who aren't familiar with Prometheus, it is an application that pulls ("scrapes") metrics from configured services by sending HTTP requests to them. Each service responds with its current metric values in a simple text format. Prometheus stores the collected metrics as time series, which can then be pulled in by a variety of tools such as Grafana for a beautiful monitoring and diagnostic experience, or Alertmanager for sending out notifications via email, Slack, etc.
I personally use it in a Crow context for a variety of purposes. I use Crow as an internal monitoring and control interface to an automated system, and I use Prometheus to bump a metric when, e.g., a user performs an action, which then triggers an alert via Slack (human intervention is an exceptional case in our system). I also use it to determine the most and least used pages, which tells us where to focus improvement efforts. I use it to detect requests that take too long to process, and to track requests that lead to errors (if this chart starts to climb, there's probably a bad link somewhere, and I can dig through the logs to find it). If these sound like the kinds of things you want to be able to do with ease, then this feature is for you.
A typical use case might be deploying Crow on a server and pointing Prometheus to it manually, but an increasingly common approach is to containerise the Crow application, launch it on a platform such as Kubernetes, and add annotations to the service definition so that Prometheus discovers instances automatically. Because Prometheus stores the metrics as time series, the history persists beyond the lifetime of a single Crow instance.
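To make the "simple text format" concrete, here is a minimal sketch of what a service returns for a single counter. The helper name `render_counter` and the metric name used with it are made up for illustration; they are not part of Crow or prometheus-cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper (not a Crow or prometheus-cpp function): renders one
// counter in the Prometheus text exposition format.
std::string render_counter(const std::string& name,
                           const std::string& help,
                           std::uint64_t value) {
    assert(!name.empty());  // a sample line without a metric name is invalid
    std::ostringstream out;
    out << "# HELP " << name << ' ' << help << '\n';  // human-readable description
    out << "# TYPE " << name << " counter\n";         // metric type hint
    out << name << ' ' << value << '\n';              // the sample itself
    return out.str();
}
```

Calling `render_counter("http_requests_total", "Total HTTP requests.", 42)` would produce three lines: a `# HELP` line, a `# TYPE` line, and the sample `http_requests_total 42`. Prometheus attaches the timestamp itself when it scrapes.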
Non-native Approach
At the moment, my approach is as follows. Say I have a Crow app that I want to add metrics to, e.g. the number of requests per URI, the typical processing time per request to each URI, or the number of requests with a given header (such as a user ID).
I use prometheus-cpp and, in my main() function, I set up a bunch of counters and gauges that correspond to those metrics, then register them with a registry and expose the registry on a certain address/port (a different port to the one that I bind the Crow app to).
In my Crow code, where an event of interest is handled, typically in the body of a route callback or a websocket handler, I update the corresponding metric(s). The processing cost is typically just a floating-point addition (because prometheus-cpp assumes all metrics are doubles). The prometheus-cpp library then reads and serves the current values when requested.
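The per-event cost described above can be sketched with the standard library alone. The `Counter` class below is a hypothetical stand-in written for this illustration, not prometheus-cpp's actual class. It assumes metrics are stored as doubles and may be bumped from several worker threads at once.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a library counter; NOT prometheus-cpp's API.
class Counter {
public:
    // Counters only ever go up, so a negative increment is a caller bug.
    void increment(double amount = 1.0) {
        assert(amount >= 0.0);
        double current = value_.load(std::memory_order_relaxed);
        // Retry until the addition lands; on failure `current` is refreshed
        // with the value another thread wrote in the meantime.
        while (!value_.compare_exchange_weak(current, current + amount,
                                             std::memory_order_relaxed)) {
        }
    }

    // Read by whatever serves the scrape request.
    double value() const { return value_.load(std::memory_order_relaxed); }

private:
    std::atomic<double> value_{0.0};
};
```

A route callback would then do little more than call `increment()` on the relevant counter before returning its response.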
This approach works okay, but it isn't ideal:
- While prometheus-cpp is a beautiful library that is very easy to use, it is designed for more general applications. It brings in civetweb (for its HTTP server) and GoogleTest (for its tests), whereas serving metrics through Crow would make more sense for a project that is already using it.
- The range of metrics that can be used with Crow non-natively is limited (without significant work) because the updating is performed in user-facing code, which doesn't have access to the guts of Crow.
Ideas
Here are the ideas that I currently have:
- Serving metrics should of course be optional. Middleware seems like a good choice.
- Metrics should be served on a separate port of choice.
- There should be a set of default metrics, such as total requests and average latency of the main app and of the Prometheus exporter itself, and the number of worker crashes.
- A nice optional addition might be requests by route, but this isn't suitable for every application. The size of the response you send back to Prometheus scales with the number of routes, so an application with 5 routes being scraped every 5 seconds will be fine, whereas an application with 500 routes might experience performance problems.
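To illustrate why a per-route metric grows the scrape response, here is a rough sketch that emits one sample line per route, using a `route` label. The function name, the metric name, and the label name are all assumptions made for this example, and it assumes route strings contain no quotes, backslashes, or newlines (which would need escaping in a label value).

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: renders a per-route request counter. The output has
// one sample line per route, so its size grows linearly with the route count.
// Assumes routes contain no characters that need escaping in a label value.
std::string render_requests_by_route(
        const std::map<std::string, std::uint64_t>& hits) {
    std::ostringstream out;
    out << "# TYPE http_requests_total counter\n";
    for (const auto& entry : hits) {
        out << "http_requests_total{route=\"" << entry.first << "\"} "
            << entry.second << '\n';
    }
    return out.str();
}
```

With 5 routes this is 5 sample lines per scrape; with 500 routes it is 500 lines for this one metric alone, and any extra label (method, status code) multiplies that further.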
@jake-arkinstall Can you write your library so that it can be easily ported to Crow? We’d really love something this good, and I believe this should be in the framework by default.