
Monitoring API

Open tylertreat opened this issue 4 years ago • 10 comments

Provide an API that exposes monitoring information and metrics.

We'll need to think about whether this should be part of the gRPC API or a separate HTTP/REST-based API. My inclination is that HTTP is nicer for implementing integrations and can be hit directly from a web browser, curl, etc. for debugging purposes. The downside is that it will require running an HTTP server on an additional port.
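
For illustration, a minimal sketch of what an embedded HTTP monitoring server on an additional port could look like (the port number, endpoint path, and status fields are all hypothetical, not part of any existing Liftbridge API):

```go
// Hypothetical sketch: serve a monitoring/debug endpoint over HTTP on a
// separate, configurable port so it can be hit with curl or a browser.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/status", func(w http.ResponseWriter, r *http.Request) {
		// Placeholder fields; real values would come from the running server.
		status := map[string]interface{}{
			"isLeader":   true,
			"partitions": 8,
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(status)
	})

	// ":9292" is an arbitrary example port, separate from the gRPC port.
	log.Fatal(http.ListenAndServe(":9292", mux))
}
```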

tylertreat avatar Jul 02 '20 21:07 tylertreat

Just want to add that gRPC-Web can be used, so the API can still be hit from a web browser. You can avoid taking on Envoy as a separate runtime dependency by embedding it.

https://www.getenvoy.io/

Example: https://github.com/pomerium/pomerium/blob/master/scripts/embed-envoy.bash

joe-getcouragenow avatar Jul 02 '20 22:07 joe-getcouragenow

Please export the metrics at /metrics in Prometheus format. That would be best.
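
For reference, exposing /metrics in Prometheus format with the official Go client is only a few lines. This is a generic sketch using prometheus/client_golang (the port is an arbitrary example), not existing Liftbridge code:

```go
// Sketch: expose default Go/process metrics at /metrics in Prometheus format.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9292", nil)) // example port
}
```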

annismckenzie avatar Jul 03 '20 07:07 annismckenzie

Hi! Just doing a PoC in a big mesh, and I'm suffering from the lack of a /metrics endpoint.

ekbfh avatar Sep 17 '20 15:09 ekbfh

looking forward to this one too :)

definitely a requirement for us to use in production

danthegoodman1 avatar Dec 07 '21 20:12 danthegoodman1

I plan to tackle this once consumer groups are completed. The plan at this time is to implement a /metrics endpoint in Prometheus format.

tylertreat avatar Dec 07 '21 23:12 tylertreat

I suggest we put together a proposal here on what metrics should be exposed. I think it would be nice to have an exhaustive v0 list of the metrics judged to be critical. Any ideas?

I also think there are two kinds of metrics:

  • Per-server metrics: things like CPU, RAM, etc. of each server
  • Network/mesh metrics: things related to the cluster mesh itself, e.g. Raft, metadata, etc.

As a starting point, it may be relevant to look at the list of metrics exposed by HashiCorp Nomad.

LaPetiteSouris avatar Dec 10 '21 21:12 LaPetiteSouris

There are probably 3 categories of metrics:

  • Low-level server metrics (CPU, RAM, etc.)
  • Control-plane metrics (Raft information, low-level partition and clustering metrics such as leader information, follower last contact, etc.)
  • Higher-level control plane and data plane metrics (consumer group information, partition message rates, etc.)

There may be others that I am missing, but this is what comes to mind initially. To your point, the first step should probably be determining what the minimal critical set of metrics is, then adding more once there is an identified need. I would prefer to start small and then build on it.
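
To make the categories above concrete, here is a rough sketch of one representative Prometheus metric per category. All of the metric names are made up for illustration, not an agreed-upon naming scheme:

```go
// Sketch: one example metric per category, registered with prometheus/client_golang.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

var (
	// Control-plane: gauge set to 1 on the current Raft metadata leader, 0 elsewhere.
	raftLeader = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "liftbridge_raft_is_leader",
		Help: "1 if this server is the Raft metadata leader.",
	})

	// Data-plane: messages received, labeled by stream and partition.
	messagesReceived = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "liftbridge_partition_messages_received_total",
		Help: "Messages received per partition.",
	}, []string{"stream", "partition"})
)

// Register wires everything into a registry.
func Register(reg *prometheus.Registry) {
	// Low-level server metrics come for free from the standard collectors.
	reg.MustRegister(collectors.NewGoCollector())
	reg.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
	reg.MustRegister(raftLeader, messagesReceived)
}
```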

tylertreat avatar Dec 10 '21 22:12 tylertreat

I suggest starting even smaller.

We can already define the code pattern that will be used to collect and export metrics.

Every metric added later is then basically an adapter to plug in, and metrics can be added progressively and independently.

This can proceed in parallel with the discussion on which metrics to expose, or we can pick one or two metrics somewhat arbitrarily to begin with.
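
One possible shape for that pattern, sketched only to illustrate the "each metric is an adapter" idea (the interface and names are hypothetical): each component implements a small source interface, and the exporter registers whatever sources exist, so new metrics can be added one by one without touching the exporter.

```go
// Hypothetical sketch of an adapter-style metrics pattern.
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Source is implemented by any component that wants to export metrics.
// Each new metric (or group of metrics) is just another Source to register.
type Source interface {
	Collectors() []prometheus.Collector
}

// Exporter owns the registry that backs the /metrics endpoint.
type Exporter struct {
	registry *prometheus.Registry
}

func NewExporter() *Exporter {
	return &Exporter{registry: prometheus.NewRegistry()}
}

// AddSource registers a source's collectors; sources can be added
// progressively and independently as new metrics are implemented.
func (e *Exporter) AddSource(s Source) {
	e.registry.MustRegister(s.Collectors()...)
}

// Registry exposes the underlying registry for the HTTP handler.
func (e *Exporter) Registry() *prometheus.Registry {
	return e.registry
}
```

The /metrics handler would then be built from the registry, e.g. with promhttp.HandlerFor(exporter.Registry(), promhttp.HandlerOpts{}).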

LaPetiteSouris avatar Dec 15 '21 09:12 LaPetiteSouris

I would really appreciate ways to calculate produce/consume rates, and even more so consumer lag (in time).
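
To make the ask concrete, something like the following is what I mean; the field names are hypothetical and just assume the server exposes the partition high watermark, the group's committed cursor, and the timestamp of the committed message:

```go
// Hypothetical sketch of consumer lag calculations.
package lag

import "time"

// OffsetLag is the number of messages the consumer is behind the high watermark.
func OffsetLag(highWatermark, committedOffset int64) int64 {
	if committedOffset >= highWatermark {
		return 0
	}
	return highWatermark - committedOffset
}

// TimeLag approximates how far behind the consumer is in time, measured as
// the age of the last message it has committed.
func TimeLag(now, lastCommittedMsgTimestamp time.Time) time.Duration {
	return now.Sub(lastCommittedMsgTimestamp)
}
```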

danthegoodman1 avatar Jan 10 '22 02:01 danthegoodman1

Hi! Consumer groups are good, so.. :)

I just want to suggest some values to export as metrics, such as the HW (high watermark), last offset, and cursor counts. They really help when investigating certain processes.

ekbfh avatar Mar 15 '22 13:03 ekbfh