feat(metrics): add Vector Throughput & health (via prometheus)

Open lsampras opened this issue 1 year ago • 4 comments

Add a dashboard to monitor Vector throughput and log loss. The dashboard should show throughput for the following pipes:

Throughput

  1. stdout -> loki
  2. stdout -> opensearch
  3. kafka -> loki
  4. kafka -> transform -> opensearch

These flows should include incoming events, outgoing events, and dropped events as a time series chart.

The Kafka source should include consumer lag metrics as well.
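As a rough illustration of the throughput and lag panels described above, the PromQL behind them could be generated along these lines. This is only a sketch: the metric names (`vector_component_received_events_total`, `vector_component_sent_events_total`, `vector_component_discarded_events_total`, `vector_kafka_consumer_lag`) and the `component_id` label are assumptions about what Vector's internal metrics look like behind a Prometheus exporter, and the component IDs in the usage note are hypothetical.

```python
# Assumed metric names: Vector internal metrics exposed via a Prometheus
# exporter with the "vector" namespace. Verify against the actual deployment.
RECEIVED = "vector_component_received_events_total"
SENT = "vector_component_sent_events_total"
DISCARDED = "vector_component_discarded_events_total"
KAFKA_LAG = "vector_kafka_consumer_lag"


def rate_query(metric: str, component_ids: list[str], window: str = "5m") -> str:
    """PromQL for the summed per-second rate of a counter over some components."""
    matcher = "|".join(component_ids)
    return f'sum(rate({metric}{{component_id=~"{matcher}"}}[{window}]))'


def pipe_queries(component_ids: list[str], window: str = "5m") -> dict[str, str]:
    """Queries for one pipe, given its component IDs in source-to-sink order.

    Incoming is measured at the first component, outgoing at the last,
    and dropped events across every component in the pipe.
    """
    return {
        "incoming": rate_query(RECEIVED, component_ids[:1], window),
        "outgoing": rate_query(SENT, component_ids[-1:], window),
        "dropped": rate_query(DISCARDED, component_ids, window),
    }


def kafka_lag_query(source_id: str) -> str:
    """PromQL for the total consumer lag reported by one Kafka source."""
    return f'sum({KAFKA_LAG}{{component_id="{source_id}"}})'
```

For example, `pipe_queries(["kafka_in", "parse_logs", "opensearch_out"])` would yield the three series for the kafka -> transform -> opensearch pipe, assuming those are the component IDs in the Vector config.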

Health (this would be primarily powered by these metrics)

  • CPU usage of vector
  • Memory usage of vector
  • buffer size
  • errors happening in transforms
  • utilization of each component
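One possible mapping from the health items above to PromQL is sketched below. It assumes Vector runs in Kubernetes with cAdvisor metrics scraped (for CPU and memory) and that Vector's internal metrics are exported under the `vector_` prefix; the metric names, the `component_kind` label, and the `vector-.*` pod pattern are all assumptions to check against the real setup.

```python
def health_queries(pod_regex: str = "vector-.*", window: str = "5m") -> dict[str, str]:
    """Return one PromQL expression per health panel (all names are assumed)."""
    pod = f'pod=~"{pod_regex}"'
    return {
        # CPU and memory from cAdvisor container metrics (assumes Kubernetes).
        "cpu_cores": f"sum(rate(container_cpu_usage_seconds_total{{{pod}}}[{window}]))",
        "memory_bytes": f"sum(container_memory_working_set_bytes{{{pod}}})",
        # Buffer size, transform errors, and utilization from Vector itself.
        "buffer_events": "sum by (component_id) (vector_buffer_events)",
        "transform_errors": (
            "sum by (component_id) "
            f'(rate(vector_component_errors_total{{component_kind="transform"}}[{window}]))'
        ),
        "utilization": "avg by (component_id) (vector_utilization)",
    }
```

Grouping by `component_id` keeps one series per component, which matches the "utilization of each component" item.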

Ideally, we can take most of the components from an openly available source, modifying some of them to suit our setup.

lsampras avatar Jun 18 '24 12:06 lsampras

@lsampras I am interested in working on this task

Prashant-dot1 avatar Sep 15 '24 03:09 Prashant-dot1

Hey @Prashant-dot1, Thanks for your interest, this issue is available for contribution.

Since this is a somewhat open issue without fixed specifications, we would prefer to get a few details about the implementation:

  • is there any existing dashboard that you would be using entirely or as a reference?
  • do you plan to create your own dashboard for this?

lsampras avatar Sep 16 '24 06:09 lsampras

@lsampras I am thinking of using these openly available dashboards as a starting point (they would need modification for this task):

Health metrics or system-level metrics, tracking how well the Vector instance is handling all the event pipes together - https://grafana.com/grafana/dashboards/19649-vector-monitoring/

https://grafana.com/grafana/dashboards/721-kafka/

The dashboard structure could be something like this:

  • Row 1: Four panels (one for each pipeline) that show throughput metrics: incoming, outgoing, and dropped events.
  • Row 2: Kafka metrics, specifically consumer lag for the Kafka-related pipelines.
  • Row 3: General health metrics like CPU usage, memory usage, buffer utilization, and error tracking for the overall system.
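The three-row layout described above could be roughed out as Grafana dashboard JSON like this. It is a minimal sketch, not the final dashboard: the panel titles and sizes are illustrative, the query expressions are left as placeholders, and only a small subset of Grafana's dashboard fields is used.

```python
import json

PIPELINES = [
    "stdout -> loki",
    "stdout -> opensearch",
    "kafka -> loki",
    "kafka -> transform -> opensearch",
]
HEALTH_PANELS = ["CPU usage", "Memory usage", "Buffer utilization", "Errors"]


def row(title: str, y: int) -> dict:
    """A full-width row header on Grafana's 24-column grid."""
    return {"type": "row", "title": title, "collapsed": False,
            "gridPos": {"h": 1, "w": 24, "x": 0, "y": y}, "panels": []}


def timeseries(title: str, x: int, y: int, w: int, series: list[str]) -> dict:
    """A time series panel with one placeholder query per named series."""
    targets = [
        {"refId": chr(ord("A") + i), "legendFormat": name,
         "expr": f"<PromQL for {name}>"}  # placeholder, to be filled in
        for i, name in enumerate(series)
    ]
    return {"type": "timeseries", "title": title,
            "gridPos": {"h": 8, "w": w, "x": x, "y": y}, "targets": targets}


def build_dashboard() -> dict:
    panels = [row("Throughput", 0)]
    for i, name in enumerate(PIPELINES):  # Row 1: one panel per pipeline
        panels.append(timeseries(name, 6 * i, 1, 6, ["incoming", "outgoing", "dropped"]))
    panels.append(row("Kafka", 9))  # Row 2: consumer lag
    panels.append(timeseries("Consumer lag", 0, 10, 24, ["lag"]))
    panels.append(row("Health", 18))  # Row 3: system health
    for i, name in enumerate(HEALTH_PANELS):
        panels.append(timeseries(name, 6 * i, 19, 6, [name]))
    return {"title": "Vector throughput and health", "panels": panels}


if __name__ == "__main__":
    print(json.dumps(build_dashboard(), indent=2))
```

Panels taken from the two referenced dashboards could replace the placeholder targets once the matching queries are adapted to this setup.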

Prashant-dot1 avatar Sep 19 '24 02:09 Prashant-dot1

@Prashant-dot1 the shared design looks good... I'll assign this

lsampras avatar Sep 26 '24 14:09 lsampras