Less bloated collection of debug metrics from modules

Open ptodev opened this issue 1 year ago • 1 comments

Request

"Debug metrics" refers to the metrics available on the Agent's /metrics endpoint. These metrics indicate how various internals are working, and are used for alerts and dashboards which monitor the state of Agent instances.

Unfortunately, Agents running with modules have a few issues with their metrics.

Issue 1: The component_id labels can be very long.

This is because the ID includes the "module" component (e.g. module.string) which imported that component. If a module.string imports another module.string which in turn uses prometheus.remote_write, then the ID label on the metric includes every module in that chain and gets quite long.

If the label is extremely long, it may even hit a limit in other systems such as Mimir. That said, by default Mimir sets its max_label_value_length config parameter to 2048 - this should be long enough for most uses.
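To get a feel for how nesting depth relates to that limit, here is a minimal sketch. It assumes that nested component IDs are built by joining module segments with "/" and that the limit counts characters; both are assumptions made for illustration, and the helper names and example segment names are hypothetical.

```python
# Default value of Mimir's max_label_value_length, as quoted above.
MAX_LABEL_VALUE_LENGTH = 2048


def nested_component_id(depth: int,
                        module_segment: str = "module.string.example",
                        leaf: str = "prometheus.remote_write.default") -> str:
    """Build a hypothetical component ID nested `depth` modules deep.

    Assumption: each importing module contributes one "/"-separated segment.
    """
    return "/".join([module_segment] * depth + [leaf])


def exceeds_limit(value: str, limit: int = MAX_LABEL_VALUE_LENGTH) -> bool:
    """Return True if the label value is longer than the limit.

    Assumption: the limit is compared against the character count.
    """
    return len(value) > limit
```

With segment names of this size, a handful of nesting levels stays far below 2048 characters, which matches the expectation that the default is long enough for most uses.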

Long component IDs can make dashboards look awkward if they want to show a component ID in a drop down or a graph legend:

  • Maybe we could make those drop downs work by not using exact component names? E.g. if there is a prometheus.remote_write in a drop down, then the dashboards will show prometheus.remote_write metrics from any module.
  • Alternatively, there could be separate labels for the "module path" and for the leaf component name? This would mean that there would no longer be a single metric label which identifies a component. Losing such ID labels is not ideal because they have their own usefulness.
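The second option above could look roughly like the following sketch. It assumes component IDs are "/"-separated with the leaf component last; the function name and the "module_path" and "component_name" label names are hypothetical and only serve as illustration.

```python
def split_component_id(component_id: str) -> dict[str, str]:
    """Split a full component ID into a module path and a leaf component name.

    Assumption: the ID is "/"-separated and the last segment is the leaf
    component. A component that is not inside any module gets an empty path.
    """
    module_path, _, leaf = component_id.rpartition("/")
    return {"module_path": module_path, "component_name": leaf}
```

A dashboard drop-down could then list only the "component_name" values, while the pair of labels together still identifies a single component.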

Issue 2: Each Flow controller has its own set of metric series.

If an Agent uses multiple Flow controllers, each controller exposes its own copy of the controller metrics, which could bloat the /metrics endpoint. To avoid the additional series, could we maybe consolidate controller functionality so that it runs only once per process?

Should we make debug metrics more configurable?

There might not be a "one size fits all" solution. We might have to solve this by adding some additional settings for how debug metrics should be gathered and transformed. E.g. could there be a metrics block similar to the existing logging block?

Use case

Ease of use.

ptodev avatar Jan 26 '24 19:01 ptodev

We've discussed this offline and concluded that a good first step would be to separate the parent path into a new label. This would be an immediate benefit and would also leave room for different solutions in the future (e.g. hashing long parent paths stemming from nested modules).
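As a rough illustration of the hashing idea, the sketch below replaces a parent path with a short digest once it grows past a threshold. The function name, the 64-character threshold, the 12-character digest, and the "hashed:" prefix are all hypothetical choices, not anything decided in the discussion.

```python
import hashlib


def shorten_parent_path(parent_path: str,
                        max_length: int = 64,
                        digest_chars: int = 12) -> str:
    """Return the parent path unchanged if short, else a fixed-size hash of it.

    Hypothetical scheme: paths longer than `max_length` characters are
    replaced by "hashed:" plus the first `digest_chars` hex characters of
    their SHA-256 digest, so the label value has a bounded length.
    """
    if len(parent_path) <= max_length:
        return parent_path
    digest = hashlib.sha256(parent_path.encode("utf-8")).hexdigest()
    return "hashed:" + digest[:digest_chars]
```

The trade-off is that a hashed value is no longer human-readable, so something would have to map digests back to the original paths.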

tpaschalis avatar Feb 05 '24 19:02 tpaschalis