influxdb metrics: Processing Engine

Processing Engine Metrics

Right now we expose little information about Processing Engine performance, so this ticket adds some basic metrics to track it.

Update the /metrics endpoint to serve the following metrics:

New Processing Engine Metrics:

[ ] plugin_execution_duration_seconds_bucket: Amount of time spent executing a plugin, per plugin, per trigger, with trigger type, bucketed into 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 10, +Inf seconds --- The plugin label should be the plugin's file name without the .py extension.
[ ] plugin_execution_duration_seconds_sum: Total amount of time spent executing a plugin, per plugin, per trigger, with trigger type
[ ] plugin_execution_duration_seconds_count: Total number of times a plugin is executed, per plugin, per trigger, with trigger type
[ ] processing_engine_memory_size_bytes: Total size of Processing Engine memory in bytes --- If this can be broken down further into threads, that'd be great, but not required
[ ] processing_engine_plugin_errors: Total number of errors, per plugin, per trigger

E.g.

...
plugin_execution_duration_seconds_bucket{plugin="sample_plugin",trigger="sample_trigger",type="on_request",le="0.05"} 2
plugin_execution_duration_seconds_bucket{plugin="sample_plugin",trigger="sample_trigger",type="on_request",le="0.1"} 3
plugin_execution_duration_seconds_bucket{plugin="sample_plugin",trigger="sample_trigger",type="on_request",le="0.25"} 5
...
plugin_execution_duration_seconds_sum{plugin="sample_plugin",trigger="sample_trigger",type="on_request"} 0.68
plugin_execution_duration_seconds_count{plugin="sample_plugin",trigger="sample_trigger",type="on_request"} 5
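
To make the sample output above concrete, here is a minimal sketch of how observed durations could map onto the cumulative buckets listed in this ticket. It is plain Python for illustration only, not the Processing Engine implementation: `plugin_label` and `render_histogram` are hypothetical helper names, and the durations in the usage note are made-up values chosen to reproduce the sample lines.

```python
from bisect import bisect_left
from pathlib import Path

# Finite upper bounds copied from the bucket list in this ticket.
# The +Inf bucket is appended when rendering.
BUCKETS = [0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 10]


def plugin_label(plugin_file: str) -> str:
    """Hypothetical helper: file name without the .py suffix."""
    return Path(plugin_file).stem


def render_histogram(plugin_file, trigger, trigger_type, durations):
    """Return exposition-style lines for one (plugin, trigger, type) series."""
    name = "plugin_execution_duration_seconds"
    labels = (
        f'plugin="{plugin_label(plugin_file)}",'
        f'trigger="{trigger}",type="{trigger_type}"'
    )
    # Count each observation in the first bucket whose bound is >= the value.
    per_bucket = [0] * (len(BUCKETS) + 1)
    for d in durations:
        per_bucket[bisect_left(BUCKETS, d)] += 1
    # Prometheus histogram buckets are cumulative, so keep a running total.
    lines, running = [], 0
    for bound, n in zip([*BUCKETS, "+Inf"], per_bucket):
        running += n
        lines.append(f'{name}_bucket{{{labels},le="{bound}"}} {running}')
    lines.append(f"{name}_sum{{{labels}}} {round(sum(durations), 6)}")
    lines.append(f"{name}_count{{{labels}}} {len(durations)}")
    return lines
```

Calling `render_histogram("sample_plugin.py", "sample_trigger", "on_request", [0.04, 0.05, 0.09, 0.25, 0.25])` yields 15 lines, including the le="0.05", le="0.1", and le="0.25" buckets with counts 2, 3, and 5, a sum of 0.68, and a count of 5.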

Apr 06 '25 17:04 peterbarnett03

Labelling the metrics per plugin and per trigger may cause too high a cardinality, especially for the duration histograms. Would you consider a db label to group them by database, as we have done for other metrics, as a starting point?

Trigger type is one that we can label because the cardinality of that is bounded (there are 5 or so types).

May 16 '25 14:05 hiltontj

We don't expect that many triggers to be defined. What does it look like if they have 1k triggers (an exceptionally high number)?

May 16 '25 14:05 pauldix

I guess cardinality would be N_triggers in this case, regardless of how many plugins or databases or trigger types (each trigger has a single type).

There are 15 lines emitted by the /metrics API for each duration histogram (13 buckets plus _sum and _count), so with 1k triggers the worst case would be 15,000 lines.
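
As a rough check of that arithmetic, here is a small sketch. It assumes the bucket list from the ticket (12 finite bounds plus +Inf) and one series per trigger, since each trigger has a single plugin and type. `histogram_lines` is a hypothetical helper name.

```python
# Finite bucket bounds listed in the ticket; the +Inf bucket is added below.
FINITE_BUCKETS = 12


def histogram_lines(n_triggers: int, finite_buckets: int = FINITE_BUCKETS) -> int:
    """Lines emitted by /metrics for the duration histogram across all triggers."""
    per_series = finite_buckets + 1 + 2  # +Inf bucket, then _sum and _count
    return n_triggers * per_series


print(histogram_lines(1))     # 15
print(histogram_lines(1000))  # 15000
```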

I'm not very familiar with the limitations of Prometheus or what is considered high cardinality, only that they recommend against unbounded cardinality for labels. If this is acceptable, then I won't block it.

May 16 '25 16:05 hiltontj