[FEAT] Enhance OpenTelemetry with Default Workflow Engine Metrics

Open Sverre-W opened this issue 1 year ago • 1 comments

In #5810, OpenTelemetry support was introduced. It would be valuable to extend this functionality by providing default metrics for the workflow engine, allowing for better observability and performance tracking. The goal of this issue is to gather feedback on which additional metrics the community would like to see included.

Proposed Metrics

Counters

  • Workflows Started: Total number of workflows that have been initiated.
  • Workflows Resumed: Total number of workflows that have resumed after being suspended.
  • Workflows Faulted: Total number of workflows that have encountered errors.
  • Workflows Suspended: Total number of workflows that have been paused.
  • Activities Executed: Total number of activities that have been performed.
  • Activities Faulted: Total number of activities that have resulted in errors.

Gauges

  • Active Workflows: Number of workflows currently in progress (i.e., not completed or terminated).

Histograms

  • Activity Execution Time: Distribution of execution times for individual activities.
  • Workflow Execution Time: Distribution of total execution times for entire workflows.
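
To make the three instrument kinds above concrete, here is a minimal, library-free sketch. It is not Elsa's or OpenTelemetry's API: the class `WorkflowMetrics`, the metric names, the bucket bounds, and the choice to treat a faulted workflow as no longer active are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WorkflowMetrics:
    """Hypothetical in-memory recorder, for illustration only."""
    counters: Counter = field(default_factory=Counter)     # only ever increase
    active_workflows: int = 0                               # gauge: goes up and down
    workflow_durations: list = field(default_factory=list)  # raw histogram input

    def workflow_started(self):
        self.counters["workflows.started"] += 1
        self.active_workflows += 1

    def workflow_ended(self, duration_seconds, faulted=False):
        # Assumption: a faulted workflow no longer counts as active.
        if faulted:
            self.counters["workflows.faulted"] += 1
        self.active_workflows -= 1
        self.workflow_durations.append(duration_seconds)


def bucket_counts(values, bounds):
    """Count values per bucket; a value goes into the first bucket whose
    upper bound is >= the value, or into a final overflow bucket."""
    counts = [0] * (len(bounds) + 1)
    for value in values:
        index = next((i for i, b in enumerate(bounds) if value <= b), len(bounds))
        counts[index] += 1
    return counts


metrics = WorkflowMetrics()
for _ in range(3):
    metrics.workflow_started()
metrics.workflow_ended(0.2)
metrics.workflow_ended(4.0, faulted=True)

# Bucket upper bounds in seconds (assumed values).
histogram = bucket_counts(metrics.workflow_durations, [0.5, 1.0, 5.0])
```

In this sketch the counters only accumulate, the gauge reflects the current state, and the histogram keeps the shape of the duration distribution rather than a single average.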

Community Input

This issue aims to collect suggestions from the community regarding additional metrics that would be useful for tracking the performance and health of the workflow engine.

Sverre-W, Sep 27 '24 03:09

Hello,

I'm currently experimenting with these counters. What about:

  • Workflows Finished: Total number of workflows that finished successfully.

From what I've done so far, a lot of this could be added in the middleware.

jdevillard, Oct 09 '24 07:10

Hello,

I would also highly appreciate if some metrics would be exported by default. 👍🏻

Regarding the proposed metric "Active Workflows": This might not make sense in a scenario where typical workflow durations are orders of magnitude below the scraping interval. However, I think that adding it as a metric by default still makes sense, but users need to think about whether it makes sense for their specific use case (they always should 😄).
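
A small simulation of the scraping concern, as a sketch only. The numbers are made up for illustration: one workflow starts every second and runs for 50 ms, while the gauge is sampled every 15 s. Times are integer milliseconds to keep the arithmetic exact.

```python
def active_at(time_ms, workflows):
    """Number of workflows running at the given instant."""
    return sum(1 for start, end in workflows if start <= time_ms < end)


def started_by(time_ms, workflows):
    """Cumulative number of workflows started up to the given instant."""
    return sum(1 for start, _ in workflows if start <= time_ms)


# Assumed workload: 60 workflows, one per second, each lasting 50 ms.
workflows = [(k * 1000 + 100, k * 1000 + 150) for k in range(60)]

# Assumed scrape interval: 15 seconds.
scrape_times = [0, 15_000, 30_000, 45_000, 60_000]

gauge_samples = [active_at(t, workflows) for t in scrape_times]
counter_samples = [started_by(t, workflows) for t in scrape_times]
```

With these assumptions every gauge sample is 0 even though 60 workflows ran, whereas the cumulative counter still shows the activity at each scrape.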

Similar to "Active workflows", one could think of a separate gauge for "Suspended workflows" or "Queued workflows".

Also, I think that all metrics would benefit from labels such as workflowInstanceId and others.
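
One thing to weigh when picking labels: in most metrics backends, each distinct combination of label values becomes its own time series. The sketch below counts the resulting series for two hypothetical labels. The label names and the workload (1000 instances spread over 3 definitions) are assumptions for illustration, not names used by Elsa.

```python
def series_count(measurements, label_keys):
    """Number of distinct time series when grouping by the given labels."""
    return len({tuple(m[key] for key in label_keys) for m in measurements})


# Assumed workload: 1000 workflow instances spread over 3 definitions.
measurements = [
    {
        "workflow_definition": f"definition-{i % 3}",
        "workflow_instance_id": f"instance-{i}",
    }
    for i in range(1000)
]

series_by_definition = series_count(measurements, ["workflow_definition"])
series_by_instance = series_count(measurements, ["workflow_instance_id"])
```

Here a definition-level label yields 3 series, while a per-instance label yields one series per workflow instance, so the series count keeps growing with every new instance.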

mhichb, Feb 14 '25 08:02

I am seeking to have this defined in a generic way as part of the OpenTelemetry semantic conventions, similar to what we already have for other systems. It would be good to get feedback on https://github.com/open-telemetry/semantic-conventions/pull/2387, as I feel those definitions can help here.

thompson-tomo, Jul 27 '25 08:07