
CI/CD conventions for metrics

Open christophe-kamphaus-jemmic opened this issue 1 year ago • 4 comments

Area(s)

area:cicd

Is your change request related to a problem? Please describe.

This issue is to discuss attributes specific to metrics, as part of the CI/CD Working Group and Semantic Conventions WG. A challenge specific to metrics can be the time series cardinality when a CI/CD system observes metrics for individual builds.

Describe the solution you'd like

Following https://github.com/open-telemetry/semantic-conventions/pull/1075 (and adjusting the vocabulary below to align with #1075), we should define metric attributes for

  • duration of pipelineRuns (by status, pipeline)
  • count of pipelineRuns (by status, pipeline)
  • count of agents
  • queue length of pending pipelineRuns
  • duration for how long a pipelineRun is in the queue before starting execution

Additionally, it should be possible to opt in to metrics specific to a particular pipelineRun. These could be metrics about the agent which executes a pipelineRun, the OS, network, JVM, the number of failed/total tests … We need to specify the attribute which should link these metrics to the pipelineRun, e.g. pipeline.run.id

Metrics specific to a pipelineRun are of high cardinality. We should document this as a warning and give guidance on how these metrics can be efficiently encoded in the OTel protocol, i.e. by using resource attributes instead of metric attributes wherever possible.
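To illustrate the encoding point, here is a minimal sketch comparing the two placements of the run identifier. The dictionaries are a simplified stand-in for an OTLP export (not the real schema), and the metric names and run id are hypothetical; only `pipeline.run.id` comes from the proposal above.

```python
import json

# Hypothetical values, for illustration only.
RUN_ID = "run-1234"
METRIC_NAMES = ["agent.cpu.utilization", "agent.memory.usage", "tests.failed"]


def as_metric_attribute(run_id, names):
    """Simplified payload where the run id is repeated on every data point."""
    return {
        "resource": {"attributes": {}},
        "metrics": [
            {
                "name": name,
                "data_points": [
                    {"attributes": {"pipeline.run.id": run_id}, "value": 0}
                ],
            }
            for name in names
        ],
    }


def as_resource_attribute(run_id, names):
    """Simplified payload where the run id is stated once on the resource."""
    return {
        "resource": {"attributes": {"pipeline.run.id": run_id}},
        "metrics": [
            {"name": name, "data_points": [{"attributes": {}, "value": 0}]}
            for name in names
        ],
    }


def count_occurrences(payload, key):
    """Count how often an attribute key appears in the serialized payload."""
    return json.dumps(payload).count('"' + key + '"')


per_point = count_occurrences(as_metric_attribute(RUN_ID, METRIC_NAMES), "pipeline.run.id")
per_resource = count_occurrences(as_resource_attribute(RUN_ID, METRIC_NAMES), "pipeline.run.id")
print(per_point, per_resource)
```

With the run id on the resource, it is carried once per export regardless of how many metrics and data points share that resource.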

Describe alternatives you've considered

Span metrics could be used for duration and count of pipelineRuns; however, this relies on the pipelineRuns having completed. This is a limitation inherent in using traces to represent pipelineRuns: a span can only be sent when complete. Because of this, it could be preferable for the CI/CD system to expose metrics directly about the duration, count and status of pipelineRuns. These metrics could also account for in-progress builds.
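The difference can be sketched as follows. This is an illustration under assumptions, not a proposed implementation: the `PipelineRun` record, the status values, and both counting functions are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineRun:
    """Hypothetical record of a pipeline run; `end` is None while it is running."""
    pipeline: str
    status: str
    start: float
    end: Optional[float] = None


def counts_from_spans(runs):
    """Span-derived counts: only runs that have ended can have emitted a span."""
    return Counter((r.pipeline, r.status) for r in runs if r.end is not None)


def counts_from_cicd_system(runs):
    """Counts exposed directly by the CI/CD system: in-progress runs are visible too."""
    return Counter((r.pipeline, r.status) for r in runs)


runs = [
    PipelineRun("build", "success", start=0.0, end=120.0),
    PipelineRun("build", "failure", start=10.0, end=50.0),
    PipelineRun("build", "in_progress", start=200.0),
]

span_counts = counts_from_spans(runs)
direct_counts = counts_from_cicd_system(runs)
print(span_counts)
print(direct_counts)
```

The in-progress run only shows up in the counts taken directly from the CI/CD system; the span-derived counts will not include it until the run finishes.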

Additional context

CI/CD metrics were discussed at the KubeCon March 2024 SemConv users meeting. High cardinality was highlighted as an issue for per-build metrics. Notes on how to deal with cardinality were:

  • Could we use Exemplars? We could link to the build trace from some metrics. This added information might make it easier to identify pipelineRuns that need investigation.
  • Using the resource attribute for the build ID is fine for the OTel protocol, but backends (eg. Prometheus) would still have the cardinality issue when storing the time series (metric / resource attributes would be flattened into time series).
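The flattening concern in the second note can be sketched like this. The `flatten` helper is a simplified, hypothetical model of how a backend might merge resource and metric attributes into one label set; it is not Prometheus's actual logic, and the metric name and run data are made up.

```python
def flatten(resource_attrs, metric_name, point_attrs):
    """Model a time series identity as metric name plus the merged label set."""
    labels = {**resource_attrs, **point_attrs}
    return (metric_name, tuple(sorted(labels.items())))


METRIC = "pipeline.run.duration"  # hypothetical metric name

# 100 hypothetical runs of one pipeline; every fourth run fails.
runs = [
    {"id": f"run-{i}", "pipeline": "build", "status": "success" if i % 4 else "failure"}
    for i in range(100)
]

# Run id carried as a resource attribute.
series_resource = {
    flatten({"pipeline.run.id": r["id"]}, METRIC,
            {"pipeline": r["pipeline"], "status": r["status"]})
    for r in runs
}

# Run id carried as a metric (data point) attribute.
series_point = {
    flatten({}, METRIC,
            {"pipeline.run.id": r["id"], "pipeline": r["pipeline"], "status": r["status"]})
    for r in runs
}

# No run id at all: series are keyed by pipeline and status only.
series_aggregate = {
    flatten({}, METRIC, {"pipeline": r["pipeline"], "status": r["status"]})
    for r in runs
}

print(len(series_resource), len(series_point), len(series_aggregate))
```

Under this model, both placements of the run id yield one series per run after flattening, while dropping the run id keeps the series count bounded by the pipeline and status combinations.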

We can use label area:cicd instead of area:new.

I currently have a pull request open to change the metrics in the Git Provider Receiver component within the OTel Collector to better match the new conventions set in the registry. I think this can help provide contextual implementation details as part of this conversation.

adrielp avatar Jul 29 '24 15:07 adrielp

Let's create additional issues for the separate concerns of metrics:

  • vcs metrics
  • metrics related to job queues
  • metrics related to individual builds (high cardinality issue)

I.e., let's have smaller PRs to address them separately.

@adrielp Can we use #1184 for point 3, "metrics related to individual builds (high cardinality issue)", perhaps renaming the issue? Or should this be a separate issue?

Let's review this and break apart what hasn't been done into a separate issue so that we can work to address that in Phase 2. I think the majority of this ticket has been accomplished.

adrielp avatar Jul 31 '25 13:07 adrielp

We did break it down even if it was not linked explicitly here: https://github.com/open-telemetry/semantic-conventions/issues/1111#issuecomment-2317637558

The only thing remaining is https://github.com/open-telemetry/semantic-conventions/issues/1184 which should be done once https://github.com/open-telemetry/semantic-conventions/pull/2237 or https://github.com/open-telemetry/semantic-conventions/pull/2618 is merged.

kamphaus avatar Jul 31 '25 20:07 kamphaus

We can close this issue since all child-issues are now done.

kamphaus avatar Sep 04 '25 05:09 kamphaus