[cicd] Define conventions for associating host/pod metrics of a cicd runner with pipeline runs

Open adrielp opened this issue 1 year ago • 14 comments

Overview

Define cicd semantic conventions for resources / entities to allow us to link host or pod metrics of a CICD runner to any jobs using a given runner.

Previously:

~~Update the semantic conventions for CI/CD pipeline runners to embed system attributes once the embed feature is added.~~

Do I understand correctly that here we basically need to embed other attributes from the os namespace? There is an open PR to do this; maybe you can create a dedicated issue for it and postpone this until the feature can be used in semconv?

Originally posted by @trisch-me in https://github.com/open-telemetry/semantic-conventions/pull/1075#discussion_r1654395856

adrielp avatar Jun 26 '24 17:06 adrielp

The goal of this issue is to be able to link OTel signals emitted by a runner (eg. host metrics, logs, events) to any jobs using a given runner.

Would this mean embedding system metrics under the cicd namespace, or adding some cicd attributes (eg. cicd.pipeline.run.id) to the system metrics?

Related to #1111


SemConv meeting notes 2024-09-16

How could we make the link between a cicd run and metrics emitted by the runner of that run?

This could be a question related to the Entity Group. How would this link be expressed?

  • It could be a resource attribute attached to the metrics stating the cicd.pipeline.run.id
  • It could be a resource attribute of the event that states which runner the pipeline run will execute on
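As a rough illustration of the first option, the sketch below models metric points that carry resource attributes and selects those tagged with a given cicd.pipeline.run.id. It is plain Python, not the OpenTelemetry SDK; the MetricPoint type, the points_for_run helper and all sample values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetricPoint:
    """One metric data point plus the resource attributes of its source."""
    name: str
    value: float
    resource: dict = field(default_factory=dict)


def points_for_run(points, run_id):
    """Return the points whose resource names the given pipeline run."""
    return [p for p in points if p.resource.get("cicd.pipeline.run.id") == run_id]


# Hypothetical sample data: two runners, each tagged with the run it serves.
points = [
    MetricPoint("system.cpu.utilization", 0.82,
                {"host.name": "runner-a", "cicd.pipeline.run.id": "run-101"}),
    MetricPoint("system.memory.usage", 3.1e9,
                {"host.name": "runner-a", "cicd.pipeline.run.id": "run-101"}),
    MetricPoint("system.cpu.utilization", 0.10,
                {"host.name": "runner-b", "cicd.pipeline.run.id": "run-102"}),
]

selected = points_for_run(points, "run-101")
print([p.name for p in selected])
```

Because every point from one source shares the same resource, a single equality filter is enough to collect the runner telemetry for a run.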

Using a resource attribute works today: "Okay, the idea behind resource. One of the ideas is that all the telemetry generated from a particular, you know thing component would have the same set of attributes in it. So you can use that to tie, tie the knot and understand. This is the same source."

Linking metrics and a cicd run in any other way (eg. through an event) is not currently possible. The Entity SIG will be working on answering that question.

Would this mean embedding system metrics under the cicd namespace, or adding some cicd attributes (eg. cicd.pipeline.run.id) to the system metrics?

In general, it's better to just use the existing metrics without copying / embedding them into the cicd namespace.

The exception would be if we wanted to measure something cicd-related that could not be expressed using the existing metrics; then we could think about embedding a metric in the cicd namespace.


Using resource attributes is how I have implemented the link between cicd run and runner metrics in https://github.com/jenkinsci/opentelemetry-agent-metrics-plugin.

Downsides of this approach:

  • the telemetry generated is limited to the duration of the cicd run (eg. run/runner setup and teardown might not be covered)
  • if a runner executes multiple cicd runs in parallel, telemetry must be emitted separately for each run

Conceptually the link between cicd run and cicd runner is many-to-many:

  • 1-to-1: a cicd run always executes on a single fresh runner, which is discarded after the run.
  • 1-to-*: a cicd run could execute on several runners. This could be for parallelism or to use different environments (eg. macOS, Windows, Linux).
  • *-to-1: a runner could execute several runs, either concurrently, to make more efficient use of the runner's resources (similar to containers on a host), or one run after another.
  • any combination of the above
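One way to picture the many-to-many relationship is as a set of (run, runner) pairs that can be queried in both directions. The sketch below is only an illustration of the data model; the RunRunnerIndex class and the identifiers are hypothetical and not part of any convention.

```python
from collections import defaultdict


class RunRunnerIndex:
    """Bidirectional many-to-many index between pipeline runs and runners."""

    def __init__(self):
        self._runners_by_run = defaultdict(set)
        self._runs_by_runner = defaultdict(set)

    def link(self, run_id, runner_id):
        """Record that the run executed (at least partly) on the runner."""
        self._runners_by_run[run_id].add(runner_id)
        self._runs_by_runner[runner_id].add(run_id)

    def runners_of(self, run_id):
        return sorted(self._runners_by_run.get(run_id, ()))

    def runs_on(self, runner_id):
        return sorted(self._runs_by_runner.get(runner_id, ()))


index = RunRunnerIndex()
index.link("run-1", "linux-agent")    # 1-to-*: run-1 uses two runners
index.link("run-1", "macos-agent")
index.link("run-2", "linux-agent")    # *-to-1: linux-agent serves two runs

print(index.runners_of("run-1"))
print(index.runs_on("linux-agent"))
```

The 1-to-1 case is simply a run and a runner that each appear in exactly one pair.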

The runners can be static or ephemeral / auto-scaling.

We should define cicd resource semconv. To cover the *-to-1 runner case, where several jobs run on the same runner, we could make the type of cicd.pipeline.run.id string[].

Open question: Can we dynamically update resource attributes (eg using resource detection)? Or would that require the restart of node_exporter for example?

Can we dynamically update resource attributes (eg using resource detection)?

https://github.com/open-telemetry/opentelemetry-specification/blob/v1.35.0/specification/resource/sdk.md mentions that Resource is immutable.
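To make the consequence of immutability concrete, here is a minimal sketch in plain Python (not the OpenTelemetry SDK). It assumes the string[] type for cicd.pipeline.run.id proposed above, which is not an agreed convention; make_resource and with_runs are hypothetical helpers. Changing the set of runs yields a new resource object, and the original is left untouched.

```python
from types import MappingProxyType


def make_resource(attributes):
    """Build a read-only attribute mapping standing in for a Resource."""
    return MappingProxyType(dict(attributes))


def with_runs(resource, run_ids):
    """Return a NEW resource whose run id list is replaced; the input is not modified."""
    updated = dict(resource)
    updated["cicd.pipeline.run.id"] = tuple(sorted(run_ids))
    return make_resource(updated)


base = make_resource({"host.name": "runner-a", "cicd.pipeline.run.id": ("run-1",)})
current = with_runs(base, {"run-1", "run-2"})

print(base["cicd.pipeline.run.id"])
print(current["cicd.pipeline.run.id"])

try:
    base["cicd.pipeline.run.id"] = ("run-3",)  # in-place mutation is rejected
except TypeError as exc:
    print("immutable:", exc)
```

In this model, whatever emits the telemetry would have to switch to the new resource object, which is why the question of restarting the collecting process comes up.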

Or would that require the restart of node_exporter for example?

Most likely. Might this change with the Entity changes?

To update the resource attributes at runtime when the pipeline runs associated with a runner change, without restarting the process that collects the OTel signals, we need:

  • https://github.com/open-telemetry/opentelemetry-specification/issues/1298

Discussed in CICD SIG 2025-01-30:

  • Rename the title and update the description of this issue. We received feedback from general SemConv that embedding the attributes/metrics is most likely not the way to go (it would break any existing dashboards). The only exception to embedding is if the semantics are different (ie. allowing us to provide a different meaning/description). The solution is most likely to define entities for the CICD domain (ie. semantic conventions for resource attributes).
  • I will join the Entity SIG to get additional feedback.
  • Idea: Instead of the host or pod metrics referencing the pipeline run (in a resource attribute), can we invert this link? Ie. the execution spans of the pipeline run know on which runner they are executing (they set an attribute in the span). Could we use spanmetricsconnector to transform this into an info metric that would allow us to filter for the host/pod IDs executing a given run and then display the host/pod metrics for those host/pod IDs?
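The idea in the last bullet can be sketched as a small transformation from span attributes to info-style series. This is a plain Python stand-in for the concept, not the spanmetricsconnector configuration or API; the span dictionaries, the info_series_from_spans helper and the choice of host.id as the runner key are assumptions.

```python
def info_series_from_spans(spans, keys=("cicd.pipeline.run.id", "host.id")):
    """Collapse spans into one constant-valued series per distinct label set."""
    series = {}
    for span in spans:
        attrs = span["attributes"]
        if all(key in attrs for key in keys):
            labels = tuple((key, attrs[key]) for key in keys)
            series[labels] = 1  # info-style: the labels carry the information
    return series


# Hypothetical execution spans of two pipeline runs.
spans = [
    {"name": "build", "attributes": {"cicd.pipeline.run.id": "run-101", "host.id": "i-0abc"}},
    {"name": "test", "attributes": {"cicd.pipeline.run.id": "run-101", "host.id": "i-0abc"}},
    {"name": "deploy", "attributes": {"cicd.pipeline.run.id": "run-102", "host.id": "i-0def"}},
    {"name": "checkout", "attributes": {"cicd.pipeline.run.id": "run-103"}},  # no host.id: skipped
]

for labels, value in info_series_from_spans(spans).items():
    print(dict(labels), value)
```

Several spans of the same run on the same host collapse into a single series, so the result grows with the number of (run, host) pairs rather than with the number of spans.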

Discussed in Entities SIG 2025-01-30:

For the CICD domain we might want to define multiple entities:

  • CICD system (eg. Jenkins, GitHub Actions)
  • Pipeline run
  • Agent (could be host, pod, container)

Open problem in Entities SIG:

  • How can the relationship be modeled if a metric or entity (eg. a pipeline run) can be related to multiple entities, possibly of different types (eg. there could be multiple runners, and each could be a host or a container)?

I'm currently exploring the idea "Instead of the host or pod metrics referencing the pipeline run (in a resource attribute), can we invert this link?". Hopefully this will facilitate any SemConv work addressing #1184.

I did the investigation, which resulted in:

  • https://github.com/jenkinsci/opentelemetry-agent-metrics-plugin/pull/65

It is indeed possible to use standard host/pod/container metrics by scraping node-exporter, k8s API metrics, cAdvisor and/or kube-state-metrics. These metrics would not have any link to the CI runs (neither in attributes nor in resource attributes). As long as we have an info metric like ci_podspan_info_total in ci-pod-metrics-dashboard.json, we can query for the EC2 instance or namespace/pod a pipeline run is executing in. It's also possible to transform CICD traces in an OTel pipeline to export this ci_podspan_info_total metric: collector.yaml
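The lookup described above is essentially a join through the info metric. The following sketch shows that join on in-memory data, assuming hypothetical label names (run_id, namespace, pod); the actual labels on ci_podspan_info_total and the dashboard queries in the plugin may differ.

```python
def host_samples_for_run(info_series, host_samples, run_id):
    """Join through the info metric: run id -> pods -> unlabelled pod samples."""
    pods = {
        (series["namespace"], series["pod"])
        for series in info_series
        if series["run_id"] == run_id
    }
    return [s for s in host_samples if (s["namespace"], s["pod"]) in pods]


# Hypothetical info series: which pod executed which pipeline run.
info_series = [
    {"run_id": "run-101", "namespace": "ci", "pod": "agent-7f9c"},
    {"run_id": "run-102", "namespace": "ci", "pod": "agent-2b1d"},
]

# Standard pod metrics with no reference to any CI run.
host_samples = [
    {"metric": "container_cpu_usage_seconds_total", "namespace": "ci", "pod": "agent-7f9c", "value": 12.5},
    {"metric": "container_cpu_usage_seconds_total", "namespace": "ci", "pod": "agent-2b1d", "value": 3.0},
    {"metric": "container_cpu_usage_seconds_total", "namespace": "prod", "pod": "web-1", "value": 99.0},
]

matched = host_samples_for_run(info_series, host_samples, "run-101")
print(matched)
```

The host samples stay free of CI labels; only the small info series knows about pipeline runs.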

Current limitations of this approach:

  • Even though the metric is called ci.podspan.info it is not strictly an info metric (type is counter, hence the _total suffix). → Can the type of metric be changed in an otel pipeline or would this require a custom processor?
  • The transformation pipeline is based on a single span attribute host.ip to try to resolve the k8s resource attributes. It does not include any information on the agent type (ie. whether the agent is running on k8s or docker or on VM/baremetal). This makes it impossible to have a single dashboard for all pipeline run resource usage metrics, needing one for VM/baremetal and one for k8s. This then requires the author of the CI pipeline to know to which dashboard to link. → Is there a way to specify the type of host?

Conclusion

There are two possibilities for linking agent host metrics with the CI pipeline run:

  • associate the host metrics with the cicd and vcs resources defined in https://github.com/open-telemetry/semantic-conventions/pull/2013
  • use standard host metrics with an additional info metric to make the link between the host id and the CICD pipeline run

kamphaus avatar Apr 02 '25 15:04 kamphaus

From #1111: It's important that any guidance for per-run metrics has a warning about high cardinality
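A back-of-the-envelope calculation shows why the warning matters. The model below is deliberately simplified and the numbers are purely illustrative: it compares labelling every host series with a run id against keeping host series unlabelled and adding one info series per (run, runner) pair. Both function names are hypothetical.

```python
def series_with_run_attribute(host_series_per_runner, runs_per_runner, runners):
    """Every host series is repeated once per run id it gets labelled with."""
    return host_series_per_runner * runs_per_runner * runners


def series_with_info_metric(host_series_per_runner, runs_per_runner, runners):
    """Host series stay unlabelled; one info series is added per (run, runner) pair."""
    return host_series_per_runner * runners + runs_per_runner * runners


# Illustrative numbers: 10 runners, 500 host series each,
# 200 runs per runner within the retention window.
labelled = series_with_run_attribute(500, 200, 10)
via_info = series_with_info_metric(500, 200, 10)
print(labelled, via_info)  # 1000000 7000
```

Under these assumptions the per-run attribute multiplies the series count by the number of runs, while the info metric only adds to it.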

kamphaus avatar Apr 02 '25 20:04 kamphaus

@christophe-kamphaus-jemmic is this roughly a duplicate of #1111? It seems slightly different, so I'm wondering what we can do to break down the work and get some things closed and done on this front. Relatedly, I think the entities proposal has made its way to review in #2123.

adrielp avatar Apr 17 '25 13:04 adrielp

#1111 was the original issue for tracking CICD metrics SemConv. Since it was too big to implement in a single PR we split it up into smaller issues:

  • #1372
  • #1600
  • and #1184

We can close #1111 once #1184 is done.

kamphaus avatar Apr 21 '25 09:04 kamphaus

associate the host metrics with the cicd and vcs resources defined in #2013 (Add CICD spans and resources)

After the PR #2013 and the entity PR https://github.com/open-telemetry/semantic-conventions/pull/2123 are merged, we can start linking host / pod metrics to pipeline runs / workers.

use standard host metrics with an additional info metric to make the link between the host id and the CICD pipeline run

Discussed in SemConv 2025-04-28:

Info metrics are currently not defined in semantic conventions, nor does Prometheus have info metrics. Info metrics are on @jsuereth's todo list, but that will only happen in a few months. Until then we can define them as a gauge metric. We can also take a look at how kube-state-metrics uses info metrics.

kamphaus avatar Apr 28 '25 20:04 kamphaus

#2013 and #2123 are merged, and #2013 already added entity associations for the cicd/vcs resources and the existing cicd/vcs metrics.

Remaining todo: How can we link individual pipeline runs with host/pod metrics? Would it require changing any host, memory, network, os, ... metrics to add the entity association to the several cicd/vcs resources? This would mean that most host/pod metrics would list several cicd/vcs resources. I'm not sure this is the clearest way to explain this relation. Or would it be better to add a section to the cicd resource markdown or the cicd metrics page?

For the second point, about the info metric, I will prepare a PR.

kamphaus avatar May 02 '25 20:05 kamphaus

Discussed in SemConv 2025-08-04:

Currently there is no progress on defining info metrics in SemConv in a generic way.

This will be discussed in Entities SIG as part of transforming the entity relationships or entity changed descriptive attributes signal into an info metric. (This could take a year.)

To make progress right now:

  • Create a generic SemConv issue to start the discussion around info metrics. Give examples of existing info / state metrics inside and outside k8s.
  • Improve the PR: give examples of use cases and expand on the description.

kamphaus avatar Aug 04 '25 16:08 kamphaus

Create a generic SemConv issue to start the discussion around info metrics

Done here:

  • https://github.com/open-telemetry/semantic-conventions/issues/2595

kamphaus avatar Aug 04 '25 20:08 kamphaus