
Enhancement: tracing support in Prow jobs

howardjohn opened this issue · 6 comments

What would you like to be added: Support for distributed tracing in Prow. More details on what this means are below.

Why is this needed: To give visibility into job execution, both in a single job and in aggregate.


The end result we are looking for is to be able to generate a trace roughly like the following:

[Screenshot: example trace of a Prow job generated by the POC]

This was done via a POC; I think the real one can include more information.

Prior Art

  • https://gitlab.com/gitlab-org/gitlab/-/issues/338943
  • https://buildkite.com/docs/agent/v3/tracing
  • https://plugins.jenkins.io/opentelemetry/
  • https://github.com/tektoncd/community/blob/main/teps/0124-distributed-tracing-for-tasks-and-pipelines.md

Implementation

Prow job tracing primarily involves two parts: the infrastructure components and the actual test logic. These should be formed into a single cohesive trace (see the picture above; test logic is shown in yellow).

Test Logic

For the most part, how a test handles tracing is outside the scope of Prow; it is the job author's responsibility. However, one aspect that needs care is ensuring that spans reported by the test attach to the same trace as the infrastructure spans.

This is done via propagation. In distributed systems, propagation is typically handled with HTTP headers (traceparent), but that doesn't apply here. While there is no ratified standard for propagation outside of HTTP, there is a growing de-facto standard (see the prior art above) of using a TRACEPARENT environment variable (https://github.com/open-telemetry/opentelemetry-specification/issues/740), which seems well suited. This environment variable will need to be passed into the Pod environment and respected when the job sends traces.
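To make this concrete, here is a minimal sketch (assuming the OpenTelemetry Go SDK; the package and function names are illustrative, not existing Prow code) of how a test could turn TRACEPARENT into a parent context for its own spans:

```go
package tracing

import (
	"context"
	"os"

	"go.opentelemetry.io/otel/propagation"
)

// contextFromTraceparent returns a context carrying the remote span context
// encoded in the TRACEPARENT env var (W3C trace-context format). If the
// variable is unset, the context is returned unchanged and any spans started
// from it will begin a fresh trace.
func contextFromTraceparent(ctx context.Context) context.Context {
	tp := os.Getenv("TRACEPARENT")
	if tp == "" {
		return ctx
	}
	carrier := propagation.MapCarrier{"traceparent": tp}
	return propagation.TraceContext{}.Extract(ctx, carrier)
}
```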

Sending traces from the job is fairly straightforward from that point on. Job authors will need to configure the job to send to the same tracing backend, of course, but can otherwise send traces as normal. One wrinkle is that many jobs are largely bash; https://github.com/equinix-labs/otel-cli seems well suited to those cases.
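Continuing the sketch above (a second file in the same illustrative package): with the parent context in hand, the test emits spans as usual. runStep is a hypothetical helper, and it assumes a global TracerProvider has already been installed that exports to the same backend as the infra spans (see the configuration sketch further down).

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
)

// runStep records one logical step of the test as a span under the propagated trace.
func runStep(ctx context.Context, name string, fn func(context.Context) error) error {
	ctx, span := otel.Tracer("prow-test").Start(ctx, name)
	defer span.End()

	if err := fn(ctx); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}
```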

Prow Infra

For the infra side, we will need to report spans about a variety of things. I think some interesting things to measure are:

  • End to end job execution, as the root span
  • Pod start - end
  • Pod scheduling, image pulling, etc
  • Containers running
  • Actions within these containers - for example, git operations within clonerefs

I think there are two main approaches to this:

  1. Making a tracing reporter. This can look at the ProwJob and maybe other artifacts (clone-records.json) and compute the spans after the fact (it's perfectly fine to send spans out of order and in the past).

This is POCed in https://github.com/howardjohn/prow-tracing (as a standalone binary that is pointed at a historic job).

This approach seems the least invasive to me, and I think it is pretty effective.

One concern here is that, since we are creating the spans after the job runs, we cannot set the TRACEPARENT environment variable on the job in this approach. There are a few ways around this. Either we do a bit of the next option and emit just the root span outside of the reporter, or we can exploit the fact that trace IDs are globally unique 16-byte values, just like the prowjob build UID. Using this, we can always derive the root span's trace ID from the build ID, and test execution can fall back to PROW_BUILD_ID when TRACEPARENT is not set (or that variable can be set automatically by Prow). This is the approach taken in the POC above, and it is sketched below.
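A rough sketch of the reporter idea follows. It is not the POC's actual code: the SHA-256 derivation of the 16-byte trace ID from the build ID, and the function and type names, are assumptions for illustration. The key points are that spans can carry explicit (past) timestamps and that the trace ID can be made predictable from the build ID.

```go
package reporter

import (
	"context"
	"crypto/rand"
	"crypto/sha256"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// buildTraceID deterministically maps a Prow build ID to a 16-byte trace ID.
func buildTraceID(buildID string) trace.TraceID {
	sum := sha256.Sum256([]byte(buildID))
	var tid trace.TraceID
	copy(tid[:], sum[:16])
	return tid
}

// fixedTraceIDGenerator always hands out the same trace ID (so the root span
// lands on the ID derived from the build ID), while span IDs stay random.
type fixedTraceIDGenerator struct{ tid trace.TraceID }

func (g fixedTraceIDGenerator) NewIDs(ctx context.Context) (trace.TraceID, trace.SpanID) {
	return g.tid, g.NewSpanID(ctx, g.tid)
}

func (g fixedTraceIDGenerator) NewSpanID(_ context.Context, _ trace.TraceID) trace.SpanID {
	var sid trace.SpanID
	rand.Read(sid[:])
	return sid
}

// report emits the whole job as spans after the fact, using timestamps
// recorded in the ProwJob status (startedAt/finishedAt stand in for
// pj.Status.StartTime / pj.Status.CompletionTime).
func report(exp sdktrace.SpanExporter, buildID string, startedAt, finishedAt time.Time) {
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithIDGenerator(fixedTraceIDGenerator{tid: buildTraceID(buildID)}),
	)
	defer tp.Shutdown(context.Background())

	tracer := tp.Tracer("prow-tracing-reporter")
	// Root span: end-to-end job execution, stamped with the real start time.
	ctx, root := tracer.Start(context.Background(), "prowjob",
		trace.WithTimestamp(startedAt))
	// Child spans (pod scheduling, clonerefs, containers, ...) would be
	// created from ctx here, based on pod events and artifacts such as
	// clone-records.json.
	_ = ctx
	root.End(trace.WithTimestamp(finishedAt))
}
```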

  2. Native integration

Rather than retroactive analysis, we can do 'proper' tracing and integrate it throughout Prow. This would allow us to generate extremely fine-grained traces about whatever we want. The risk is that it permeates the entire codebase, unlike the reporter mode, which is completely standalone.

Configuration

I propose this only supports OpenTelemetry, which is the only recommended option these days. Within OpenTelemetry, though, a variety of "exporters" are allowed. The primary one is OTLP, a common protocol implemented by many vendors. In addition, OpenTelemetry offers a collector that accepts OTLP and can do a variety of things with it, including exporting to essentially any backend.

One notable vendor that does not support OTLP is GCP tracing. I think most Prow users are on GCP, so this is a natural backend to use.

We could support OTLP + GCP, or support just OTLP and have GCP users deploy a collector.

So overall, I think we will only need a couple of config items: the collector endpoint and maybe a few others. A rough sketch of what that could look like is below.
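In this sketch, the Options struct and its field names are hypothetical (there is no such Prow config stanza today), but the wiring uses the standard OTLP gRPC exporter from the OpenTelemetry Go SDK:

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// Options is a hypothetical config block for tracing support.
type Options struct {
	// Endpoint of the OTLP gRPC collector, e.g. "otel-collector.monitoring:4317".
	Endpoint string
	// Insecure disables TLS, useful for an in-cluster collector.
	Insecure bool
}

// NewTracerProvider wires the configured collector into an SDK tracer provider.
func NewTracerProvider(ctx context.Context, o Options) (*sdktrace.TracerProvider, error) {
	opts := []otlptracegrpc.Option{otlptracegrpc.WithEndpoint(o.Endpoint)}
	if o.Insecure {
		opts = append(opts, otlptracegrpc.WithInsecure())
	}
	exp, err := otlptracegrpc.New(ctx, opts...)
	if err != nil {
		return nil, err
	}
	return sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp)), nil
}
```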

howardjohn · Jul 05 '23