
OTEL variables intended for bake build delays build

Open rrjjvv opened this issue 4 months ago • 5 comments

Contributing guidelines

I've found a bug and checked that ...

  • [x] ... the documentation does not mention anything about my problem
  • [ ] ... there are no open or closed issues that are related to my problem

Description

Overriding OTEL-related bake variables (intended for building an image) can introduce artificially long build delays depending on the variable values.

(Note: though a real issue, this is not something I encountered in real usage; this is mainly a companion to compose issue https://github.com/docker/compose/issues/13157, which is something I encountered.)

Expected behaviour

For a bake variable used only in the building of an image, I would expect the value to influence build time to the extent that it impacts the build cache. Even more specifically, I would expect a fully-cached build to complete in sub-second time.

Actual behaviour

Depending on the variable name and value, a delay of ten seconds (or more) can occur despite being fully cached.

Buildx version

github.com/docker/buildx v0.26.1 1a8287f

Docker info

Client: Docker Engine - Community
 Version:    28.3.3
 Context:    default
 Debug Mode: false
 Plugins:
  ai: Docker AI Agent - Ask Gordon (Docker Inc.)
    Version:  v1.9.3
    Path:     /home/robertovillarreal/.docker/cli-plugins/docker-ai
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.26.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.39.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  model: Docker Model Runner (EXPERIMENTAL) (Docker Inc.)
    Version:  v0.1.36
    Path:     /usr/libexec/docker/cli-plugins/docker-model
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 11
  Running: 4
  Paused: 0
  Stopped: 7
 Images: 97
 Server Version: 28.3.3
 Storage Driver: overlayfs
  driver-type: io.containerd.snapshotter.v1
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /etc/cdi
  /var/run/cdi
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc sysbox-runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
 runc version: v1.2.5-0-g59923ef
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-60-generic
 Operating System: Ubuntu 24.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 30.56GiB
 Name: l-9jylpn3
 ID: B6K2:2BOW:BSIE:WIGE:RODV:GC2B:JMYF:6XP4:25AT:3S4Q:6634:3OII
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 67
  Goroutines: 127
  System Time: 2025-08-25T20:52:31.9142915-06:00
  EventsListeners: 1
 Experimental: true
 Insecure Registries:
<snip>
  192.168.1.0/24
  ::1/128
  127.0.0.0/8
 Registry Mirrors:
  http://localhost:5005/
  http://localhost:5006/
  http://localhost:5007/
 Live Restore Enabled: false
 Default Address Pools:
   Base: 172.25.0.0/16, Size: 24

Builders list

NAME/NODE                DRIVER/ENDPOINT                             STATUS     BUILDKIT   PLATFORMS
buildkit-dev             docker-container                                                  
 \_ buildkit-dev0         \_ unix:///var/run/docker.sock             inactive              
jd                       docker-container                                                  
 \_ jd0                   \_ unix:///var/run/docker.sock             running    v0.22.0    linux/amd64 (+4), linux/arm64, linux/arm (+2), linux/ppc64le, (7 more)
<snip inactive but sensitive entries>
temp                     docker-container                                                  
 \_ temp0                 \_ unix:///var/run/docker.sock             inactive              
default*                 docker                                                            
 \_ default               \_ default                                 running    v0.23.2    linux/amd64 (+4), linux/arm64, linux/arm (+2), linux/ppc64le, (6 more)
teamx                    docker                                                            
 \_ teamx                 \_ teamx                                   running    v0.23.2    linux/amd64 (+4), linux/arm64, linux/arm (+2), linux/ppc64le, (6 more)
worker                   docker                                                            
 \_ worker                \_ worker                                  running    v0.23.2    linux/amd64 (+2), linux/arm64, linux/arm (+2), linux/ppc64le, (5 more)

Configuration

Bake file:

variable "OTEL_TRACES_EXPORTER" {
  type = string
  default = "none"
}

target "default" {
  dockerfile-inline = <<-EOT
    FROM busybox
    ARG OTEL_TRACES_EXPORTER
    RUN echo "using $OTEL_TRACES_EXPORTER"
  EOT
  args = {
    OTEL_TRACES_EXPORTER = OTEL_TRACES_EXPORTER
  }
}

Very fast, as expected:

$ date && docker buildx bake && date
Mon Aug 25 09:00:30 PM MDT 2025
[+] Building 0.2s (7/7) FINISHED                                                                                                                                                                                         docker:default
<snip>
Mon Aug 25 09:00:30 PM MDT 2025

Always takes ten seconds (note that the builder still reports 0.2 seconds, as above; the ten seconds is wall time):

$ date; OTEL_TRACES_EXPORTER=otlp docker buildx bake default; date
Mon Aug 25 09:04:19 PM MDT 2025
[+] Building 0.2s (7/7) FINISHED                                                                                                                                                                                         docker:default
<snip>
Mon Aug 25 09:04:29 PM MDT 2025

To help illustrate the lag:

$ date; OTEL_TRACES_EXPORTER=otlp docker buildx bake default --progress rawjson; date
Mon Aug 25 09:06:09 PM MDT 2025
{"vertexes":[{"digest":"sha256:032bddc7348073368c320605544d844c00e2b5f7e6ed7271de7ecf8e6e49821d","name":"[internal] load local bake definitions","started":"2025-08-25T21:06:10.011840056-06:00"}]}
<snip... time between these two are .2 seconds>
{"vertexes":[{"digest":"sha256:cf54b426da55281043924583d6743b1f70151b6a0169f1b1d3ee7de26f96edee","name":"exporting to image","started":"2025-08-26T03:06:10.234018654Z","completed":"2025-08-26T03:06:10.283281877Z"}]}
<ten seconds between last printed vertex and bake execution>
Mon Aug 25 09:06:20 PM MDT 2025
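
For anyone who wants to compare the two cases without reading timestamps by eye, a small wall-clock helper along these lines may help. This is only a sketch: `elapsed` is a hypothetical helper name, not part of docker or buildx, and it discards the command's output.

```shell
# Print the wall-clock seconds a command takes, discarding its output.
# `elapsed` is a hypothetical helper, not a docker/buildx feature.
elapsed() {
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  end=$(date +%s)
  echo $((end - start))
}

# Usage (requires docker, so shown as comments):
#   elapsed docker buildx bake default
#   elapsed env OTEL_TRACES_EXPORTER=otlp docker buildx bake default
```

The second usage line goes through `env` so that the variable is set only for the docker process and not for the calling shell.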

Build logs


Additional info

My reproduction is not something I'd do in reality; it is very common for me to use OTEL environment variables in a Dockerfile, but they are always static values and not something I'd change at build time. But somebody else might. Though I discovered this "on accident" (https://github.com/docker/compose/issues/13157), this is a grey area. Obviously the BUILDX_*, DOCKER_*, etc. environment variables are more-or-less 'protected', but not OTEL_*. In my example, there doesn't appear to be a way for the user to say "I only want to influence my bake file" or "I want to influence buildx telemetry", or worse... "I want to influence both".

I chose OTEL_TRACES_EXPORTER in my reproduction because in the absence of other OTEL variables, it consistently gives a ten second lag. But if my example was OTEL_EXPORTER_OTLP_ENDPOINT, it would be unlikely that one value would be 'correct' for both inclusion in the image as well as buildx telemetry. And there would be no way to provide each (buildx itself, and the image being created) with its own 'correct' value.

Though a fix for this would likely be low priority (on the bake side), I thought your ideas about potential solutions might help influence what the compose folks do. I noticed that #2447 seems somewhat related (esp. the solutions/strategies discussed).

rrjjvv Aug 26 '25 03:08

Might be related to https://github.com/moby/buildkit/issues/4616 or at least similar? The symptoms sound the same but the reproduction sounds different. At the same time, they may be related. buildx might be trying to use the tracer and timing out at 10 seconds when it can't reach it. The actual containers may be using the environment variable completely fine because the buildkit instance has access.

jsternberg Oct 06 '25 16:10

Almost certainly related, but slightly different. That one mentions buildkit being intentionally configured with a bad value. In that scenario, a hang obviously isn't desirable, but it's understandable. In my case, buildx is consuming that value for itself (which is my actual bug), whereas my intention was for the value to be a build argument to a Dockerfile.

I had forgotten about it until just now, but I've experienced this same ten second delay (also involving OTEL_TRACES_EXPORTER) in another project, though in that case it involved the docker daemon itself: https://github.com/earthly/earthly/issues/4066. So maybe your twenty seconds is actually two delays: a ten second delay from the buildkit daemon (analogous to that bug report involving the docker daemon), and then another ten second delay from buildx itself if it inherited that value like you suggested. Just a thought.

So yeah, both are the same as far as bad values causing delays, though my report is that the value was not even intended for bake. (Compose had that same issue as well, but had a relatively straightforward fix since there's a separation between variables for compose itself vs. variables for downstream containers, but bake doesn't have that separation.)

rrjjvv Oct 06 '25 20:10

Sorry for the long delay on a response to your comment above.

I think this is likely working as designed. Sending OTEL traces is a feature in buildx. While the desired result is to send the environment variable to the builds, it's still setting an environment variable.

The only way I can think of around this is to add the ability to set variables without using environment variables or to use a different name for the environment variable in your configuration when passing it to bake. Something like:

variable "BUILD_TRACES_EXPORTER" {
}

target "mytarget" {
    args = {
        OTEL_TRACES_EXPORTER = BUILD_TRACES_EXPORTER
    }
}

Then set BUILD_TRACES_EXPORTER in the environment to pass the value to your build in bake. It might also be possible for us to add an environment variable specific to buildx to disable the OTEL SDK but that wouldn't be my preferred option.
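
As a rough sketch of how the renamed variable could be used from a shell (the wrapper name `bake_with_exporter` is hypothetical; only `docker buildx bake` and the variable names come from the bake file above), something like the following keeps the value away from buildx's own telemetry:

```shell
# Hypothetical wrapper: pass an exporter value to the bake variable
# BUILD_TRACES_EXPORTER while making sure buildx itself does not see
# OTEL_TRACES_EXPORTER. The function body runs in a subshell, so the
# caller's environment is left untouched.
bake_with_exporter() (
  exporter=$1
  shift
  unset OTEL_TRACES_EXPORTER
  BUILD_TRACES_EXPORTER="$exporter" docker buildx bake "$@"
)

# Usage (requires docker, so shown as a comment):
#   bake_with_exporter otlp mytarget
```

The explicit `unset` is an assumption about the caller's environment: it only matters when OTEL_TRACES_EXPORTER is already exported, for example in CI.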

jsternberg Dec 04 '25 15:12

The only way I can think of around this is to add the ability to set variables without using environment variable

Yeah. It wouldn't have to replace the usage of environment variables, just control precedence... e.g.

# closer to a typical CI scenario and your "working as designed"... 
# buildkit consumes 'otlp' for itself, but 'console' is passed for the variable itself
$ OTEL_TRACES_EXPORTER=otlp docker buildx bake default --var OTEL_TRACES_EXPORTER=console
# my scenario, where the environment variable is not set at all and there is no OTEL setup, but set for illustration...
# buildkit consumes 'none', and variable set to 'otlp'
$ OTEL_TRACES_EXPORTER=none docker buildx bake default --var OTEL_TRACES_EXPORTER=otlp
# shorter equivalent
$ docker buildx bake default --var OTEL_TRACES_EXPORTER=otlp

or to use a different name for the environment variable in your configuration when passing it to bake

Yup. Pretty straightforward and not terrible... provided you're already aware of the issue/behavior. I guess another work-around would be not declaring the variable at all, and instead doing something like

$ docker buildx bake default --set '*.args.OTEL_TRACES_EXPORTER=otlp'

(That would probably be my go-to solution assuming nothing changes.)

It might also be possible for us to add an environment variable specific to buildx to disable the OTEL SDK but that wouldn't be my preferred option.

Agreed.

Adding something like --var would obviously solve this, but unless that was documented as the preferred method, my guess is that folks would only start using it once they encountered this issue (i.e., already too late). All things considered, the best practical 'solution' might just be a quick note somewhere in the docs.

rrjjvv Dec 04 '25 18:12

I'm going to switch this to a feature request since it's technically working as designed, but we'll consider whether we're going to add another way to specify variables on the command line outside of environment variables.

jsternberg Dec 04 '25 20:12