opentelemetry-demo

Build multi-target images as part of release

Open austinlparker opened this issue 2 years ago • 7 comments

To address #396

austinlparker avatar Oct 17 '22 15:10 austinlparker

Does multi-target mean multi-arch/platform (such as amd64 and arm64)? If so, you can assign it to me; I am supporting this. :-P

JaredTan95 avatar Oct 18 '22 10:10 JaredTan95

If possible, it'd be nice to have this by the 1.0 release on Friday, but nbd if you don't have the bandwidth.

cartersocha avatar Oct 19 '22 06:10 cartersocha

@JaredTan95

cartersocha avatar Oct 19 '22 06:10 cartersocha

Reopening this -- after merging this in, release builds fail.

austinlparker avatar Oct 20 '22 03:10 austinlparker

@JaredTan95 could you take another look today?

austinlparker avatar Oct 20 '22 03:10 austinlparker

https://github.com/open-telemetry/opentelemetry-demo/actions/runs/3286470674/jobs/5414695123 is a link to a failed run

austinlparker avatar Oct 20 '22 03:10 austinlparker

I noticed the revert PR https://github.com/open-telemetry/opentelemetry-demo/pull/502. I found the cause of the failure and will reopen the PR after fixing it.

JaredTan95 avatar Oct 21 '22 01:10 JaredTan95

Updates to this issue for posterity:

  • I was able to fix the issues causing failed builds. These were mostly due to build contexts not being standard across every service.
  • Another persistent issue was OOM kills of the build containers. After some investigation, these seemed to be related to the amount of available memory on a GHA Runner as well as the overhead of trying to parallelize certain build steps.

You can see a successful build here: https://github.com/open-telemetry/opentelemetry-demo/actions/runs/3313405848

However, instead of reducing build time, we've dramatically increased it. There are a few reasons for this:

  • Forcing 1x parallelism on Docker itself; turning this off results in OOM kills.
  • Adding swap space to work around memory limitations of runners.
  • Emulating arm64 on x86.
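
For readers unfamiliar with the setup, the constraints above can be sketched roughly as follows. This is only an illustration, not the workflow used in the repository: the image name, context path, and helper functions are hypothetical. `docker buildx build --platform` and BuildKit's `max-parallelism` worker setting are real options, but whether the release build used exactly these is an assumption.

```python
import shlex

# Hypothetical values for illustration; not taken from the repository.
PLATFORMS = ["linux/amd64", "linux/arm64"]


def buildkitd_config(max_parallelism: int = 1) -> str:
    """Return a buildkitd.toml fragment that caps concurrent build steps."""
    return f"[worker.oci]\n  max-parallelism = {max_parallelism}\n"


def buildx_command(image, context, platforms=PLATFORMS, push=True):
    """Assemble a `docker buildx build` argv for one multi-platform image."""
    cmd = ["docker", "buildx", "build",
           "--platform", ",".join(platforms),
           "--tag", image]
    if push:
        cmd.append("--push")
    cmd.append(context)
    return cmd


if __name__ == "__main__":
    # Print the config fragment and the command instead of running Docker.
    print(buildkitd_config())
    print(shlex.join(buildx_command("example/demo:latest", "./src/example-service")))
```

With both platforms in a single invocation on an x86 runner, the arm64 half runs under emulation, which is where much of the extra build time would come from.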

In an attempt to work around this, I've discarded several solutions:

  • Local caching doesn't help at all since runners are ephemeral.
  • Remote caching (i.e., publish intermediate layers) would help but not durably since anytime there's a gRPC/OpenTelemetry update we'd have to do a full rebuild.
  • It doesn't seem like it's possible to build different platforms on different machines then merge the manifests later, although it kinda seems like it should be possible. Either way, we only have access to x86 runners and at best this would halve the build time, still leaving us north of 2 hours.
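
On the third point, one approach that may work is to push one tag per platform from separate jobs and then combine them with `docker buildx imagetools create`, which assembles a manifest list from images already in a registry. The sketch below only assembles the commands and has not been tested against this repository; the tag scheme and function names are hypothetical.

```python
def per_platform_tag(image: str, platform: str) -> str:
    """Derive a per-platform tag such as example/demo:1.0-arm64.

    Hypothetical scheme; only handles simple "os/arch" platform strings.
    """
    arch = platform.split("/")[-1]
    return f"{image}-{arch}"


def build_commands(image, context, platforms):
    """One `docker buildx build` argv per platform, each pushing its own tag."""
    return [
        ["docker", "buildx", "build",
         "--platform", p,
         "--tag", per_platform_tag(image, p),
         "--push", context]
        for p in platforms
    ]


def merge_command(image, platforms):
    """Argv that combines the per-platform tags into one multi-arch tag."""
    sources = [per_platform_tag(image, p) for p in platforms]
    return ["docker", "buildx", "imagetools", "create", "--tag", image] + sources


if __name__ == "__main__":
    platforms = ["linux/amd64", "linux/arm64"]
    for cmd in build_commands("example/demo:1.0", ".", platforms):
        print(" ".join(cmd))
    print(" ".join(merge_command("example/demo:1.0", platforms)))
```

Each per-platform build could run as its own job, but with only x86 runners available the arm64 job would still be emulated, so the best case remains roughly half the total build time.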

My current train of thought is to see if it's possible to simply throw more resources at the problem. I've opened https://github.com/open-telemetry/community/issues/1281 to request that larger runner support be added to the organization. I suspect that if we could 2x or 3x our runner size, these problems would be mitigated.

There is one other solution I have in mind: removing gRPC from the areas where it's causing problems. Payment, Quote, and Shipping seem to be the three big problem areas (especially Quote), so removing bloat there would probably help. Similarly, it may be worthwhile to go through and normalize and update the gRPC libraries; a lot of them seem outdated, and newer versions may be more performant and compact.

austinlparker avatar Oct 31 '22 14:10 austinlparker

What's the current state here, @austinlparker? I think the current build is just x86, right? Our performance is much better now.

cartersocha avatar Nov 18 '22 06:11 cartersocha

I think this has been solved in #536. Closing for now

cartersocha avatar Nov 20 '22 21:11 cartersocha

Are you sure, @cartersocha? The demo Docker images seem to be amd64 only: https://hub.docker.com/r/otel/demo/tags

nlamirault avatar Nov 23 '22 08:11 nlamirault

> Are you sure, @cartersocha? The demo Docker images seem to be amd64 only: https://hub.docker.com/r/otel/demo/tags

The next tag will release multi-arch images.

JaredTan95 avatar Nov 23 '22 12:11 JaredTan95

Actually we had to remove multi-arch because it takes 4 hours to build. We're working on alternatives still to reduce build time and make this feasible.

austinlparker avatar Nov 23 '22 13:11 austinlparker

The 1.3.1 release is multi-arch

puckpuck avatar Mar 09 '23 03:03 puckpuck