opentelemetry-demo
Build multi-target images as part of release
To address #396
multi-target
means multi-arch/platform (such as amd64 and arm64)? If so, you can assign it to me, I can support this. :-P
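For context, a multi-platform build with Docker buildx looks roughly like the following; the service path and tag are illustrative, not the exact commands used in the workflow:

```sh
# Create a buildx builder (assumes QEMU emulation is available for non-native archs).
docker buildx create --use

# Build and push a single image for both amd64 and arm64 in one invocation.
# Dockerfile path and tag here are hypothetical examples.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -f ./src/frontend/Dockerfile \
  -t otel/demo:frontend \
  --push \
  .
```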
If possible, it'd be nice to have this by the 1.0 release on Friday, but nbd if you don't have the bandwidth.
@JaredTan95
Reopening this -- after merging this in, release builds fail.
@JaredTan95 could you take another look today?
https://github.com/open-telemetry/opentelemetry-demo/actions/runs/3286470674/jobs/5414695123 is a link to a failed run
I noticed the revert PR https://github.com/open-telemetry/opentelemetry-demo/pull/502. I've found the cause of the failure and will reopen the PR after fixing it.
Updates to this issue for posterity:
- I was able to fix the issues causing failed builds. These were mostly due to build contexts not being standardized across services (see the sketch after this list).
- Another persistent issue was OOM kills of the build containers. After some investigation, these seemed to be related to the amount of available memory on a GHA Runner as well as the overhead of trying to parallelize certain build steps.
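As a rough illustration of what standardizing the build contexts means (service names, paths, and tags here are hypothetical, not the exact changes made):

```sh
# Keep one consistent build context (e.g. the repository root) and point at each
# service's Dockerfile explicitly, rather than mixing per-service contexts with
# repo-root contexts.
docker buildx build -f ./src/quoteservice/Dockerfile    -t otel/demo:quoteservice    .
docker buildx build -f ./src/shippingservice/Dockerfile -t otel/demo:shippingservice .
```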
You can see a successful build here: https://github.com/open-telemetry/opentelemetry-demo/actions/runs/3313405848
However, instead of reducing build time, we've dramatically increased it. There are a few reasons for this (the first two workarounds are sketched after the list):
- Forcing 1x parallelism on Docker itself; turning this off results in OOM kills.
- Adding swap space to work around memory limitations of runners.
- Emulating arm64 on x86.
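Roughly, the first two workarounds look like this; the exact values and files in the workflow may differ, this is just a sketch:

```sh
# Cap BuildKit's internal parallelism so concurrent build steps don't exhaust
# the runner's memory (the value 1 mirrors the "1x parallelism" mentioned above).
cat > buildkitd.toml <<'EOF'
[worker.oci]
  max-parallelism = 1
EOF
docker buildx create --use --config buildkitd.toml

# Add swap space on the GitHub-hosted runner before building (size is illustrative).
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```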
In an attempt to work around this, I've considered and discarded several solutions:
- Local caching doesn't help at all since runners are ephemeral.
- Remote caching (i.e., publishing intermediate layers) would help, but not durably: any time there's a gRPC/OpenTelemetry update we'd have to do a full rebuild.
- It doesn't seem possible to build each platform on a different machine and then merge the manifests afterwards, although it feels like it should be (both of these ideas are sketched below). Either way, we only have access to x86 runners, and at best this would halve the build time, still leaving us north of 2 hours.
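For reference, here is roughly what those two ideas would look like with buildx; the registry refs and tags are hypothetical, and this is a sketch of the approach rather than anything that was merged:

```sh
# Remote layer cache: push intermediate layers to a registry and reuse them on
# later runs. This stops helping whenever a gRPC/OpenTelemetry bump invalidates
# the cached layers. (Cache ref is a made-up example.)
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --cache-from type=registry,ref=ghcr.io/example/demo-cache:frontend \
  --cache-to type=registry,ref=ghcr.io/example/demo-cache:frontend,mode=max \
  -t otel/demo:frontend --push .

# Per-architecture builds merged afterwards: if single-arch images were pushed
# separately (e.g. from different runners), they could in principle be stitched
# into one multi-arch manifest list. Tags here are hypothetical.
docker buildx imagetools create -t otel/demo:frontend \
  otel/demo:frontend-amd64 \
  otel/demo:frontend-arm64
```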
My current train of thought is to see if it's possible to simply throw more resources at the problem. I've opened https://github.com/open-telemetry/community/issues/1281 to request larger runner support added to the organization. I suspect that if we could 2x or 3x our runner size, these problems would be mitigated.
There is one other solution I have in mind: remove gRPC from the areas where it's causing problems. Payment, Quote, and Shipping seem to be the three big problem areas (especially Quote), so trimming the bloat there would probably help. Similarly, it may be worthwhile to normalize and update the gRPC libraries across services; there's a lot of outdated tooling, and newer versions may be more performant and compact.
What's the current state here, @austinlparker? I think the current build is just x86, right? Our performance is much better now.
I think this has been solved in #536. Closing for now
Are you sure, @cartersocha? The demo Docker images seem to be amd64 only: https://hub.docker.com/r/otel/demo/tags
The next tag will release multi-arch images.
Actually, we had to remove multi-arch because it takes 4 hours to build. We're still working on alternatives to reduce build time and make this feasible.
The 1.3.1 release is multi-arch