[EXPERIMENTAL] Fix and enable parallel build
NOTE: this is experimental. Do not merge yet!
Description
- Do not prune buildkit builder for parallel build
In a parallel build the buildkit builder may be in use by other targets, so we must not remove it. Besides, it is reused by all other running make targets. The only downside is that the container stays on the runner if we move to a newer version. However, commit e0726b6 - "Create linuxkit-builder manually to support parallel build" fixes this problem.
- Update linuxkit to 1.6.4, which has important fixes for cache locking
How to test and validate this PR
Just run the build; it should not hang. For more advanced testing, refer to PR https://github.com/lf-edge/eve/pull/4993
PR Backports
- 14.5-stable: To be backported.
- 13.4-stable: To be backported.
Checklist
- [x] I've provided a proper description
- [ ] I've added the proper documentation
- [ ] I've tested my PR on amd64 device
- [ ] I've tested my PR on arm64 device
- [x] I've written the test verification instructions
- [x] I've set the proper labels to this PR
And the last but not least:
- [x] I've checked the boxes above, or I've provided a good reason why I didn't check them.
Please, check the boxes above after submitting the PR in interactive mode.
/rerun red
tar: This does not look like a tar archive
tar: OVMF.fd: Not found in archive
tar: Exiting with failure status due to previous errors
make: *** [Makefile:556: /opt/actions-runner/_work/eve/eve/dist/arm64/0.0.0-pr5028-e9eeec69/installer/firmware/OVMF.fd] Error 2
make: *** Waiting for unfinished jobs....
I saw it. It looks like a cache export problem. Reported to @deitch.
The only downside is that the container stays on the runner if we move to newer version.
That should not be an issue. linuxkit itself checks if the builder version matches and, if not, removes it and starts a proper one. With the same retry logic to avoid the race condition there. That was one of the first things we got fixed.
Besides, GHA runners should not be long-lived.
tar: This does not look like a tar archive tar: OVMF.fd: Not found in archive looks like cache export problem.
Yes, it does. This could come from a number of places.
Our assumption all along was that parallel builds will have the following issues:
- Try to start/stop/remove/restart the runner, so we need to have safe retries on that. It is not writing any files, since it is going through the docker engine, so we get reasonably atomic behaviours. We just have to handle those failures sanely. That is done and reliable.
- Try to write the cache index.json in parallel. We now use fcntl on it (where available) to avoid that issue. It is reliable.
- Try to create the initial index.json in parallel. We only realized that later, but now we have a temporary shared lock file if it does not exist. This is reliable.
- Try to write blobs to the cache itself. We deemed it very unlikely that 2 parallel runs would write the same blob at the same time, as they are building different things with different content.
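For readers unfamiliar with the approach, the locking described in the list above can be sketched roughly as follows. This is a minimal illustration in Python, not linuxkit's actual implementation (which is written in Go); the function name `add_manifest`, the sidecar `index.json.lock` file, and the index layout are assumptions made for the example. It relies on the Unix-only `fcntl` module.

```python
import fcntl
import json
import os
import tempfile


def add_manifest(index_path, descriptor):
    """Append a descriptor to index.json under an exclusive advisory lock."""
    lock_path = index_path + ".lock"  # hypothetical sidecar lock file
    # O_CREAT without O_EXCL: concurrent callers all open the same lock file,
    # which also covers the case where index.json does not exist yet.
    lock_fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.lockf(lock_fd, fcntl.LOCK_EX)  # blocks until this process owns the lock
        try:
            with open(index_path) as f:
                index = json.load(f)
        except FileNotFoundError:
            # Assumed initial layout, for illustration only.
            index = {"schemaVersion": 2, "manifests": []}
        index["manifests"].append(descriptor)
        # Write to a temp file and rename, so readers never see a partial file.
        dir_name = os.path.dirname(os.path.abspath(index_path))
        with tempfile.NamedTemporaryFile("w", dir=dir_name, delete=False) as tmp:
            json.dump(index, tmp)
        os.replace(tmp.name, index_path)
    finally:
        os.close(lock_fd)  # closing the descriptor releases the lock
```

Note that these locks are advisory and per-process: they only protect against other processes that take the same lock, and they do nothing for the blob writes discussed in the last bullet.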
I think that last assumption is the one that is broken. And it is broken not because of the actual packages we build, but because of parallel jobs checking the same packages. I am not 100% sure about it, but I think so.
Assume we are running this -j=2, to keep it simple. Then we should see parallel builds of pkg/grub and pkg/uefi, for example. In theory those should write different things, whether building locally or pulling.
But when I look in the actions logs, I see for example, Building images/out/rootfs-kvm-nvidia-jp5.yml.in from images/rootfs.yml.in twice, both here and here
I see building uefi twice, both here and here. Same for grub, etc. etc.
This does not look like parallel, it looks like it is doing the same things in both threads.
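One quick way to confirm this kind of duplication is to count repeated build steps in a downloaded job log. The sketch below is only an illustration: the helper name `repeated_steps` is hypothetical, and it assumes each step shows up as a line containing the text "Building ", possibly preceded by a timestamp.

```python
from collections import Counter


def repeated_steps(log_lines, marker="Building "):
    """Return the build steps that appear more than once in a log."""
    counts = Counter(
        # Keep only the text from the marker onwards, so that differing
        # timestamps or prefixes do not hide identical steps.
        line[line.index(marker):].strip()
        for line in log_lines
        if marker in line
    )
    return {step: n for step, n in counts.items() if n > 1}
```

If the returned dictionary is non-empty for a single job, the same target was started more than once, which matches the observation above.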
We can (should?) put in work to try and have the build handle blob cache writes better - it is non-trivial, as we depend on the google go-containerregistry library, so we would need to see what support is there - but I think the faster approach is figuring out what this is doing. It does not look like make really is doing distinct things in parallel, but the same things.
More fundamentally, what tasks are we actually trying to solve in parallel? If it is just the packages build, it would be faster (and easier) to just add parallelism to that. It already supports lkt pkg build pkg/a pkg/b pkg/c, even if it does it in serial. Having a single process means we could control it enough to do it in parallel with less headache and file locks. It is not simple, but it is doable.
But I suspect that make -j is solving some other issues?
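To make the single-process idea concrete, here is a rough sketch of bounded parallelism over a package list. It is written in Python purely for illustration and does not reflect how lkt works internally; `build_all` and the `build_one` callback are hypothetical names, and the caller decides what building a package actually means.

```python
from concurrent.futures import ThreadPoolExecutor


def build_all(packages, build_one, max_workers=4):
    """Build each distinct package once, at most max_workers at a time."""
    # Drop duplicates but keep the original order, so no package is built twice.
    unique = list(dict.fromkeys(packages))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order; an exception in any build is re-raised here.
        results = list(pool.map(build_one, unique))
    return dict(zip(unique, results))
```

Because a single process owns the whole work list, it can de-duplicate targets and cap concurrency itself, instead of relying on file locks between independent make jobs.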
Build was successful @rucoder ....
@rene unfortunately this is a gamble :( We identified 2 more issues and are trying to fix them now.
@rucoder and I came up with a cleaner way to handle the cache locks. Initial tests show it works, PR is open on linuxkit. I will merge it in there in all cases, but hopefully this solves our issues here as well.
/rerun red
/rerun red
It seems to be working fine. But we now see 429: too many requests pretty often. It would be best to solve that before merging.
@rucoder, getting too many requests is an issue. Maybe if we fix the number of threads to a lower number (e.g. 4) we can still speed up the build without reaching the pull limits...
@rene we are testing a solution with docker registry proxy. It seems to be working.
I rebased my PR (https://github.com/lf-edge/eve/pull/5027 ) on it and I got the following error message:
------
> [internal] load metadata for docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f:
------
make: *** [Makefile:1070: eve-mkimage-raw-efi] Error 1
Error: error building "lfedge/eve-wwan:2e505b20dde98b305865dd3ba73172c38eee35cc": error building for arch arm64: failed to solve: lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: failed to resolve source metadata for docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: not found
2025/07/01 09:32:12 error during command execution: error building "lfedge/eve-wwan:2e505b20dde98b305865dd3ba73172c38eee35cc": error building for arch arm64: failed to solve: lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: failed to resolve source metadata for docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: not found
make: *** [Makefile:1070: eve-wwan] Error 1
Error: error building "lfedge/eve-fscrypt:848281b21913622ea5bd03713b08fed1c2646cc9": error building for arch arm64: failed to solve: lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: failed to resolve source metadata for docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: not found
2025/07/01 09:32:12 error during command execution: error building "lfedge/eve-fscrypt:848281b21913622ea5bd03713b08fed1c2646cc9": error building for arch arm64: failed to solve: lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: failed to resolve source metadata for docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: not found
make: *** [Makefile:1070: eve-fscrypt] Error 1
Error: error building "lfedge/eve-vtpm:0f53604602079b4ce319385689d7f89bf765a1d9": error building for arch arm64: failed to solve: lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: failed to resolve source metadata for docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: not found
2025/07/01 09:32:12 error during command execution: error building "lfedge/eve-vtpm:0f53604602079b4ce319385689d7f89bf765a1d9": error building for arch arm64: failed to solve: lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: failed to resolve source metadata for docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: not found
make: *** [Makefile:1070: eve-vtpm] Error 1
Error: error building "lfedge/eve-nvidia:b15ed0827d13d8f7da465fd8d9b48edc9e4a3530-nvidia-jp5": error building for arch arm64: failed to solve: lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: failed to resolve source metadata for docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: not found
------
> [internal] load metadata for docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f:
------
2025/07/01 09:32:12 error during command execution: error building "lfedge/eve-nvidia:b15ed0827d13d8f7da465fd8d9b48edc9e4a3530-nvidia-jp5": error building for arch arm64: failed to solve: lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: failed to resolve source metadata for docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: docker.io/lfedge/eve-alpine:523254c1b0728948a16a02115eb817f27c00977f: not found
make: *** [Makefile:1070: eve-nvidia] Error 1
#29 sending tarball 10.1s done
#29 DONE 12.2s
https://github.com/lf-edge/eve/actions/runs/15995470983/job/45117781031?pr=5027
@christoph-zededa, thanks for the test...
/rerun red
/rerun red
/rerun yellow