CI-Flake: bud-multiple-platform-no-run fails: error changing to intended-new-root directory
Description
This test fails with fairly high frequency, though it sometimes passes. This is affecting CI for all PRs against main and for branch-testing itself.
Steps to reproduce the issue:
- Run CI on a PR or the `main` branch; the `bud-multiple-platform-no-run` test fails (example from PR 3706)
- Re-run the task; all tests pass (sometimes)
Describe the results you received:
[+0853s] # [linux/amd64] [2/2] COMMIT
[+0853s] # Getting image source signatures
...cut...
[+0853s] # [linux/s390x] [2/2] COMMIT
[+0853s] # Getting image source signatures
...cut...
[+0853s] # [linux/ppc64le] [2/2] COMMIT
[+0853s] # Writing manifest to image destination
...cut...
[+0853s] # Writing manifest to image destination
[+0853s] # Storing signatures
[+0853s] # 6df1a5aaabbbb36097fefb40bff149c3623dc9008c8f3708fcc1da7dd564bd58
[+0853s] # --> da98c4525b6
[+0853s] # da98c4525b600114fc89b8ced5fa3c4fb7f681adc65f9cfc31dc7e67640d839c
[+0853s] # error building at STEP "COPY --from=0 /root/Dockerfile.no-run /root/": checking on sources under "/var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged": error in copier subprocess: error changing to intended-new-root directory "/var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged": chdir /var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged: no such file or directory
[+0853s] # [ rc=125 (** EXPECTED 0 **) ]
[+0853s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0853s] # #| FAIL: exit code is 125; expected 0
[+0853s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[+0853s] # /var/tmp/go/src/github.com/containers/buildah/tests
Describe the results you expected:
[+0853s] ok 282 bud-multiple-platform-no-run
Output of rpm -q buildah or apt list buildah:
Buildah compiled at runtime for PR https://github.com/containers/buildah/pull/3706
conmon-2.0.30-2.fc34-x86_64
containernetworking-plugins-1.0.1-1.fc34-x86_64
containers-common-1-21.fc34-noarch
container-selinux-2.170.0-2.fc34-noarch
crun-1.4-1.fc34-x86_64
libseccomp-2.5.3-1.fc34-x86_64
package cri-o-runc is not installed
package libseccomp2 is not installed
podman-3.4.2-1.fc34-x86_64
runc-1.0.2-2.fc34-x86_64
skopeo-1.5.2-1.fc34-x86_64
slirp4netns-1.1.12-2.fc34-x86_64
Output of buildah version:
Version: 1.24.0-dev
Go Version: go1.16.12
Image Spec: 1.0.2-dev
Runtime Spec: 1.0.2-dev
CNI Spec: 1.0.0
libcni Version: v1.0.1
image Version: 5.18.0
Git Commit: 0e6980f
Built: Wed Jan 12 10:36:14 2022
OS/Arch: linux/amd64
BuildPlatform: linux/amd64
Output of podman version if reporting a podman build issue:
N/A
Output of cat /etc/*release:
Fedora 35
Output of uname -a:
Fedora 35
Output of cat /etc/containers/storage.conf:
Defaults on Fedora 35
@edsantiago, Mr. Holmes, do you have data points that would suggest when this flake started?
My tools for analyzing buildah flakes aren't as polished as those for podman, sorry. This is the best I can do on short notice.
- Integration fedora-33 w/ overlay
- [10-20 13:49] -- #3593
- [11-05 09:05] -- #3615
- [11-15 13:31] -- #3621
- Integration fedora-34 w/ overlay
- [10-08 07:24] -- #3575
- [10-20 13:47] -- #3593
- [10-21 08:10] -- #3593
- [10-26 12:41] -- #3598
- [11-18 13:33] -- #3599
- [11-29 03:26] -- #3651
- [12-03 10:40] -- #3663
- [12-23 07:19] -- #3678
- [01-10 04:30] -- #3695
- [01-12 04:21] -- #3698
- [01-12 06:26] -- #3698
- [01-12 07:31] -- #3698
- [01-11 14:25] -- #3704
- [01-12 10:22] -- #3706
- Integration fedora-35 w/ overlay
- [11-23 07:13] -- #3641
- [11-29 19:32] -- #3657
- [12-15 06:16] -- #3674
- [12-19 06:02] -- #3676
- [12-23 07:17] -- #3678
- [12-20 05:59] -- #3681
- [01-10 10:13] -- #3696
- [01-10 12:00] -- #3696
- [01-12 04:25] -- #3698
- [01-12 06:28] -- #3698
- [01-11 14:33] -- #3702
- [01-12 10:24] -- #3706
- Integration ubuntu-2104 w/ overlay
- [11-01 14:06] -- #3609
- [11-01 23:26] -- #3609
- Integration ubuntu-2110 w/ overlay
- [11-19 11:13] -- #3638
- [11-29 16:50] -- #3656
- [12-23 07:23] -- #3678
- [01-07 14:24] -- #3691
Oh wow, this has been around for quite a while. I thought it was a new thing. Thanks for the data Ed.
Just now in Integration fedora-34 w/ overlay in #3761
Last night on 'main' in cirrus-cron - also F34 + overlay.
Seeing a persistent flake in the podman build-under-bud tests, same test, different error message:
....
# Storing signatures
# 674eef7a6d059cb2477b713f2935a25c6bcc4dc4456352b96fc2b56f93db560d
# --> dceea6611a9
# dceea6611a9eee252a60342b9ecbab67c9d4fc65180dd85a1383a9c1acb34ada
# Error: error creating build container: writing blob: adding layer with blob "sha256:25be9552a8196e51e8dbb75dae1cfe46cad31f01b456c6569017abd31ee1f9b9": ApplyLayer exit status 1 stdout: stderr: Error after fallback to chroot: no such file or directory
# [ rc=125 (** EXPECTED 0 **) ]
[bud] 248 bud-multiple-platform-no-run
- fedora-35 : bud podman fedora-35 root host
- PR #13129
- PR #12961
- PR #12913
- fedora-35 : bud remote fedora-35 root host [remote]
- PR #13129
- PR #12924
This continues to flake fairly regularly in daily branch runs. Latest example.
- [unknown image] : Containerized Integration
- PR #3782
- [unknown image] : Integration fedora-34 w/ overlay
- PR #3784
- PR #3761
- PR #3759
- PR #3751
- [unknown image] : Integration fedora-35 w/ overlay
- PR #3752
- PR #3749
- PR #3743
- [unknown image] : Integration ubuntu-2110 w/ overlay
- PR #3766
- PR #3736
- [unknown image] : Integration ubuntu-2110 w/ vfs
- PR #3757
I think there is a race in c/storage where a layer gets removed by another parallel build. We could remove these flakes if we made multi-arch builds run serially. @nalind Could we make multi-arch builds serial until the race is removed upstream? I believe one of the SetNames PRs should help, but it could be a different race as well.
diff --git a/imagebuildah/build.go b/imagebuildah/build.go
index 77d8b6d5..a71e58aa 100644
--- a/imagebuildah/build.go
+++ b/imagebuildah/build.go
@@ -243,12 +243,12 @@ func BuildDockerfiles(ctx context.Context, store storage.Store, options define.B
logPrefix = "[" + platforms.Format(platformSpec) + "] "
}
builds.Go(func() error {
+ instancesLock.Lock()
thisID, thisRef, err := buildDockerfilesOnce(ctx, store, logger, logPrefix, platformOptions, paths, files)
if err != nil {
return err
}
id, ref = thisID, thisRef
- instancesLock.Lock()
instances = append(instances, instance{
ID: thisID,
Platform: platformSpec,
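For illustration only, here is a minimal, self-contained sketch of the serialization idea in the diff above: take the shared mutex before each per-platform build inside the errgroup, so only one build touches the store at a time. The `buildOnce` function and `instance` type are placeholders rather than the real buildah code, and the `defer Unlock()` placement is an assumption, since the hunk above does not show where the unlock ends up.

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/errgroup"
)

type instance struct {
	ID       string
	Platform string
}

// buildOnce stands in for buildDockerfilesOnce: pretend to build for a
// platform and return an image ID.
func buildOnce(platform string) (string, error) {
	return "image-for-" + platform, nil
}

func main() {
	platforms := []string{"linux/amd64", "linux/arm64", "linux/s390x"}

	var (
		instancesLock sync.Mutex
		instances     []instance
		builds        errgroup.Group
	)

	for _, p := range platforms {
		p := p // capture the loop variable for the goroutine
		builds.Go(func() error {
			// Taking the lock *before* the build (as in the diff above)
			// serializes the per-platform builds, trading parallelism
			// for safety from the suspected c/storage race.
			instancesLock.Lock()
			defer instancesLock.Unlock()

			id, err := buildOnce(p)
			if err != nil {
				return err
			}
			instances = append(instances, instance{ID: id, Platform: p})
			return nil
		})
	}

	if err := builds.Wait(); err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Printf("built %d instances: %+v\n", len(instances), instances)
}
```

The obvious cost is that the platforms no longer build concurrently, which is exactly the trade-off being proposed as a stopgap.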
I'm pretty sure the race here happens when we need to cancel a build because one stage failed, but others are still running. The Executor Delete()s the affected StageExecutor, which deletes its builder, while its Execute() method is still running.
@nalind I'm not sure about that, but for me it happens even when the build was expected to pass, and in a nondeterministic manner. Some of my failures were also `image not known`, which is thrown specifically from c/storage.
I also hope the issue is at the buildah layer; then it would be much easier to get a permanent fix.
A friendly reminder that this issue had no activity for 30 days.
Very funny, github-actions. This is still an enormous annoyance.
@edsantiago Thanks for the reminder, I'll start looking at this again; hopefully I'll catch the race this time.
@nalind I think we have three different races happening here:
- (Happens in both rootless and rootful): I believe the race happens in the `pull` API, because I think `c/storage` does not map `arch` and `name` together, so a pull for one arch overrides another pull, causing `error creating build container: error locating pulled image "registry.access.redhat.com/ubi8-micro:latest" name in containers storage: registry.access.redhat.com/ubi8-micro:latest: image not known`
  This typically happens because most of our pulled-image resolution in build happens by name:
- `Build A` pulls `alpine (arm64)`
- `Build A` writes pulled image to storage.
- `Build B` pulls `alpine (amd64)`
- `Build B` writes pulled image to storage.
- `Build A` tries to access the pulled image via `LookupImage` by name, but it has been overridden by `Build B` to point to a different `arch`
Mostly I think it happens between https://github.com/containers/common/blob/main/libimage/pull.go#L54 and https://github.com/containers/common/blob/main/libimage/pull.go#L164; we can have a global runtime lock to prevent this (see the sketch below).
- (Happens in root): The underlying overlay `merged` dir is getting removed too early; I'm unable to track what in `c/storage` unmounts the layer from another build. Logs for the same: https://cirrus-ci.com/task/5715007043272704?logs=integration_test#L1199
- (Happens in rootless mostly): The layer is not written at all. Logs for the same: https://cirrus-ci.com/task/5204188027158528?logs=integration_test#L1201
All three of these race scenarios look unrelated to each other at first glance.
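As referenced in the first bullet above, here is a rough sketch of the "global runtime lock" idea: hold one lock across pull plus lookup, so a concurrent pull for a different arch cannot override the name in between. The `runtime`, `pullByName`, and `lookupByName` names are hypothetical stand-ins, not the real c/common libimage API.

```go
package main

import (
	"fmt"
	"sync"
)

type image struct {
	name string
	arch string
}

type runtime struct {
	mu    sync.Mutex // the hypothetical "global runtime lock"
	store map[string]image
}

// pullByName stands in for the pull path: it writes the pulled image into
// storage keyed only by name, which is the root of the suspected race.
func (r *runtime) pullByName(name, arch string) {
	r.store[name] = image{name: name, arch: arch}
}

func (r *runtime) lookupByName(name string) (image, bool) {
	img, ok := r.store[name]
	return img, ok
}

// pullAndFind holds the runtime lock across both steps, so another build's
// pull for a different arch cannot slip in between pull and lookup.
func (r *runtime) pullAndFind(name, arch string) (image, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	r.pullByName(name, arch)
	img, ok := r.lookupByName(name)
	if !ok || img.arch != arch {
		return image{}, fmt.Errorf("%s: image not known (or wrong arch %q)", name, img.arch)
	}
	return img, nil
}

func main() {
	rt := &runtime{store: map[string]image{}}

	var wg sync.WaitGroup
	for _, arch := range []string{"amd64", "arm64"} {
		arch := arch
		wg.Add(1)
		go func() {
			defer wg.Done()
			img, err := rt.pullAndFind("registry.access.redhat.com/ubi8-micro:latest", arch)
			fmt.Println(arch, "->", img, err)
		}()
	}
	wg.Wait()
}
```

Whether this matches where the real race lives is, of course, the open question in this thread.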
Although these races are legit, it should not flake in CI if we are already expecting things to run sequentially in CI with `--jobs=0` here: https://github.com/containers/buildah/blob/main/tests/bud.bats#L3611. The above PR should address that.
> we can have a global runtime lock to prevent this.
I don't know the code, but that sounds like it might prevent parallel pulling, no? IMHO, pulling in parallel is almost universally desirable since bandwidth is vastly cheaper than engineer-time :grinning:
@cevich We can make it granular so it only locks pulls with the same name, so it should still allow other pulls in parallel. Anyway, the problem could be entirely different.
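Purely as a sketch of what that granular version could look like (the `nameLocks` helper is hypothetical, not existing c/common code): one mutex per image name, so two pulls of the same name serialize while pulls of unrelated names still run in parallel.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nameLocks hands out one mutex per image name.
type nameLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lockFor returns the mutex for a given image name, creating it on first use.
func (n *nameLocks) lockFor(name string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.locks == nil {
		n.locks = map[string]*sync.Mutex{}
	}
	if _, ok := n.locks[name]; !ok {
		n.locks[name] = &sync.Mutex{}
	}
	return n.locks[name]
}

func main() {
	var nl nameLocks
	var wg sync.WaitGroup

	pull := func(name, arch string) {
		defer wg.Done()
		m := nl.lockFor(name)
		m.Lock()
		defer m.Unlock()
		// Critical section: pull + lookup for this name run without
		// interference from another pull of the same name.
		fmt.Printf("pulling %s for %s\n", name, arch)
		time.Sleep(10 * time.Millisecond)
	}

	wg.Add(3)
	go pull("alpine", "amd64")
	go pull("alpine", "arm64")     // serialized with the pull above
	go pull("ubi8-micro", "amd64") // unrelated name, runs in parallel
	wg.Wait()
}
```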
@edsantiago @cevich I have a small question: have we had this flake since the inception of bud-multiple-platform-no-run, or was it introduced somewhere along the way? The answer would help me diagnose whether a commit in c/storage, c/image, or buildah introduced these race conditions later.
The first instance I see was 2021-10-08 but unfortunately those logs are gone.
This has been a problem for a very long time. I seem to remember Miloslav or Nalin mentioning the underlying name/pull race problem being in c/storage or c/image. At the time we just ignored it assuming it would be rare :grinning: It's worth talking to them about it, I think they know right where the issue is.
> @nalind I think we have three different races happening here:
>
> - **(Happens in both rootless and rootful):** I believe the race happens in the `pull` API, because I think `c/storage` does not map `arch` and `name` together, so a pull for one arch overrides another pull, causing `error creating build container: error locating pulled image "registry.access.redhat.com/ubi8-micro:latest" name in containers storage: registry.access.redhat.com/ubi8-micro:latest: image not known`
~~We fixed this exact problem in https://github.com/containers/common/pull/688, but the helper function it introduced is no longer being used, so it's possible we broke it since then.~~ [godoc mixup]
I'm not surprised; from what I remember from discussions before, there was some unholy (re-implementation level) hell :imp: somewhere in c/storage or c/image that nobody wanted to touch. So we collectively stuck our heads in the sand. I wouldn't doubt for a second that's what's biting us now :confused: I bet @mtrmac remembers.
The two hairy issues I have identified earlier are https://github.com/containers/storage/issues/1147 (with two proposed fixes) and https://github.com/containers/storage/issues/1136 (with no fix at all).
Those might be related, but I don’t know enough to be able to identify them from logs as definite causes. Note that both should only happen on some kind of failure or forced abort (e.g. a timeout + SIGKILL), not just races during ordinary operation.
During ordinary operation, layerStore is supposed to always be locked before looking up a layer by ID, continuing for the duration of any use of that layer — so it should, AFAIK, never be possible for any external c/storage user to refer to deleted or inconsistent overlayfs content: either the layer exists by the time the lock is obtained, and it should be complete and consistent, or it doesn't exist, and the code should fail with "layer not found".
(“Layer not found” races would not surprise me at all, for code paths that leave the c/storage lock scope but assume the storage is substantially unchanged — e.g. an ordinary pull might find a base layer but then fail to apply a child layer if the base layer has been removed in the meantime. But again, that’s not what these logs show.)
So either it’s one of these unclean abort / inconsistency situations, or perhaps there might be a code path that should be locking the layerStore but isn’t doing so. I did check for those situations some time ago, and I couldn’t find anything at least on the major code paths involved in pulls/deletes, but I could well have missed something, e.g. because I was looking at the pull/run paths, not at anything involved in builds.
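For anyone less familiar with the discipline being described, here is a generic, simplified sketch (not the real c/storage layerStore API) of the invariant: the store lock is taken before a layer is looked up by ID and held for as long as the layer is used, so a caller either sees a complete layer or gets a clean "layer not found", never a half-removed merged directory.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errLayerUnknown = errors.New("layer not found")

type layer struct {
	ID        string
	MergedDir string
}

type layerStore struct {
	mu     sync.RWMutex
	layers map[string]layer
}

// withLayer runs fn while holding the store lock, so the layer cannot be
// deleted (and its merged dir removed) while fn is still using it.
func (s *layerStore) withLayer(id string, fn func(layer) error) error {
	s.mu.RLock()
	defer s.mu.RUnlock()
	l, ok := s.layers[id]
	if !ok {
		return fmt.Errorf("%s: %w", id, errLayerUnknown)
	}
	return fn(l)
}

// deleteLayer removes a layer; it cannot run while any withLayer call is active.
func (s *layerStore) deleteLayer(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.layers, id)
}

func main() {
	s := &layerStore{layers: map[string]layer{
		"c0299fe974cc": {ID: "c0299fe974cc", MergedDir: "/some/overlay/c0299fe974cc/merged"},
	}}

	err := s.withLayer("c0299fe974cc", func(l layer) error {
		fmt.Println("copying from", l.MergedDir) // safe: deleteLayer is blocked
		return nil
	})
	fmt.Println("first use:", err)

	s.deleteLayer("c0299fe974cc")
	err = s.withLayer("c0299fe974cc", func(l layer) error { return nil })
	fmt.Println("after delete:", err) // clean "layer not found", not a chdir ENOENT
}
```

Under that invariant, the `chdir ... no such file or directory` failure in the original report would point at either an unclean abort or a code path that skips the lock, which is exactly the distinction drawn above.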
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
This issue is still flaking frequently. Recent Example in podman CI.
A friendly reminder that this issue had no activity for 30 days.
Dum de dum, still happening