
CI-Flake: bud-multiple-platform-no-run fails: error changing to intended-new-root directory

Open cevich opened this issue 3 years ago • 38 comments

Description

This test fails with fairly high frequency, but sometimes it passes. It is affecting CI for all PRs, for main, and for branch-testing itself.

Steps to reproduce the issue:

  1. Run CI on a PR or main branch
  2. bud-multiple-platform-no-run test fails (example from PR 3706)
  3. Re-run the task, all tests pass (sometimes)

Describe the results you received:


[+0853s] # [linux/amd64] [2/2] COMMIT
[+0853s] # Getting image source signatures
...cut...
[+0853s] # [linux/s390x] [2/2] COMMIT
[+0853s] # Getting image source signatures
...cut...
[+0853s] # [linux/ppc64le] [2/2] COMMIT
[+0853s] # Writing manifest to image destination
...cut...
[+0853s] # Writing manifest to image destination
[+0853s] # Storing signatures
[+0853s] # 6df1a5aaabbbb36097fefb40bff149c3623dc9008c8f3708fcc1da7dd564bd58
[+0853s] # --> da98c4525b6
[+0853s] # da98c4525b600114fc89b8ced5fa3c4fb7f681adc65f9cfc31dc7e67640d839c
[+0853s] # error building at STEP "COPY --from=0 /root/Dockerfile.no-run /root/": checking on sources under "/var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged": error in copier subprocess: error changing to intended-new-root directory "/var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged": chdir /var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged: no such file or directory
[+0853s] # [ rc=125 (** EXPECTED 0 **) ]
[+0853s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0853s] # #| FAIL: exit code is 125; expected 0
[+0853s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[+0853s] # /var/tmp/go/src/github.com/containers/buildah/tests

Describe the results you expected:

[+0853s] ok 282 bud-multiple-platform-no-run

Output of rpm -q buildah or apt list buildah:

Buildah compiled at runtime for PR https://github.com/containers/buildah/pull/3706

conmon-2.0.30-2.fc34-x86_64
containernetworking-plugins-1.0.1-1.fc34-x86_64
containers-common-1-21.fc34-noarch
container-selinux-2.170.0-2.fc34-noarch
crun-1.4-1.fc34-x86_64
libseccomp-2.5.3-1.fc34-x86_64
package cri-o-runc is not installed
package libseccomp2 is not installed
podman-3.4.2-1.fc34-x86_64
runc-1.0.2-2.fc34-x86_64
skopeo-1.5.2-1.fc34-x86_64
slirp4netns-1.1.12-2.fc34-x86_64

Output of buildah version:

Version:         1.24.0-dev
Go Version:      go1.16.12
Image Spec:      1.0.2-dev
Runtime Spec:    1.0.2-dev
CNI Spec:        1.0.0
libcni Version:  v1.0.1
image Version:   5.18.0
Git Commit:      0e6980f
Built:           Wed Jan 12 10:36:14 2022
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

Output of podman version if reporting a podman build issue:

N/A

Output of cat /etc/*release:

Fedora 35

Output of uname -a:

Fedora 35

Output of cat /etc/containers/storage.conf:

Defaults on Fedora 35

cevich avatar Jan 12 '22 19:01 cevich

@edsantiago, Mr. Holmes, do you have data points that would suggest when the flake started?

vrothberg avatar Jan 13 '22 08:01 vrothberg

My tools for analyzing buildah flakes aren't as polished as those for podman, sorry. This is the best I can do on short notice.

edsantiago avatar Jan 13 '22 12:01 edsantiago

Oh wow, this has been around for quite a while. I thought it was a new thing. Thanks for the data Ed.

cevich avatar Jan 13 '22 15:01 cevich

Just now in Integration fedora-34 w/ overlay in #3761

edsantiago avatar Feb 01 '22 14:02 edsantiago

Last night on 'main' in cirrus-cron - also F34 + overlay.

cevich avatar Feb 01 '22 15:02 cevich

Seeing a persistent flake in the podman build-under-bud tests, same test, different error message:

....
         # Storing signatures
         # 674eef7a6d059cb2477b713f2935a25c6bcc4dc4456352b96fc2b56f93db560d
         # --> dceea6611a9
         # dceea6611a9eee252a60342b9ecbab67c9d4fc65180dd85a1383a9c1acb34ada
         # Error: error creating build container: writing blob: adding layer with blob "sha256:25be9552a8196e51e8dbb75dae1cfe46cad31f01b456c6569017abd31ee1f9b9": ApplyLayer exit status 1 stdout:  stderr: Error after fallback to chroot: no such file or directory
         # [ rc=125 (** EXPECTED 0 **) ]

[bud] 248 bud-multiple-platform-no-run

edsantiago avatar Feb 09 '22 18:02 edsantiago

This continues to flake fairly regularly in daily branch runs. Latest example.

cevich avatar Feb 17 '22 21:02 cevich

edsantiago avatar Feb 21 '22 17:02 edsantiago

I think there is a race in c/storage where a layer gets removed by another parallel build. We could remove the flakes if we made multi-arch builds run serially. @nalind, could we make multi-arch builds serial until the race is removed upstream? I believe one of the SetNames PRs should help, but it could be a different race as well.

diff --git a/imagebuildah/build.go b/imagebuildah/build.go
index 77d8b6d5..a71e58aa 100644
--- a/imagebuildah/build.go
+++ b/imagebuildah/build.go
@@ -243,12 +243,12 @@ func BuildDockerfiles(ctx context.Context, store storage.Store, options define.B
                        logPrefix = "[" + platforms.Format(platformSpec) + "] "
                }
                builds.Go(func() error {
+                       instancesLock.Lock()
                        thisID, thisRef, err := buildDockerfilesOnce(ctx, store, logger, logPrefix, platformOptions, paths, files)
                        if err != nil {
                                return err
                        }
                        id, ref = thisID, thisRef
-                       instancesLock.Lock()
                        instances = append(instances, instance{
                                ID:       thisID,
                                Platform: platformSpec,
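
As a side note on this sketch: holding instancesLock across the buildDockerfilesOnce call means the per-platform builds run one at a time rather than concurrently, i.e. it trades away multi-platform build parallelism to sidestep the suspected c/storage race instead of fixing it.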

flouthoc avatar Feb 23 '22 17:02 flouthoc

I'm pretty sure the race here happens when we need to cancel a build because one stage failed, but others are still running. The Executor Delete()s the affected StageExecutor, which deletes its builder, while its Execute() method is still running.
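
For illustration, a toy schematic of that kind of race (hypothetical types, not the actual buildah Executor/StageExecutor code): one goroutine is still executing a stage that reads from the builder's mount point while a cleanup path, triggered by another stage's failure, deletes the builder out from under it. Depending on scheduling it either succeeds or fails with "no such file or directory", which matches the on-and-off behaviour of the flake.

package main

import (
	"errors"
	"fmt"
	"os"
	"sync"
)

// builder is a toy stand-in for a stage's builder; mountPoint plays the
// role of the overlay "merged" directory.
type builder struct {
	mountPoint string
}

func (b *builder) delete() { os.RemoveAll(b.mountPoint) }

type stageExecutor struct {
	b *builder
}

// execute stands in for a COPY --from=0 step that reads from the mount point.
func (s *stageExecutor) execute() error {
	if _, err := os.Stat(s.b.mountPoint); err != nil {
		return fmt.Errorf("chdir %s: %w", s.b.mountPoint, err)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "merged-")
	if err != nil {
		panic(err)
	}
	s := &stageExecutor{b: &builder{mountPoint: dir}}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		// Another platform's build failed, so the cleanup path deletes
		// all builders, including the one still in use below.
		defer wg.Done()
		s.b.delete()
	}()
	go func() {
		defer wg.Done()
		err := s.execute()
		switch {
		case err == nil:
			fmt.Println("stage succeeded (the usual case)")
		case errors.Is(err, os.ErrNotExist):
			fmt.Println("no such file or directory (the flake)")
		}
	}()
	wg.Wait()
}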

nalind avatar Feb 23 '22 17:02 nalind

@nalind I'm not sure about that; for me it happens even when the build was expected to pass, and in an unpredictable manner. Some of my failures were also image not known, which is thrown specifically from c/storage.

I also hope the issue is at the buildah layer; that would make it much easier to get a permanent fix.

flouthoc avatar Feb 23 '22 17:02 flouthoc

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Mar 26 '22 00:03 github-actions[bot]

Very funny, github-actions. This is still an enormous annoyance.

edsantiago avatar Mar 26 '22 01:03 edsantiago

@edsantiago Thanks for the reminder. I'll start looking at this again, hopefully catching the race this time.

flouthoc avatar Mar 28 '22 10:03 flouthoc

@nalind I think we have three different races happening here:

  • (Happens in both rootless and rootful): I believe the race happens in the pull API, because c/storage does not map arch and name together, so a pull for one arch overrides another pull, causing error creating build container: error locating pulled image "registry.access.redhat.com/ubi8-micro:latest" name in containers storage: registry.access.redhat.com/ubi8-micro:latest: image not known

This typically happens because, in build, most pulled-image resolution happens by name:

- `Build A` pulls `alpine (arm64)`.
- `Build A` writes the pulled image to storage.
- `Build B` pulls `alpine (amd64)`.
- `Build B` writes the pulled image to storage.
- `Build A` tries to access the pulled image via `LookupImage` by name, but it has been overridden by `Build B` to point to a different `arch`.

I think it mostly happens between https://github.com/containers/common/blob/main/libimage/pull.go#L54 and https://github.com/containers/common/blob/main/libimage/pull.go#L164; we could take a global runtime lock to prevent this (a minimal sketch of such a lock, keyed by image name, follows after this list).

  • (Happens in root): The underlying overlay merged dir gets removed too early; I'm unable to track down what in c/storage unmounts the layer from another build. Logs: https://cirrus-ci.com/task/5715007043272704?logs=integration_test#L1199

  • (Happens mostly in rootless): The layer is not written at all. Logs: https://cirrus-ci.com/task/5204188027158528?logs=integration_test#L1201

All three race scenarios look unrelated to each other at first glance.
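
A rough sketch of that locking idea, keyed by image name so unrelated pulls still run in parallel (hypothetical helper, not the actual c/common libimage API): parallel builds pulling the same image name for different architectures serialize on the name, so the pull-then-lookup window of one build can no longer be invalidated by the other.

package main

import (
	"fmt"
	"sync"
)

// nameLocks hands out one mutex per image name (hypothetical helper).
type nameLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (n *nameLocks) forName(name string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.locks == nil {
		n.locks = map[string]*sync.Mutex{}
	}
	if _, ok := n.locks[name]; !ok {
		n.locks[name] = &sync.Mutex{}
	}
	return n.locks[name]
}

var pullLocks nameLocks

// pullAndResolve stands in for the window between writing a pulled image to
// the store and resolving it again by name (libimage/pull.go#L54..#L164).
func pullAndResolve(name, arch string) {
	l := pullLocks.forName(name)
	l.Lock()
	defer l.Unlock()
	// Pull `name` for `arch`, write it to the store, then look it up by name.
	// While we hold the lock, no other pull of the same name can retag it.
	fmt.Printf("pulled and resolved %s for %s\n", name, arch)
}

func main() {
	var wg sync.WaitGroup
	for _, arch := range []string{"amd64", "arm64", "s390x", "ppc64le"} {
		arch := arch // capture loop variable (pre-Go 1.22)
		wg.Add(1)
		go func() {
			defer wg.Done()
			pullAndResolve("registry.access.redhat.com/ubi8-micro:latest", arch)
		}()
	}
	wg.Wait()
}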

flouthoc avatar Mar 29 '22 09:03 flouthoc

Although these races are legitimate, this should not flake in CI, since we already expect things to run sequentially there with --jobs=0, here: https://github.com/containers/buildah/blob/main/tests/bud.bats#L3611. The PR above should address that.

flouthoc avatar Mar 29 '22 09:03 flouthoc

we can have a global runtime lock to prevent this.

I don't know the code, but that sounds like it might prevent parallel pulling, no? IMHO, pulling in parallel is almost universally desirable since bandwidth is vastly cheaper than engineer-time :grinning:

cevich avatar Mar 29 '22 14:03 cevich

@cevich We can make it granular so that it only locks pulls of the same name; other pulls would still run in parallel. Anyway, the problem could be entirely different.

@edsantiago @cevich I have a small question: have we had this flake since the inception of bud-multiple-platform-no-run, or was it introduced somewhere along the way? The answer would help me diagnose whether a later commit in c/storage, c/image, or buildah introduced these race conditions.

flouthoc avatar Mar 30 '22 09:03 flouthoc

The first instance I see is from 2021-10-08, but unfortunately those logs are gone.

edsantiago avatar Mar 30 '22 11:03 edsantiago

This has been a problem for a very long time. I seem to remember Miloslav or Nalin mentioning that the underlying name/pull race is in c/storage or c/image. At the time we just ignored it, assuming it would be rare :grinning: It's worth talking to them about it; I think they know right where the issue is.

cevich avatar Mar 30 '22 20:03 cevich

@nalind I think we have three different races happening here:

* **(Happens in both rootless and rootful):** I believe race happens on `pull` API cause I think `c/storage` does not maps `arch` and `name` together so pull from one arch overrides another pull causing `error creating build container: error locating pulled image "registry.access.redhat.com/ubi8-micro:latest" name in containers storage: registry.access.redhat.com/ubi8-micro:latest: image not known `

~~We fixed this exact problem in https://github.com/containers/common/pull/688, but the helper function it introduced is no longer being used, so it's possible we broke it since then.~~ [godoc mixup]

nalind avatar Mar 30 '22 20:03 nalind

I'm not surprised. From what I remember of earlier discussions, there was some unholy (re-implementation level) hell :imp: somewhere in c/storage or c/image that nobody wanted to touch, so we collectively stuck our heads in the sand. I wouldn't doubt for a second that that's what's biting us now :confused: I bet @mtrmac remembers.

cevich avatar Mar 30 '22 21:03 cevich

The two hairy issues I identified earlier are https://github.com/containers/storage/issues/1147 (with two proposed fixes) and https://github.com/containers/storage/issues/1136 (with no fix at all).

Those might be related, but I don’t know enough to be able to identify them from logs as definite causes. Note that both should only happen on some kind of failure or forced abort (e.g. a timeout + SIGKILL), not just races during ordinary operation.

During ordinary operation, layerStore is supposed to always be locked before looking up a layer by ID, and to stay locked for the duration of any use of that layer, so it should, AFAIK, never be possible for any external c/storage user to refer to deleted or inconsistent overlayfs content: either the layer exists by the time the lock is obtained, and it should be complete and consistent, or it doesn’t exist, and the code should fail with “layer not found”.

(“Layer not found” races would not surprise me at all, for code paths that leave the c/storage lock scope but assume the storage is substantially unchanged — e.g. an ordinary pull might find a base layer but then fail to apply a child layer if the base layer has been removed in the meantime. But again, that’s not what these logs show.)

So either it’s one of these unclean abort / inconsistency situations, or perhaps there is a code path that should be locking the layerStore but isn’t doing so. I did check for that some time ago and couldn’t find anything, at least on the major code paths involved in pulls/deletes, but I could well have missed something, e.g. because I was looking at the pull/run paths, not at anything involved in builds.
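
For reference, a toy sketch of that locking discipline (hypothetical names, with an in-process mutex standing in for c/storage's cross-process file locking): the lock is taken before the lookup by ID and held for the duration of the use, so a concurrent delete either happens first (the caller gets "layer not found") or has to wait until the use is finished.

package main

import (
	"errors"
	"fmt"
	"sync"
)

var errLayerNotFound = errors.New("layer not found")

type layer struct {
	id     string
	merged string // path to the overlay "merged" directory
}

// layerStore is a toy stand-in for the real c/storage layer store; the real
// one uses file locks shared across processes, not an in-process mutex.
type layerStore struct {
	mu     sync.RWMutex
	layers map[string]*layer
}

// useLayer looks up a layer by ID and keeps the lock held while fn uses it,
// so a concurrent delete either happens before the lookup (we fail cleanly
// with "layer not found") or has to wait until fn has finished reading from
// the layer's merged directory.
func (s *layerStore) useLayer(id string, fn func(*layer) error) error {
	s.mu.RLock()
	defer s.mu.RUnlock()
	l, ok := s.layers[id]
	if !ok {
		return errLayerNotFound
	}
	return fn(l)
}

func (s *layerStore) deleteLayer(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.layers, id)
	// the on-disk overlay content would be removed here, still under the lock
}

func main() {
	s := &layerStore{layers: map[string]*layer{
		"c0299fe9": {id: "c0299fe9", merged: "/var/lib/containers/.../merged"},
	}}
	err := s.useLayer("c0299fe9", func(l *layer) error {
		fmt.Println("copying sources from", l.merged)
		return nil
	})
	fmt.Println("err:", err)
}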

mtrmac avatar Mar 31 '22 12:03 mtrmac

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar May 01 '22 00:05 github-actions[bot]

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Jun 01 '22 00:06 github-actions[bot]

This issue is still flaking frequently. Recent example in podman CI.

cevich avatar Jun 07 '22 14:06 cevich

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Jul 11 '22 00:07 github-actions[bot]

Dum de dum, still happening

edsantiago avatar Jul 21 '22 14:07 edsantiago