
'Layer not known' error on parallel image removal

Open · saschagrunert opened this issue on Jun 21 '19 · 21 comments

Relates to kubernetes-sigs/cri-tools#472

We discovered some flakiness when running critest in parallel. It seems that an image we want to remove has already been removed by a parallel test, which should not be possible at all.

saschagrunert · Jun 21 '19 13:06

I'm not sure if anyone is already working on this.

/cc @nalind

saschagrunert · Jun 21 '19 13:06

I don't think anyone is working on this. Is the issue that an image was removed while it still had a container based on it? Or could it just be that multiple threads are trying to remove the same image?

rhatdan · Jun 21 '19 18:06

I think it is the second one.

I will check in the coming days whether I can reproduce the issue within a single test.

saschagrunert · Jun 21 '19 19:06

I concur with @saschagrunert. Multiple processes were removing the same image in parallel. My suspicion (I did not look into the code yet) is that the second process blocks on the layer lock and then, once the lock is acquired, tries to delete a layer that no longer exists.

@saschagrunert, could you provide instructions to reproduce? I guess we need to add tests here in c/storage to make sure we don't regress.

vrothberg · Jun 24 '19 07:06
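
To illustrate the suspected sequence, here is a minimal, self-contained sketch (the type and error below are hypothetical stand-ins, not the actual c/storage internals): the second deleter blocks on the lock, and by the time it acquires it the layer is already gone, so the existence check under the lock returns the "layer not known" error.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errLayerUnknown mirrors the kind of error this issue is about; the real
// sentinel lives in containers/storage.
var errLayerUnknown = errors.New("layer not known")

// layerStore is a hypothetical stand-in for the shared layer store.
type layerStore struct {
	mu     sync.Mutex
	layers map[string]struct{}
}

// delete re-checks existence after the lock is held: a deleter that was
// blocked behind another deleter then reports "layer not known" instead of
// acting on state it observed before it started waiting.
func (s *layerStore) delete(id string) error {
	s.mu.Lock() // a parallel deleter may remove the layer while we wait here
	defer s.mu.Unlock()

	if _, ok := s.layers[id]; !ok {
		return errLayerUnknown
	}
	delete(s.layers, id)
	return nil
}

func main() {
	s := &layerStore{layers: map[string]struct{}{"layer1": {}}}
	fmt.Println(s.delete("layer1")) // <nil>
	fmt.Println(s.delete("layer1")) // layer not known
}
```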

> @saschagrunert, could you provide instructions to reproduce? I guess we need to add tests here in c/storage to make sure we don't regress.

This is going to be fun: I can reproduce the issue on the current CRI-O master via:

```bash
sed -i 's;flakeAttempts=3;flakeAttempts=0;g' test/critest.bats
sudo make clean
make $(pwd)/build/bin/ginkgo
sudo podman run --privileged -e STORAGE_OPTIONS="--storage-driver=vfs" -e TRAVIS=true \
    -e RUN_CRITEST=true -v $(pwd):/go/src/github.com/cri-o/cri-o \
    -w /go/src/github.com/cri-o/cri-o \
    -v $(pwd)/build/bin/ginkgo:/usr/bin/ginkgo -it saschagrunert/criocircle \
    bash -c 'make all test-binaries && test/test_runner.sh critest.bats'
```

At least this should trigger a failing test in some cases:

```
# Summarizing 2 Failures:
#
# [Fail] [k8s.io] Image Manager [AfterEach] public image without tag should be pulled and removed [Conformance]
# /go/src/github.com/kubernetes-sigs/cri-tools/pkg/framework/util.go:310
#
# [Fail] [k8s.io] Image Manager [AfterEach] listImage should get exactly 3 repoTags in the result image [Conformance]
# /go/src/github.com/kubernetes-sigs/cri-tools/pkg/framework/util.go:294
#
# Ran 66 of 73 Specs in 47.746 seconds
# FAIL! -- 64 Passed | 2 Failed | 0 Pending | 7 Skipped
```

saschagrunert · Jun 24 '19 09:06

I guess we should ignore the error if the image does not exist?

rhatdan · Jun 24 '19 10:06

I think that depends on the code path, but for deletion it makes sense to ignore "layer not known" errors.

vrothberg · Jun 24 '19 11:06
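
For the deletion path, a minimal sketch of what "ignore it" could look like on the caller's side, assuming the error can be matched against the exported storage.ErrImageUnknown / storage.ErrLayerUnknown sentinels (how reliably errors.Is matches depends on how the c/storage version in use wraps its errors):

```go
package imagecleanup

import (
	"errors"
	"fmt"

	"github.com/containers/storage"
)

// removeImageIgnoringMissing deletes an image but treats "already gone" as
// success, which is what a caller racing with parallel removals wants.
// Older c/storage releases may need explicit error comparison instead of
// errors.Is, depending on how they wrap their errors.
func removeImageIgnoringMissing(store storage.Store, imageID string) error {
	_, err := store.DeleteImage(imageID, true)
	if err == nil ||
		errors.Is(err, storage.ErrImageUnknown) ||
		errors.Is(err, storage.ErrLayerUnknown) {
		return nil
	}
	return fmt.Errorf("removing image %s: %w", imageID, err)
}
```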

I'll spin up a PR.

vrothberg · Jun 24 '19 11:06

https://github.com/containers/storage/pull/374 could do the trick. I am still looking into how this can happen. The locks should protect against that case.

vrothberg · Jun 24 '19 12:06

The good news is that store.DeleteImage is fine. The issue comes from a TOCTOU (time-of-check to time-of-use) race triggered by multiple processes pulling and deleting simultaneously (see https://github.com/kubernetes-sigs/cri-tools/issues/472#issue-455696700).

Process P1 tries to pull image bar. P1 checks if bar's blobs are already in the storage and tries to reuse them later when committing the image. At the time of committing the image, another process P2 has already deleted the layers and the commit fails with Error committing the finished image: error locating layer for blob "sha256:65fc92611f38c5c3e31d5c0faec078c60f41103f2ad4a3700fff122c49aab358": layer not known.

I don't think there's a real solution to that, as the storage is always subject to such races for resources.

vrothberg · Jun 24 '19 13:06
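
A toy, self-contained illustration of that window (the names and the digest are made up; this is not the real pull/commit code): every individual operation is locked, yet the gap between the "blob is present" check and the later commit is exactly where a concurrent delete slips in.

```go
package main

import (
	"fmt"
	"sync"
)

// blobStore is a toy stand-in for the shared layer/blob storage; every
// individual operation takes the lock, yet the check-then-commit gap remains.
type blobStore struct {
	mu    sync.Mutex
	blobs map[string]bool
}

func (s *blobStore) has(d string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.blobs[d]
}

func (s *blobStore) deleteBlob(d string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.blobs, d)
}

// commit fails if a previously "seen" blob is gone, mirroring
// "error locating layer for blob ...: layer not known".
func (s *blobStore) commit(d string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.blobs[d] {
		return fmt.Errorf("locating layer for blob %q: layer not known", d)
	}
	return nil
}

func main() {
	const digest = "sha256:example"
	s := &blobStore{blobs: map[string]bool{digest: true}}

	// P1 (pull): sees the blob and decides to reuse it instead of re-pulling.
	if s.has(digest) {
		// P2 (delete): removes the image and its layers in the meantime.
		s.deleteBlob(digest)

		// P1 (pull): commits later and only now notices the layer is gone.
		fmt.Println(s.commit(digest))
	}
}
```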

IMO, based on that, errors like this are better handled in CRI-O. So long as c/storage returns a consistent error in cases like this, we can catch and discard it in the caller, while preserving it in c/storage for cases where we do care about it.

mheon · Jun 24 '19 13:06

The problem is that the error does not occur during deletion but during pull, where we cannot discard the error as callers would be tricked into believing the image has been pulled successfully.

vrothberg · Jun 24 '19 13:06

> Process P1 tries to pull image bar. P1 checks if bar's blobs are already in the storage and tries to reuse them later when committing the image. At the time of committing the image, another process P2 has already deleted the layers and the commit fails with Error committing the finished image: error locating layer for blob "sha256:65fc92611f38c5c3e31d5c0faec078c60f41103f2ad4a3700fff122c49aab358": layer not known.

Naively, in the ideal case I’d expect P1 to hold a “handle” / counted reference / something to the in-progress layer sequence, so that it is considered “busy” and P2 won’t be able to remove it. That “handle” could be a native c/storage object, or, maybe (very kludgy), an invisible image with a random name that points to the in-progress layer sequence as its TopLayer.

The only difficulty is with automatically dropping this “handle” if P1 dies unexpectedly.


Sure, CRI-O could catch this error and automatically retry the pull, but that's not going to be able to accurately detect this situation (vs. real bugs, for example), and it's not obvious to me how CRI-O could guarantee forward progress.

mtrmac · Jun 24 '19 14:06
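
A rough, in-process-only sketch of the "handle" / counted-reference idea (entirely hypothetical; no such API exists in c/storage): pulls take a reference on the layers they intend to reuse, and deletion refuses to touch a layer whose count is non-zero. As noted, this says nothing about dropping the handle when the holder dies, and it does not work across processes.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errLayerBusy = errors.New("layer is in use by an in-progress pull")

// layerRefs is a hypothetical in-memory "handle" registry, not a c/storage API.
type layerRefs struct {
	mu   sync.Mutex
	refs map[string]int
}

// acquire marks a layer as busy for the duration of a pull and returns a
// release callback the puller is expected to defer.
func (r *layerRefs) acquire(id string) func() {
	r.mu.Lock()
	r.refs[id]++
	r.mu.Unlock()
	return func() {
		r.mu.Lock()
		r.refs[id]--
		r.mu.Unlock()
	}
}

// allowDelete rejects deletion while any pull still holds a handle on the layer.
func (r *layerRefs) allowDelete(id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs[id] > 0 {
		return errLayerBusy
	}
	return nil
}

func main() {
	r := &layerRefs{refs: map[string]int{}}

	release := r.acquire("sha256:example")       // P1 starts reusing the layer
	fmt.Println(r.allowDelete("sha256:example")) // P2: layer is in use ...
	release()                                    // P1 committed the image
	fmt.Println(r.allowDelete("sha256:example")) // P2: <nil>, delete may proceed
}
```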

> The only difficulty is with automatically dropping this “handle” if P1 dies unexpectedly.

I thought about repurposing the digest locks from https://github.com/containers/image/pull/611, but I believe that would just move the TOCTOU problem to another layer: between a `pull A; run A`, the image could again be deleted.

vrothberg · Jun 24 '19 15:06

Isn’t that something the Kubelet is fully in control of, though? (Maybe not, I have never read all of it.)

The Kubelet could, at least in principle, if it is not doing so already, track that an image is being pulled to be run, and just not issue a “delete A” CRI request in the meantime.

The Kubelet can’t quite do this for individual layers because it’s never told about them, or at least not during the pull process.

mtrmac · Jun 24 '19 16:06

I really don't know much about the Kubelet, so I am not sure whether that's in its scope or not.

> The Kubelet could, at least in principle, if it is not doing so already, track that an image is being pulled to be run, and just not issue a “delete A” CRI request in the meantime.

An in-memory implementation sounds feasible in c/storage and comparatively quick to implement (please don't hold me to that).

But ideally, it should work across processes, which reminds me of https://github.com/containers/image/pull/611. When pulling blobs, they could be marked as "being pulled" by taking digest locks (in a separate directory) and releasing them when the image is committed. When deleting an image, a layer could only be deleted if the lock isn't held by somebody else, presumably via the same tryAcquire/sleep/signal trick as in https://github.com/containers/image/pull/611. But as with 611, I think it's a lot of complexity for little gain.

vrothberg · Jun 24 '19 16:06
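
One hedged sketch of the cross-process variant described above, using per-digest lock files and flock(2) (the directory layout and helper names are made up for illustration): the kernel drops an flock automatically when its holder dies, which addresses the orphaned-handle concern, though it does not remove the complexity being discussed. Pullers take a shared lock on the digest; deletion only proceeds if it can take the exclusive lock without blocking.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"

	"golang.org/x/sys/unix"
)

// lockPath maps a digest to a lock file inside lockDir (illustrative layout).
func lockPath(lockDir, digest string) string {
	return filepath.Join(lockDir, strings.ReplaceAll(digest, ":", "-")+".lock")
}

// lockDigest takes a shared flock on the digest's lock file for the duration
// of a pull that reuses the blob; the kernel releases it if the process dies.
func lockDigest(lockDir, digest string) (func(), error) {
	f, err := os.OpenFile(lockPath(lockDir, digest), os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	// Shared lock: several pullers may hold the same digest concurrently.
	if err := unix.Flock(int(f.Fd()), unix.LOCK_SH); err != nil {
		f.Close()
		return nil, err
	}
	return func() {
		unix.Flock(int(f.Fd()), unix.LOCK_UN)
		f.Close()
	}, nil
}

// tryDeleteDigest only proceeds when it can take the exclusive lock without
// blocking, i.e. no pull currently holds a handle on the digest.
func tryDeleteDigest(lockDir, digest string) (bool, error) {
	f, err := os.OpenFile(lockPath(lockDir, digest), os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return false, err
	}
	defer f.Close()
	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX|unix.LOCK_NB); err != nil {
		return false, nil // somebody is pulling; skip the delete for now
	}
	defer unix.Flock(int(f.Fd()), unix.LOCK_UN)
	// ... the actual layer deletion would happen here ...
	return true, nil
}

func main() {
	dir, err := os.MkdirTemp("", "digest-locks")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	unlock, err := lockDigest(dir, "sha256:example")
	if err != nil {
		panic(err)
	}
	ok, _ := tryDeleteDigest(dir, "sha256:example")
	fmt.Println("delete allowed while a pull holds the digest:", ok) // false
	unlock()
	ok, _ = tryDeleteDigest(dir, "sha256:example")
	fmt.Println("delete allowed after release:", ok) // true
}
```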

Ah, right, the “handles” can’t, for an ideal solution, be in-process-only because some other process using c/storage could come and delete apparently-unreferenced layers. That, together with the need to actually orphan the layers if the process pulling the image dies, does seem rather complex (similar to, but not quite the same as, containers/image#611, because that one focuses on exclusion / sequencing; this one should, ideally, allow several processes pulling an image in parallel to keep a handle to the same parent layer while they concurrently pull different child layers.)

Focusing on the single-process CRI-O situation should be much easier, and get us ~60% of the benefit.

mtrmac · Jun 25 '19 18:06

@vrothberg @mtrmac @nalind Is this still something we should pursue?

rhatdan · Nov 01 '19 08:11

I think so, yes, but solving it across processes is tough. As @mtrmac mentions above, it has a similar nature to https://github.com/containers/image/pull/611 (i.e., the blob-copy detection PR). Tough to get right. Let's leave this issue open.

vrothberg · Nov 01 '19 10:11

@vrothberg This is still something you want to pursue, correct?

rhatdan · Aug 03 '20 13:08

@nalind wants to have a look at it this week.

vrothberg · Aug 03 '20 13:08