[CRI] Remove image store

Open mxpv opened this issue 3 years ago • 15 comments

This PR refactors CRI and removes the in-memory image store in favor of containerd's metadata image store. The goal is to simplify the CRI code and rely more on containerd APIs instead of maintaining custom layers.

So instead of an in-memory cache, this PR relies on containerd's metadata store (and labels) to keep the additional image information CRI needs. It preserves the existing logic of creating a separate image record per reference, but now attaches the appropriate labels, so we can rely on boltdb alone.

Labels can be appended on demand, on event, or at daemon start; some of these may be removed in 2.0. The labels we currently add include the config digest (which CRI uses as the image ID), the image size, the chain ID, etc.

In the new implementation, searching for image references comes down to querying the metadata store for all records that carry the same image ID label. Queries are slower compared to the original implementation, but boltdb is still reasonably fast (and there is room for optimization if this becomes a bottleneck).
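As a rough sketch of what such a lookup could look like from the containerd Go client, assuming a hypothetical `example.io/cri/image-id` label key (the actual label names in the PR may differ):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	// Every reference is stored as its own image record; all records for the
	// same image carry the config digest (the CRI image ID) as a label.
	imageID := "sha256:0123456789abcdef..." // placeholder config digest
	filter := fmt.Sprintf(`labels."example.io/cri/image-id"==%q`, imageID)

	// One boltdb-backed metadata query replaces the old in-memory index.
	records, err := client.ImageService().List(ctx, filter)
	if err != nil {
		log.Fatal(err)
	}
	for _, img := range records {
		fmt.Println(img.Name) // each known reference to this image
	}
}
```

The filter uses containerd's standard label filter syntax, so no separate index is needed to go from an image ID back to its references.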

mxpv avatar Jun 14 '22 23:06 mxpv

I've no idea how to fix the fuzzing stuff :( This was introduced in https://github.com/containerd/containerd/pull/7052. @AdamKorcz any ideas?

mxpv avatar Jun 15 '22 19:06 mxpv

@mikebrow this PR replaces the entire image store with a set of labels assigned to the Image (so it's already decoupled). Moving some labels elsewhere would be super straightforward.

mxpv avatar Jun 16 '22 04:06 mxpv

Rebased against main to pick up the fix for the Fuzz CI.

mxpv avatar Jun 16 '22 04:06 mxpv

What sort of performance numbers are we looking at here (moving away from the in-memory list)? Kubelet has a 30-second loop requesting the full list of images in its namespace.

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/images/image_gc_manager.go#L195-L203

mikebrow avatar Jun 18 '22 13:06 mikebrow

@mikebrow I wrote a quick benchmark to get a rough order of magnitude.

30,000 images with 3 references per image (so 90,000 metadata image records in boltdb) takes ~2 seconds on a 2019 MacBook Pro.

https://gist.github.com/mxpv/8811a208ebee32af5e0fe8155a751f10
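For reference, here is a minimal sketch of that kind of benchmark against a boltdb-backed metadata image store (the gist above is the authoritative version; the store wiring, label key, and image contents below are simplified assumptions):

```go
package metadata_test

import (
	"context"
	"fmt"
	"path/filepath"
	"testing"

	"github.com/containerd/containerd/content/local"
	"github.com/containerd/containerd/images"
	"github.com/containerd/containerd/metadata"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/snapshots"
	"github.com/opencontainers/go-digest"
	v1 "github.com/opencontainers/image-spec/specs-go/v1"
	bolt "go.etcd.io/bbolt"
)

// newImageStore builds an image store on top of a temporary boltdb database,
// roughly the way the containerd daemon does.
func newImageStore(b *testing.B) images.Store {
	dir := b.TempDir()
	cs, err := local.NewStore(dir)
	if err != nil {
		b.Fatal(err)
	}
	bdb, err := bolt.Open(filepath.Join(dir, "meta.db"), 0644, nil)
	if err != nil {
		b.Fatal(err)
	}
	db := metadata.NewDB(bdb, cs, map[string]snapshots.Snapshotter{})
	if err := db.Init(context.Background()); err != nil {
		b.Fatal(err)
	}
	return metadata.NewImageStore(db)
}

func BenchmarkListImages(b *testing.B) {
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")
	store := newImageStore(b)

	// 30,000 images x 3 references each = 90,000 metadata records.
	const numImages, refsPerImage = 30000, 3
	for i := 0; i < numImages; i++ {
		id := digest.FromString(fmt.Sprintf("config-%d", i)).String()
		for r := 0; r < refsPerImage; r++ {
			_, err := store.Create(ctx, images.Image{
				Name:   fmt.Sprintf("registry.example/image-%d:tag-%d", i, r),
				Labels: map[string]string{"example.io/cri/image-id": id},
				Target: v1.Descriptor{
					MediaType: v1.MediaTypeImageIndex,
					Digest:    digest.FromString(fmt.Sprintf("target-%d", i)),
					Size:      1,
				},
			})
			if err != nil {
				b.Fatal(err)
			}
		}
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Full listing, the shape of a CRI ListImages call.
		if _, err := store.List(ctx); err != nil {
			b.Fatal(err)
		}
	}
}
```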

mxpv avatar Jun 20 '22 03:06 mxpv

@mxpv Do you think the image ID labels will be a long-term (post-2.0) solution? The existing code makes some assumptions about a single platform that this would mostly inherit. We could come up with platform-specific labels as well to handle that, but I'm not sure we have fully thought through the multi-platform use case for CRI images.

dmcgowan avatar Jun 20 '22 22:06 dmcgowan

@dmcgowan I haven't put much thought into it; we can bring this up on the next community call. I'd like to learn more about the potential use cases. Labels are not very convenient to work with, so maybe for 2.0 and beyond we can rethink the image store structure a bit and make the image ID a first-class citizen. We could, for instance, keep one metadata record per image together with all references to it, so there would be no need to gather references from the store, and the image ID would be guaranteed to be unique and platform independent.
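Purely as a hypothetical sketch (not anything in this PR), the kind of shape I have in mind:

```go
// ImageRecord is a hypothetical post-2.0 record: one entry per image ID
// instead of one entry per reference plus labels.
type ImageRecord struct {
	// ID is the config digest CRI uses as the image ID; unique and
	// platform independent by construction.
	ID string
	// References lists every name (tag or digest) currently pointing at
	// this image, so there is no need to gather them via a label query.
	References []string
	// Data stored as labels today, such as size and chain ID, could
	// become regular fields.
	Size    int64
	ChainID string
}
```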

mxpv avatar Jun 20 '22 23:06 mxpv

I have concerns about the CPU and memory burst issue.

Back in 2018, we used containerd's ListImages remote API in pouchcontainer, and we found that containerd's CPU usage periodically got high. As I remember, the node had around 30 images. The profile showed us that containerd spent the extra CPU reading JSON from disk and unmarshalling it. We had to use a cache in pouchcontainer...

I used a local plugin to benchmark it. The patch is here.

### Current design
➜  integration git:(benchmark-cri-image-list) sudo -E $(command -v go) test -count=1 -bench=. -benchmem -v -run BenchmarkCRIList -memprofile=/tmp/mem.profile -cpuprofile=/tmp/cpu.profile
INFO[0000] Using the following image list: {Alpine:docker.io/library/alpine:latest BusyBox:docker.io/library/busybox:latest Pause:k8s.gcr.io/pause:3.7 ResourceConsumer:k8s.gcr.io/e2e-test-images/resource-consumer:1.10 VolumeCopyUp:ghcr.io/containerd/volume-copy-up:2.1 VolumeOwnership:ghcr.io/containerd/volume-ownership:2.1} 
goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/integration
cpu: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
BenchmarkCRIList13Images
BenchmarkCRIList13Images-8         13789             87710 ns/op           28010 B/op        317 allocs/op
BenchmarkCRIList26Images
BenchmarkCRIList26Images-8          5893            178976 ns/op           54217 B/op        609 allocs/op
PASS
ok      github.com/containerd/containerd/integration    3.559s

### This patch
➜  integration git:(testing-7061) sudo -E $(command -v go) test -count=1 -bench=. -benchmem -v -run BenchmarkCRIList -memprofile=/tmp/mem.profile -cpuprofile=/tmp/cpu.profile
INFO[0000] Using the following image list: {Alpine:docker.io/library/alpine:latest BusyBox:docker.io/library/busybox:latest Pause:k8s.gcr.io/pause:3.7 ResourceConsumer:k8s.gcr.io/e2e-test-images/resource-consumer:1.10 VolumeCopyUp:ghcr.io/containerd/volume-copy-up:2.1 VolumeOwnership:ghcr.io/containerd/volume-ownership:2.1} 
goos: linux
goarch: amd64
pkg: github.com/containerd/containerd/integration
cpu: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
BenchmarkCRIList13Images
BenchmarkCRIList13Images-8          1012           1114956 ns/op          417320 B/op       4332 allocs/op
BenchmarkCRIList26Images
BenchmarkCRIList26Images-8           567           2093443 ns/op          788533 B/op       8445 allocs/op
PASS
ok      github.com/containerd/containerd/integration    3.061s
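Side by side, the numbers above work out to roughly a 12x slowdown per ListImages call and about 14x more allocations:

| Images | ns/op (current) | ns/op (this patch) | Ratio |
|---|---|---|---|
| 13 | 87,710 | 1,114,956 | ~12.7x |
| 26 | 178,976 | 2,093,443 | ~11.7x |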

If we walk the disk to get the JSON, the CPU profile looks like this:

[CPU profile screenshot]

Kubelet only requests the image list every 30s, which is not too bad. But the cache layer lets containerd work with low overhead. Just my two cents.

fuweid avatar Jun 22 '22 17:06 fuweid

Pushed https://github.com/containerd/containerd/pull/7061/commits/1eb20bf1d5a90b48e76fca7ea1375be606cec686 to avoid reading from disk / unmarshalling. We needed that just for the User field, so it's easy to work around with one more label instead.

mxpv avatar Jun 22 '22 17:06 mxpv

ping @mikebrow @fuweid @dmcgowan is there anything else I can do here to push this forward?

mxpv avatar Jul 05 '22 20:07 mxpv

1. Copying the windows/linux resource fields over here: https://github.com/containerd/containerd/pull/6517/files#diff-c2be5fd4b479b8b7191b7800ba25be98107c143e0a13575917220a4bfebe132aR101-R104 to eliminate the spec cost for the status checks...

2. We are talking about the possibility of adding image events to the PLEG work, which would allow us to stretch the once-every-30s sync cycle check for the image list to something much longer. The only reason to keep the image sync cycle at all would be to ensure no events were lost.

mikebrow avatar Jul 05 '22 20:07 mikebrow

> Copying the windows/linux resource fields over here: https://github.com/containerd/containerd/pull/6517/files#diff-c2be5fd4b479b8b7191b7800ba25be98107c143e0a13575917220a4bfebe132aR101-R104 to eliminate the spec cost for the status checks...

I've removed the image spec reading in https://github.com/containerd/containerd/commit/1eb20bf1d5a90b48e76fca7ea1375be606cec686, which should reduce CPU pressure. I'm a bit hesitant to embed big chunks of the runtime spec, as it's easy to hit the label size limit; it's much easier to add specific fields as requirements come up.
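For illustration, the shape of that workaround is roughly the following (the label key and helper name are made up, not the exact code from the commit):

```go
package cri

import (
	"context"

	"github.com/containerd/containerd/images"
	v1 "github.com/opencontainers/image-spec/specs-go/v1"
)

// withUserLabel records the image config's User field as a label, so later
// status/list calls can read it from the metadata store instead of
// unmarshalling the config blob from disk.
func withUserLabel(ctx context.Context, store images.Store, img images.Image, config v1.ImageConfig) (images.Image, error) {
	if img.Labels == nil {
		img.Labels = map[string]string{}
	}
	img.Labels["example.io/cri/image-user"] = config.User

	// Update only the "labels" fieldpath; the rest of the record is untouched.
	return store.Update(ctx, img, "labels")
}
```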

> We are talking about the possibility of adding image events to the PLEG work, which would allow us to stretch the once-every-30s sync cycle check for the image list to something much longer. The only reason to keep the image sync cycle at all would be to ensure no events were lost.

Do we want to hold this until the PLEG events work lands? I have further CRI work that is blocked by this PR :(

mxpv avatar Jul 05 '22 21:07 mxpv

@mxpv Sorry for the late reply. I will take a look in two days. Will update later~

fuweid avatar Jul 06 '22 15:07 fuweid

> Do we want to hold this until the PLEG events work lands? I have further CRI work that is blocked by this PR :(

TL;DR: can we move the cache down, or just rewrite it in the form you need for the work that's blocked?

The PLEG work for image events isn't in the plan for Kubernetes v1.25, just the container state events, with a possible stretch goal of pod events. The KEP actually states: "Addressing container image relisting via CRI events is out of scope for this enhancement at this point in time."

The KEP feature change window for v1.25 closed a couple of weeks back. We could do image events for v1.26. But that won't fix it (the need for caching) right away, due to feature gating and support streams.

The PLEG feature(s) would first need to go from alpha to beta. K8s features start with feature gates that are off by default in alpha, are then enabled by default in a subsequent release, and the feature gate goes away when the feature goes GA in a later release still. Having big performance problems with the gate off would be seen as a regression.

Then, because of the decoupled nature of containerd and the kubelet, older versions of Kubernetes will be running newer versions of containerd. So we would need every k8s support stream that does not have the image PLEG events GA version to reach EOL (they use an n-2 support strategy), and only then could we deprecate our cache, in a future version of containerd that is only supported on versions of k8s that no longer need the 30s image list cache.

And non-kubelet users of CRI would also need to adopt the events model or accept the performance change.

I think we need a cache of some sort for CRI image listing at least until this time next year, possibly 18-24 months.

mikebrow avatar Jul 06 '22 15:07 mikebrow

Side note: this is another example where we need switches for optimizing for memory vs. CPU vs. disk ... resources.

mikebrow avatar Jul 06 '22 15:07 mikebrow

@mxpv This needs a rebase.

samuelkarp avatar Oct 26 '22 01:10 samuelkarp

This needs to be reworked. I'll get back to this PR once PLEG is adopted.

mxpv avatar Jan 12 '23 03:01 mxpv