Reuse pgzip reader to avoid excessive buffer allocation
Here is the output from pulling a jboss/wildfly image with the debugging diff below applied:
INFO[2021-08-18 14:10:31.535974307-07:00] Pulling image: docker.io/jboss/wildfly:latest id=8c2b83df-c6b9-4718-9164-7a92a34c29fc name=/runtime.v1alpha2.ImageService/PullImage
INFO[2021-08-18 14:10:31.578657001-07:00] Trying to access "docker.io/jboss/wildfly:latest"
INFO[2021-08-18 14:10:33.410380288-07:00] Trying to access "docker.io/jboss/wildfly:latest"
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating buffer in pool
Allocating 4 blocks of size 1048576
Allocating buffer in pool
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating buffer in pool
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating buffer in pool
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating buffer in pool
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating 4 blocks of size 1048576
Allocating buffer in pool
INFO[2021-08-18 14:10:49.887144958-07:00] Pulled image: docker.io/jboss/wildfly@sha256:1f39dbbe874367e008c6d70fa687f1afec00a0304a53a8cf7d832b77221f6922 id=8c2b83df-c6b9-4718-9164-7a92a34c29fc name=/runtime.v1alpha2.ImageService/PullImage
Diff:
diff --git a/vendor/github.com/klauspost/pgzip/gunzip.go b/vendor/github.com/klauspost/pgzip/gunzip.go
index d1ae730b2..5eab78acb 100644
--- a/vendor/github.com/klauspost/pgzip/gunzip.go
+++ b/vendor/github.com/klauspost/pgzip/gunzip.go
@@ -17,6 +17,7 @@ package pgzip
 import (
 	"bufio"
 	"errors"
+	"fmt"
 	"hash"
 	"hash/crc32"
 	"io"
@@ -118,6 +119,7 @@ func NewReader(r io.Reader) (*Reader, error) {
 	z.multistream = true
 	z.blockPool = make(chan []byte, z.blocks)
 	for i := 0; i < z.blocks; i++ {
+		fmt.Printf("Allocating %v blocks of size %v\n", z.blocks, z.blockSize)
 		z.blockPool <- make([]byte, z.blockSize)
 	}
 	if err := z.readHeader(true); err != nil {
There is a NewReaderN constructor that allows us to control the number and size of the blocks allocated. Also, we could benefit from allocating the gzip Reader from a sync.Pool, so the buffers are reused better during image pulls.
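Roughly, a sketch of the NewReaderN idea (a sketch only; the package and function names and the 512 KiB × 2 tuning values are placeholders, not the real c/image code):

```go
package layerdecompress

import (
	"io"

	"github.com/klauspost/pgzip"
)

// newTunedGzipReader wraps a compressed layer stream in a pgzip reader that
// pre-allocates fewer and smaller read-ahead blocks than the defaults
// (4 blocks × 1 MiB, i.e. the 4 MiB per reader visible in the log above).
func newTunedGzipReader(compressed io.Reader) (*pgzip.Reader, error) {
	// Placeholder tuning values: 2 blocks of 512 KiB each.
	return pgzip.NewReaderN(compressed, 512*1024, 2)
}
```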
@nalind @giuseppe @rhatdan @umohnani8
@mtrmac
https://github.com/emicklei/go-restful/blob/v3/compressor_cache.go and https://github.com/emicklei/go-restful/blob/v3/compressor_pools.go implement a cache and a pool of gzip readers/writers.
They don't seem like many allocations, but we could probably improve it, as you suggested, with a pool of decompressors.
We can use Reader.Reset(r io.Reader) to reuse an already-allocated decompressor with a new stream.
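Something like this is what the pool-plus-Reset combination could look like (a sketch; the helper names are made up, and it assumes Reset on a previously closed pgzip.Reader keeps and reuses its internal block buffers, which we should verify):

```go
package layerdecompress

import (
	"io"
	"sync"

	"github.com/klauspost/pgzip"
)

// readerPool keeps *pgzip.Reader values alive between layer pulls so their
// internal block buffers are reused instead of reallocated for every layer.
var readerPool sync.Pool

// getPooledReader returns a decompressor for the given stream, reusing a
// pooled reader via Reset when one is available.
func getPooledReader(compressed io.Reader) (*pgzip.Reader, error) {
	cached := readerPool.Get()
	if cached == nil {
		return pgzip.NewReader(compressed)
	}
	zr := cached.(*pgzip.Reader)
	if err := zr.Reset(compressed); err != nil {
		return nil, err
	}
	return zr, nil
}

// putPooledReader hands a reader back to the pool once the caller is done.
func putPooledReader(zr *pgzip.Reader) {
	_ = zr.Close() // stop the read-ahead goroutine; buffers stay with the reader
	readerPool.Put(zr)
}
```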
It looks like ~160 MB for that one image, and with concurrent image pulls we could spike much higher. Instead, we could have a max-concurrent-layer-pulls knob and tie the allocations to that, so our memory usage is more predictable.
Does reuse actually help spikes? If there are concurrent image pulls, that implies concurrent buffer / pgzip uses, so reusing allocations over time does not decrease the amount of memory actually necessary.
Sure, making work easier for the GC to avoid the GC over-allocating might still be very useful (tens of percent of memory usage is real memory), but ultimately spike prevention and predictability might require actually limiting the concurrency, much more than reducing allocations.
Isn't cri-o pulling in a dedicated cgroup? If so, we could (in addition) set memory limits there.
> Isn't cri-o pulling in a dedicated cgroup? If so, we could (in addition) set memory limits there.
That’s an extremely important point. “In a dedicated cgroup” actually means in a separate process (https://github.com/cri-o/cri-o/blob/abbfb13221625fa8145253272079723dea4f742d/internal/storage/image.go#L467 ), so any work we could do to reduce allocations across copies would be completely ineffective. Any reuse in that case would be scoped to a single copy.Image.
But I can see nothing setting SeparatePullCgroup in the CRI-O repo or in MCO. I might be missing a place.
Does anyone know for certain?
Pulling in a separate cgroup isn't enabled by default today; that's something we have to test for various scenarios. The main motivation to enable it is charging the image pull to a pod. However, a pod could specify a low memory limit like ~20 MB, and we want image pulls to succeed even in such cases.
> Sure, making work easier for the GC to avoid the GC over-allocating might still be very useful (tens of percent of memory usage is real memory), but ultimately spike prevention and predictability might require actually limiting the concurrency, much more than reducing allocations.
Agreed that limiting the concurrency will limit how much we allocate at a time, but I am advocating for both. If we only want to do one of the two, we can start with limiting the concurrency.
> It looks like ~160 MB for that one image, and with concurrent image pulls we could spike much higher. Instead, we could have a max-concurrent-layer-pulls knob and tie the allocations to that, so our memory usage is more predictable.
BTW the standard library gzip implementation uses about 32 KB, instead of 4 MB. That would, I think, make this problem pretty much completely go away; the price we would pay is, per https://github.com/containers/storage/pull/255 , “30% improvement in the wall clock time”.
> BTW the standard library gzip implementation uses about 32 KB, instead of 4 MB. That would, I think, make this problem pretty much completely go away; the price we would pay is, per #255 , “30% improvement in the wall clock time”.
Some data in https://github.com/containers/image/issues/1349#issuecomment-902057119 .
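For comparison, the standard-library route mentioned above is basically a drop-in of compress/gzip (a sketch only; single-threaded decompression with small fixed internal buffers):

```go
package layerdecompress

import (
	"compress/gzip"
	"io"
)

// newStdlibGzipReader decompresses with the standard library instead of pgzip.
// Per-stream memory drops to roughly the ~32 KB flate window, at the cost of
// losing pgzip's parallel read-ahead (the ~30% wall-clock win cited above).
func newStdlibGzipReader(compressed io.Reader) (*gzip.Reader, error) {
	return gzip.NewReader(compressed)
}
```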
We could, in theory, synchronize across processes as well by means of shared memory. But making it opt-in for one process space would be my favored approach (e.g., a semaphore that can be passed to copy.Options).
@mtrmac @mrunalp WDYT?
> We could, in theory, synchronize across processes as well by means of shared memory.
Way too much infrastructure (shared memory across what scope? all of the UID is not obviously what the user wants; how do we recover from a pull starting, claiming one of the limited slots, and then crashing?), at least for now, when we aren’t actually using the out-of-process pulls.
> But making it opt-in for one process space would be my favored approach (e.g., a semaphore that can be passed to copy.Options).
Yes, that seems like a great option (c/image already uses a semaphore for limiting concurrency, so we would just expand the scope).
Yeah, opt-in sounds good to me. Also, we will need to tune it in OpenShift for different environments depending on how many resources they have.
For the record, letting the caller of copy.Image limit concurrency was implemented by @vrothberg in https://github.com/containers/image/pull/1356 .
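For illustration, this is roughly the shape a caller-side limit takes (a sketch only; withPullSlot and the weight of 2 are hypothetical placeholders, and the real mechanism is whatever option that PR added to copy.Options):

```go
package pulllimit

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// pullSlots caps how many image copies run at once; 2 is an arbitrary
// placeholder for an environment-specific knob.
var pullSlots = semaphore.NewWeighted(2)

// withPullSlot runs one image copy while holding a slot, so the total
// decompression-buffer usage stays roughly proportional to the configured
// concurrency instead of the number of in-flight pull requests.
func withPullSlot(ctx context.Context, pull func(context.Context) error) error {
	if err := pullSlots.Acquire(ctx, 1); err != nil {
		return err
	}
	defer pullSlots.Release(1)
	return pull(ctx)
}
```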