storage icon indicating copy to clipboard operation
storage copied to clipboard

OOMs in zstd code with zstd:chunked

Open cgwalters opened this issue 1 year ago • 3 comments

I am consistently getting skopeo being OOM killed when running this to convert a local overlay image I have to an OCI zstd:chunked format (so I can easily inspect/debug it):

skopeo --dest-compress-format zstd:chunked containers-storage:localhost/bootc-bound oci:/var/home/walters/tmp/oci:bootc-zstd-chunked :arrow_right: [195791.594936] Out of memory: Killed process 8421 (skopeo) total-vm:12788772kB, anon-rss:7695604kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:16388kB oom_score_adj:0

And yeah, I only have 8G allocated to this podman-machine instance right now.

This being my first time needing to profile a Go application's memory usage, some searching on the internet helped me craft this patch to skopeo:

diff --git a/cmd/skopeo/main.go b/cmd/skopeo/main.go
index 8b8de3c7..64831579 100644
--- a/cmd/skopeo/main.go
+++ b/cmd/skopeo/main.go
@@ -3,6 +3,8 @@ package main
 import (
 	"context"
 	"fmt"
+	"os"
+	"runtime/pprof"
 	"strings"
 	"time"
 
@@ -33,6 +35,7 @@ type globalOptions struct {
 	commandTimeout     time.Duration           // Timeout for the command execution
 	registriesConfPath string                  // Path to the "registries.conf" file
 	tmpDir             string                  // Path to use for big temporary files
+	memprofile         string                  // Path to use for memory profiling
 }
 
 // requireSubcommand returns an error if no sub command is provided
@@ -91,6 +94,7 @@ func createApp() (*cobra.Command, *globalOptions) {
 		logrus.Fatal("unable to mark registries-conf flag as hidden")
 	}
 	rootCommand.PersistentFlags().StringVar(&opts.tmpDir, "tmpdir", "", "directory used to store temporary files")
+	rootCommand.PersistentFlags().StringVar(&opts.memprofile, "memprofile", "", "write memory profiles to this path")
 	flag := commonFlag.OptionalBoolFlag(rootCommand.Flags(), &opts.tlsVerify, "tls-verify", "Require HTTPS and verify certificates when accessing the registry")
 	flag.Hidden = true
 	rootCommand.AddCommand(
@@ -120,6 +124,23 @@ func (opts *globalOptions) before(cmd *cobra.Command, args []string) error {
 	if opts.tlsVerify.Present() {
 		logrus.Warn("'--tls-verify' is deprecated, please set this on the specific subcommand")
 	}
+	if opts.memprofile != "" {
+		go func() {
+			ticker := time.NewTicker(time.Second * 5)
+
+			for range ticker.C {
+				f, err := os.Create(opts.memprofile + ".tmp")
+				if err != nil {
+					panic(err)
+				}
+				pprof.WriteHeapProfile(f)
+				f.Close()
+				if err := os.Rename(opts.memprofile+".tmp", opts.memprofile); err != nil {
+					panic(err)
+				}
+			}
+		}()
+	}
 	return nil
 }
 

The results look like this:

(pprof) top
Showing nodes accounting for 3542.16MB, 99.63% of 3555.19MB total
Dropped 119 nodes (cum <= 17.78MB)
Showing top 10 nodes out of 20
      flat  flat%   sum%        cum   cum%
 2448.56MB 68.87% 68.87%  2448.56MB 68.87%  github.com/klauspost/compress/zstd.encoderOptions.encoder
    1072MB 30.15% 99.03%     1072MB 30.15%  github.com/klauspost/compress/zstd.(*fastBase).ensureHist (inline)
   18.77MB  0.53% 99.55%    18.77MB  0.53%  github.com/klauspost/compress/zstd.(*blockEnc).init
    2.83MB 0.079% 99.63%   278.78MB  7.84%  github.com/klauspost/compress/zstd.(*Encoder).Reset
         0     0% 99.63%   140.49MB  3.95%  github.com/containers/storage/pkg/chunked/compressor.newTarSplitData
         0     0% 99.63%  3415.66MB 96.08%  github.com/containers/storage/pkg/chunked/compressor.writeZstdChunkedStream
         0     0% 99.63%  3135.88MB 88.21%  github.com/containers/storage/pkg/chunked/compressor.writeZstdChunkedStream.func3
         0     0% 99.63%  3415.66MB 96.08%  github.com/containers/storage/pkg/chunked/compressor.zstdChunkedWriterWithLevel.func1
         0     0% 99.63%   278.78MB  7.84%  github.com/containers/storage/pkg/chunked/internal.ZstdWriterWithLevel
         0     0% 99.63%  3135.88MB 88.21%  github.com/klauspost/compress/zstd.(*Encoder).Close

That's only 2.4G of memory, but presumably I caught it just before some more serious memory usage inflation.

The next thing I tried was doing "pure" zstd (not zstd:chunked), and that seems quite reliable. The profiler seems to report pretty flat usage:

(pprof) top
Showing nodes accounting for 82.50MB, 92.96% of 88.74MB total
Showing top 10 nodes out of 136
      flat  flat%   sum%        cum   cum%
      64MB 72.12% 72.12%       64MB 72.12%  github.com/klauspost/compress/zstd.(*fastBase).ensureHist
    5.47MB  6.17% 78.28%     5.47MB  6.17%  github.com/klauspost/compress/zstd.encoderOptions.encoder
    4.63MB  5.21% 83.50%     4.63MB  5.21%  github.com/klauspost/pgzip.NewReader
    2.26MB  2.55% 86.04%     2.26MB  2.55%  github.com/klauspost/compress/zstd.(*blockEnc).init
    1.13MB  1.27% 87.32%     8.86MB  9.99%  github.com/klauspost/compress/zstd.(*Encoder).Reset
    1.01MB  1.14% 88.45%     1.01MB  1.14%  regexp/syntax.(*compiler).inst
       1MB  1.13% 89.58%        1MB  1.13%  github.com/modern-go/reflect2.(*unsafeType).UnsafeNew
       1MB  1.13% 90.71%     1.50MB  1.69%  github.com/tchap/go-patricia/v2/patricia.NewTrie
       1MB  1.13% 91.84%        1MB  1.13%  github.com/json-iterator/go.(*Iterator).ReadString
       1MB  1.13% 92.96%     2.50MB  2.82%  github.com/containers/storage/pkg/truncindex.(*TruncIndex).addID

cgwalters avatar Aug 04 '24 22:08 cgwalters