skaffold icon indicating copy to clipboard operation
skaffold copied to clipboard

The `--cache-artifacts=false` flag provoking race-condition and leads to the deletion of wrong images

Open AlexzAK opened this issue 1 year ago • 0 comments

Expected behavior

The --cache-artifacts=false flag works as specified in the docs.

Actual behavior

Some percent of the skaffold run --cache-artifcats=false end up with error ErrImagePull. For our project broken build happens in 10% of cases. Depends on the project.

Information

  • Skaffold version: v2.7.1
  • Operating system: Linux, Ubuntu 20.04.6
  • Installed via: GitHub Releases
  • Contents of skaffold.yaml: Not provided

Steps to reproduce the behavior

  1. Slow down image prune in the code.
  2. Run skaffold run --cache-artifcats=false many times.

In the file pkg/skaffold/build/local/local.go there is a code:

func (b *Builder) Build(ctx context.Context, out io.Writer, a *latest.Artifact) build.ArtifactBuilder {
	if b.prune {
		b.localPruner.asynchronousCleanupOldImages(ctx, []string{a.ImageName})
 	}
 	builder := build.WithLogFile(b.buildArtifact, b.muted)
 	return builder

It eventually calls cleanup with sync = false:

func (p *pruner) cleanup(ctx context.Context, sync bool, artifacts []string) {
	toPrune := p.collectImagesToPrune(ctx, artifacts)
	if len(toPrune) == 0 {
		return
	}

	if sync {
		err := p.runPrune(ctx, toPrune)
		if err != nil {
			log.Entry(ctx).Debugf("Failed to prune: %v", err)
		}
	} else {
		go func() {
			err := p.runPrune(ctx, toPrune)
			if err != nil {
				log.Entry(ctx).Debugf("Failed to prune: %v", err)
			}
		}()
	}
}

If there are a lot of fat docker images, then the runPrune will take some time to delete them.

Meanwhile the rebuild process is running concurrently and might to finish image rebuild before the runPrune call completes. If that happens you might see events in wrong order:

  • First rebuild the docker image by id
  • Second delete the docker image by id

If there were no changes in the docker file then the id of the image have not been changed. And the Skaffold deleted it. Due to that the project will not be able to start due to missing image ( ErrImagePull).

It is possible to run the prune with sync = true to fix that:

diff --git a/pkg/skaffold/build/local/local.go b/pkg/skaffold/build/local/local.go
index 368894b36..768a5d570 100644
--- a/pkg/skaffold/build/local/local.go
+++ b/pkg/skaffold/build/local/local.go
@@ -32,7 +32,7 @@ import (
 // its checksum. It streams build progress to the writer argument.
 func (b *Builder) Build(ctx context.Context, out io.Writer, a *latest.Artifact) build.ArtifactBuilder {
 	if b.prune {
-		b.localPruner.asynchronousCleanupOldImages(ctx, []string{a.ImageName})
+		b.localPruner.synchronousCleanupOldImages(ctx, []string{a.ImageName})
 	}
 	builder := build.WithLogFile(b.buildArtifact, b.muted)
 	return builder

But it kills the performance, because docker is slow to delete the images.

AlexzAK avatar Jan 11 '24 13:01 AlexzAK