The `--cache-artifacts=false` flag provoking race-condition and leads to the deletion of wrong images
Expected behavior
The --cache-artifacts=false flag works as specified in the docs.
Actual behavior
Some percent of the skaffold run --cache-artifcats=false end up with error ErrImagePull. For our project broken build happens in 10% of cases. Depends on the project.
Information
- Skaffold version: v2.7.1
- Operating system: Linux, Ubuntu 20.04.6
- Installed via: GitHub Releases
- Contents of skaffold.yaml: Not provided
Steps to reproduce the behavior
- Slow down image prune in the code.
- Run
skaffold run --cache-artifcats=falsemany times.
In the file pkg/skaffold/build/local/local.go there is a code:
func (b *Builder) Build(ctx context.Context, out io.Writer, a *latest.Artifact) build.ArtifactBuilder {
if b.prune {
b.localPruner.asynchronousCleanupOldImages(ctx, []string{a.ImageName})
}
builder := build.WithLogFile(b.buildArtifact, b.muted)
return builder
It eventually calls cleanup with sync = false:
func (p *pruner) cleanup(ctx context.Context, sync bool, artifacts []string) {
toPrune := p.collectImagesToPrune(ctx, artifacts)
if len(toPrune) == 0 {
return
}
if sync {
err := p.runPrune(ctx, toPrune)
if err != nil {
log.Entry(ctx).Debugf("Failed to prune: %v", err)
}
} else {
go func() {
err := p.runPrune(ctx, toPrune)
if err != nil {
log.Entry(ctx).Debugf("Failed to prune: %v", err)
}
}()
}
}
If there are a lot of fat docker images, then the runPrune will take some time to delete them.
Meanwhile the rebuild process is running concurrently and might to finish image rebuild before the runPrune call completes. If that happens you might see events in wrong order:
- First rebuild the docker image by id
- Second delete the docker image by id
If there were no changes in the docker file then the id of the image have not been changed. And the Skaffold deleted it. Due to that the project will not be able to start due to missing image ( ErrImagePull).
It is possible to run the prune with sync = true to fix that:
diff --git a/pkg/skaffold/build/local/local.go b/pkg/skaffold/build/local/local.go
index 368894b36..768a5d570 100644
--- a/pkg/skaffold/build/local/local.go
+++ b/pkg/skaffold/build/local/local.go
@@ -32,7 +32,7 @@ import (
// its checksum. It streams build progress to the writer argument.
func (b *Builder) Build(ctx context.Context, out io.Writer, a *latest.Artifact) build.ArtifactBuilder {
if b.prune {
- b.localPruner.asynchronousCleanupOldImages(ctx, []string{a.ImageName})
+ b.localPruner.synchronousCleanupOldImages(ctx, []string{a.ImageName})
}
builder := build.WithLogFile(b.buildArtifact, b.muted)
return builder
But it kills the performance, because docker is slow to delete the images.