
Feature Request: Automated Recovery

Open navarr opened this issue 2 years ago • 17 comments

I have a large GitHub Actions workflow that pushes over 600 images to the GitHub Container Registry.

This mostly works fine, except that I have to set max-parallel based on how many images I expect to be building at a time, and even then I sometimes hit APIs too fast or get a rare transient error.

For example:

buildx failed with: ERROR: failed to solve: failed to compute cache key: failed to copy: httpReadSeeker: failed open: unexpected status code https://ghcr.io/v2/swiftotter/den-php-fpm/blobs/sha256:456f646c7993e2a08fbdcbb09c191d153118d0c8675e1a0b29f83895c425105f: 500 Internal Server Error - Server message: unknown

or

buildx failed with: ERROR: failed to solve: failed to compute cache key: failed to copy: read tcp 172.17.0.2:59588->185.199.111.154:443: read: connection timed out

or

buildx failed with: ERROR: failed to solve: failed to do request: Head "https://ghcr.io/v2/swiftotter/den-php-fpm-debug/blobs/sha256:d6b642fadba654351d3fc430b0b9a51f7044351daaed3d27055b19044d29ec66": dial tcp: lookup ghcr.io on 168.63.129.16:53: read udp 172.17.0.2:40862->168.63.129.16:53: i/o timeout

These are all temporary errors that disappear the moment I re-run the job. What I wish for instead is that in such cases - timeouts, server errors, or too-many-requests errors - some sort of automated backoff-and-retry system kicks in, with configurable limits.
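
Purely as an illustration of what I have in mind - none of these inputs exist in build-push-action today, and the tag shown is just an example:

      - name: Build and push
        uses: docker/build-push-action@v3
        with:
          push: true
          tags: ghcr.io/swiftotter/den-php-fpm:latest
          # Hypothetical inputs, for illustration only: retry on transient
          # failures (timeouts, 5xx, 429) with a configurable backoff.
          retry-max-attempts: 3
          retry-backoff-seconds: 30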

navarr avatar Nov 04 '22 17:11 navarr

We are occasionally seeing errors like these as well. For the cache-related ones, I guess an option could also be to continue despite the failure, as writing to the cache probably isn't a critical failure that should require the entire workflow to fail.
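
If BuildKit offered a way to tolerate cache-export failures, one place it could surface is as an extra attribute on the cache-to string, along these lines (a sketch only; the ref names are placeholders and the attribute may not be available in the BuildKit version you run):

      - uses: docker/build-push-action@v3
        with:
          push: true
          cache-from: type=registry,ref=ghcr.io/my-org/my-image:cache
          # ignore-error=true asks BuildKit not to fail the build when the
          # cache export itself fails (placeholder refs, sketch only).
          cache-to: type=registry,ref=ghcr.io/my-org/my-image:cache,mode=max,ignore-error=true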

Tenzer avatar Nov 16 '22 11:11 Tenzer

We run into this somewhat often. Is a retry/retries option feasible for pushing images?

nick-at-work avatar Jan 12 '23 16:01 nick-at-work

some additional info: https://github.com/ddelange/pycuda/actions/runs/3972373867/jobs/6830922090

#22 DONE 31.0s

#23 exporting to image
#23 exporting layers
#23 exporting layers 9.9s done
#23 exporting manifest sha256:5cc09704d37dcab52f35d0dc1163acdae52fbb9a265cbb1fe9d55625a61307e9 done
#23 exporting config sha256:50aea776612d2a2916b237afb2e1e59c96b8134105791f65029ac253797dc840 done
#23 exporting attestation manifest sha256:c76e2b4622ef918f2ecc26fd4710f0a68f3f9b57f744ba23ff8130a21b5b3a7d done
#23 exporting manifest sha256:b82c6873647b10e6c1d13f754a225756d879380d6d51983250c0167b1af84874 done
#23 exporting config sha256:af2a1fee215b05d4d8b7893bc20a305a7a257fed7cb59c1ab21715286c89d08f done
#23 exporting attestation manifest sha256:52a9106f51393e6435cbed771d7b4e288c820d84e745c681dbdcc3d3a72bc67d done
#23 ...

#24 [auth] ddelange/pycuda/jupyter:pull,push token for ghcr.io
#24 DONE 0.0s

#23 exporting to image
#23 exporting manifest list sha256:dd095d62b30f27dd9ee27b81a0eabd77ab15387dc44f6833686ff20a005452a2 done
#23 pushing layers
#23 pushing layers 2.0s done
#23 ERROR: failed to push ghcr.io/ddelange/pycuda/jupyter:3.9-master: failed to copy: io: read/write on closed pipe

These images are each around 2 GB+, so a retry might error again unless there's a 'resume', i.e. layers that were already pushed successfully don't need to be pushed again.

ddelange avatar Jan 25 '23 08:01 ddelange

I would like to +1 this! For us, this happens maybe once a day just during continuous deployment (so not counting PRs).

The errors we're seeing look like transient infrastructure errors:

buildx failed with: ERROR: failed to solve: failed to push ghcr.io/<our_org_name>/<our_repo_name>/<our_image_name>:2023.06.14-1605-f0c78f6: failed to copy: failed to do request: Put "https://ghcr.io/v2/<our_org_name>/<our_repo_name>/<our_image_name>/blobs/upload/9508e842-68bb-4779-9f40-6d8cf25357ff?digest=sha256%3A7cb00f153a2766267a4fbe7b14f830de29010a56c96486af21b7b9bf3c8838f0": write tcp 172.17.0.2:35594->140.82.114.33:443: write: broken pipe

Having an internal option to retry the step would be fantastic. We don't want to retry the whole job, as that could mean re-running things that should not be retried, like the test suite.
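
As a stopgap, the retry can be pushed down to the shell level by calling the buildx CLI directly instead of the action - a rough sketch (the image name and attempt counts are made up, and note this re-runs the build as well, although BuildKit's cache should make repeat attempts cheap):

      - name: Build and push with naive retry
        run: |
          # Retry the build+push a few times with a linear backoff.
          for attempt in 1 2 3; do
            if docker buildx build --push -t ghcr.io/my-org/my-image:latest .; then
              exit 0
            fi
            echo "build/push failed (attempt ${attempt}), backing off..."
            sleep $(( attempt * 30 ))
          done
          exit 1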

kkom avatar Jun 14 '23 22:06 kkom

We see this too, both during simple container image promotion for Fluent Bit releases - when pushing the 3 supported architectures to ghcr.io in parallel, at least one of them usually fails - and when building the multi-arch images, which is a huge time sink: it takes a long time to build with QEMU and then just fails to push, so we have to restart the whole lot again.

patrick-stephens avatar Jun 26 '23 15:06 patrick-stephens

I'm seeing this very often now, especially with parallel builds. This is without any flakiness reported on the GH side: https://www.githubstatus.com

I have tried pinning BuildKit to v0.10.6 and v0.12.0, but that didn't seem to help much: https://github.com/docker/build-push-action/issues/761#issuecomment-1645602086

It would be good to have more resilient retries (with the understanding that 100% reliability is obviously not achievable).

dinvlad avatar Jul 25 '23 23:07 dinvlad

We're consistently (daily) seeing this class of errors, and a re-run always seems to resolve the issue. I suspect retries could help here, but in the absence of that, is there another solution available?

I noticed in https://github.com/moby/buildkit/pull/3430 there's an ignore-error=true that applies only? to cache-to. But I'm not convinced this would fix the reported (and our) issue.

Sample GHA workflow

Omitted some steps re. auth, but otherwise, this captures the relevant bits

steps:
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    driver-opts: |
      network=host
- name: build
  uses: docker/build-push-action@v5
  with:
    cache-from: type=registry,ref=ghcr.io/org/${{ matrix.container }}:cache
    cache-to: type=registry,ref=ghcr.io/org/${{ matrix.container }}:cache,mode=max
    file: Dockerfile.${{ matrix.container }}
    push: true
    build-args: |
      DOCKER_ORG=${{ env.PRIVATE_REGISTRY_URL }}
      BASEBUILD_REVISION=${{ github.sha }}
      GIT_REVISION=${{ github.sha }}
      GIT_BRANCH=${{ github.ref_name }}
    tags: |
      ${{ env.PRIVATE_REGISTRY_URL }}/${{ matrix.container }}:latest
      ${{ env.PRIVATE_REGISTRY_URL }}/${{ matrix.container }}:${{ github.sha }}
    # With provenance: true, docker ends up pushing the image separately into
    # multiple files and manifests, which not all clients can read.
    provenance: false

And the errors we're seeing:

Dockerfile.foo:8
--------------------
   7 |     
   8 | >>> RUN --mount=type=cache,target=/root/.cache/go-build \
   9 | >>>     --mount=type=cache,target=/go/pkg/mod \
  10 | >>>     CGO_ENABLED=0 \
  11 | >>>     go build ...
--------------------
ERROR: failed to solve: failed to compute cache key: failed to copy: httpReadSeeker: failed open: unexpected status code https://us-docker.pkg.dev/v2/private-registry-1/org/repo/blobs/sha256:7000ed6e4e7e4306dd5132f9372b65e712919de4653f1a97c530c6fac9ad2e1f: 500 Internal Server Error
Error: buildx failed with: ERROR: failed to solve: failed to compute cache key: failed to copy: httpReadSeeker: failed open: unexpected status code https://us-docker.pkg.dev/v2/private-registry-1/org/repo/blobs/sha256:7000ed6e4e7e4306dd5132f9372b65e712919de4653f1a97c530c6fac9ad2e1f: 500 Internal Server Error

mfridman avatar Oct 18 '23 16:10 mfridman

But I'm not convinced this would fix the reported (and our) issue.

Does it not work on your side?

Edit: Oh you mean when fetching cache I think right?

crazy-max avatar Oct 18 '23 16:10 crazy-max

But I'm not convinced this would fix the reported (and our) issue.

Does it not work on your side?

Edit: Oh you mean when fetching cache I think right?

I haven't tried it (yet). What I meant was that, at a cursory glance, ignoring errors on cache-to seems like it may not be related, but I still don't quite understand the root cause of:

failed to compute cache key: failed to copy: httpReadSeeker: failed open: unexpected status ... <registry 500>

There was an issue with GCP Artifact Registry in the past 24 hours, so maybe my comment above is an isolated issue (although we noticed this error in the prior weeks).

~Anecdotally, every re-run resolves the issue, which suggests it's less an issue with the registry and more with the cache? If so, the proposed retries would help in these scenarios.~ EDIT: this statement isn't correct; we subsequently hit 3 failures in a row.

Ended up disabling the cache, but at this point it seems increasingly likely that it's a transient issue with GCP Artifact Registry.

(Apologies for the noise).

        uses: docker/build-push-action@v5
        with:
+         no-cache: true
-         cache-from: type=registry,ref=ghcr.io/org/${{ matrix.container }}:cache
-         cache-to: type=registry,ref=ghcr.io/org/${{ matrix.container }}:cache,mode=max,ignore-error=true

mfridman avatar Oct 18 '23 16:10 mfridman

I marked my comments above as off-topic since it was an isolated incident related to GCP Artifact Registry.

However, I'd still +1 this feature request. During that incident, retries would have been extremely useful.

mfridman avatar Oct 20 '23 15:10 mfridman

Seeing this regularly.

#12 [backend 5/7] COPY Website .
#12 ERROR: failed to copy: read tcp 172.17.0.2:59196->20.209.147.161:443: read: connection timed out
------
...
--------------------
ERROR: failed to solve: failed to compute cache key: failed to copy: read tcp 172.17.0.2:59196->20.209.147.161:443: read: connection timed out

Santas avatar Nov 10 '23 11:11 Santas

+1 on this feature, I get push-related errors around twice a day and retrying usually fixes it:

buildx failed with: ERROR: failed to solve: failed to push europe-west3-docker.pkg.dev/my-project/my-project/my-project:main: failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://europe-west3-docker.pkg.dev/v2/token?scope=repository%3Amy-project%2Fmy-project%my-project%3Apull%2Cpush&service=europe-west3-docker.pkg.dev: 401 Unauthorized

tonynajjar avatar Mar 01 '24 07:03 tonynajjar

+1 on this feature, I get push-related errors around twice a day and retrying usually fixes it:

buildx failed with: ERROR: failed to solve: failed to push europe-west3-docker.pkg.dev/my-project/my-project/my-project:main: failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://europe-west3-docker.pkg.dev/v2/token?scope=repository%3Amy-project%2Fmy-project%my-project%3Apull%2Cpush&service=europe-west3-docker.pkg.dev: 401 Unauthorized

richaarora01 avatar Mar 16 '24 00:03 richaarora01

Is this feature under development, or is it still being considered? 👀

DhanshreeA avatar Apr 09 '24 07:04 DhanshreeA

+1, We also experience transient issues that could be handled by retries.

tmccoy avatar Jun 05 '24 18:06 tmccoy

+1 for backoff retries

mcbenjemaa avatar Jun 11 '24 12:06 mcbenjemaa

Just wanted to share my solution to this problem. It's the most naive approach: wait a minute, then try again if it failed.

      - name: Build and push Docker image
        continue-on-error: true
        id: buildx1
        uses: docker/build-push-action
        # ...and so on. Then:
      - name: Wait to retry
        if: steps.buildx1.outcome != 'success'
        run: |
          sleep 60
      - name: Build and push Docker image
        uses: docker/build-push-action
        if: steps.buildx1.outcome != 'success'
        # ...and so on

This has reduced my random failures to almost zero, I must say.

eiriksm avatar Jun 14 '24 07:06 eiriksm