build-push-action
Feature Request: Automated Recovery
I have a large GitHub Actions workflow that pushes over 600 images up to the GitHub Container Registry.
This mostly works fine, except that I have to set max-parallel based on how many images I expect to be running at a time, and even then I'm sometimes hitting APIs too fast or getting a rare error.
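For reference, a minimal sketch of what that cap looks like in the matrix strategy (the cap value is arbitrary and the matrix is truncated; the real one has 600+ entries):

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      # Hypothetical cap: limit how many image builds/pushes run at once.
      max-parallel: 20
      matrix:
        image: [den-php-fpm, den-php-fpm-debug]  # truncated; the real list is much longer
    # ...build-push-action steps for ${{ matrix.image }} follow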
For example, the errors look like:
buildx failed with: ERROR: failed to solve: failed to compute cache key: failed to copy: httpReadSeeker: failed open: unexpected status code https://ghcr.io/v2/swiftotter/den-php-fpm/blobs/sha256:456f646c7993e2a08fbdcbb09c191d153118d0c8675e1a0b29f83895c425105f: 500 Internal Server Error - Server message: unknown
or
buildx failed with: ERROR: failed to solve: failed to compute cache key: failed to copy: read tcp 172.17.0.2:59588->185.199.111.154:443: read: connection timed out
or
buildx failed with: ERROR: failed to solve: failed to do request: Head "https://ghcr.io/v2/swiftotter/den-php-fpm-debug/blobs/sha256:d6b642fadba654351d3fc430b0b9a51f7044351daaed3d27055b19044d29ec66": dial tcp: lookup ghcr.io on 168.63.129.16:53: read udp 172.17.0.2:40862->168.63.129.16:53: i/o timeout
These are all temporary errors that disappear the moment I re-run the job. What I wish for instead is that in such cases - timeouts, server errors, too-many-requests errors - some sort of automated backoff-and-retry mechanism kicks in, with configurable limits.
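To make the ask concrete, here is a sketch of what such knobs could look like - note that none of these retry inputs exist in build-push-action today; the names are purely hypothetical:

- uses: docker/build-push-action@v5
  with:
    push: true
    tags: ghcr.io/swiftotter/den-php-fpm:latest
    # Hypothetical inputs, not implemented by the action today:
    retries: 3                  # give up after 3 attempts
    retry-backoff: exponential  # back off between attempts
    retry-on: timeout,429,5xx   # only retry transient failures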
We are occasionally seeing errors like these as well. For the cache-related ones, an option could also be to continue despite the failure, as failing to write to the cache probably doesn't constitute a critical error that should fail the entire workflow.
We run into this somewhat often. Is a retry/retries option feasible for pushing images?
Some additional info: https://github.com/ddelange/pycuda/actions/runs/3972373867/jobs/6830922090
#22 DONE 31.0s
#23 exporting to image
#23 exporting layers
#23 exporting layers 9.9s done
#23 exporting manifest sha256:5cc09704d37dcab52f35d0dc1163acdae52fbb9a265cbb1fe9d55625a61307e9 done
#23 exporting config sha256:50aea776612d2a2916b237afb2e1e59c96b8134105791f65029ac253797dc840 done
#23 exporting attestation manifest sha256:c76e2b4622ef918f2ecc26fd4710f0a68f3f9b57f744ba23ff8130a21b5b3a7d done
#23 exporting manifest sha256:b82c6873647b10e6c1d13f754a225756d879380d6d51983250c0167b1af84874 done
#23 exporting config sha256:af2a1fee215b05d4d8b7893bc20a305a7a257fed7cb59c1ab21715286c89d08f done
#23 exporting attestation manifest sha256:52a9106f51393e6435cbed771d7b4e288c820d84e745c681dbdcc3d3a72bc67d done
#23 ...
#24 [auth] ddelange/pycuda/jupyter:pull,push token for ghcr.io
#24 DONE 0.0s
#23 exporting to image
#23 exporting manifest list sha256:dd095d62b30f27dd9ee27b81a0eabd77ab15387dc44f6833686ff20a005452a2 done
#23 pushing layers
#23 pushing layers 2.0s done
#23 ERROR: failed to push ghcr.io/ddelange/pycuda/jupyter:3.9-master: failed to copy: io: read/write on closed pipe
These images are each around 2 GB+, so a retry might error again if there's no 'resume', i.e. layers that were pushed successfully shouldn't need to be pushed again.
I would like to +1 this! For us, this happens maybe once a day just during continuous deployment (so not counting PRs).
The errors we're seeing look like transient infrastructure errors:
buildx failed with: ERROR: failed to solve: failed to push ghcr.io/<our_org_name>/<our_repo_name>/<our_image_name>:2023.06.14-1605-f0c78f6: failed to copy: failed to do request: Put "https://ghcr.io/v2/<our_org_name>/<our_repo_name>/<our_image_name>/blobs/upload/9508e842-68bb-4779-9f40-6d8cf25357ff?digest=sha256%3A7cb00f153a2766267a4fbe7b14f830de29010a56c96486af21b7b9bf3c8838f0": write tcp 172.17.0.2:35594->140.82.114.33:443: write: broken pipe
Having an internal option to retry the step would be fantastic. We don't want to retry the whole job, as that could mean re-running things that should not be retried, like the test suite.
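In the meantime, one possible workaround (just a sketch; it bypasses this action and calls the buildx CLI directly, and the image reference is a placeholder) is to wrap the push in a shell retry loop so that only this step is retried:

- name: Build and push with retry
  run: |
    # Sketch: retry the buildx push up to 3 times with a growing backoff.
    for attempt in 1 2 3; do
      docker buildx build --push --tag ghcr.io/my-org/my-image:latest . && exit 0
      echo "Push failed (attempt $attempt), backing off..."
      sleep $((attempt * 60))
    done
    exit 1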
We see this too, both on simple container image promotion for Fluent Bit releases - when pushing the 3 supported architectures to ghcr.io in parallel, at least one of them usually fails - and when building the multi-arch images, which is a huge time sink: it takes a long time to build with QEMU and then just fails to push, so we have to restart the whole lot again.
I'm seeing this very often now, especially with parallel builds. This is without any flakiness reported on the GitHub status page: https://www.githubstatus.com
I have tried the pinned BuildKit versions v0.10.6 and v0.12.0, but that didn't seem to help much: https://github.com/docker/build-push-action/issues/761#issuecomment-1645602086
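For context, pinning the BuildKit version can be done via setup-buildx-action's driver-opts (a sketch; v0.12.0 below is just one of the two versions mentioned above):

- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    # Pin the BuildKit image used by the docker-container driver.
    driver-opts: image=moby/buildkit:v0.12.0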
It would be good to have more resilient retries (with the understanding that 100% reliability is obviously not achievable).
We're consistently (daily) seeing this class of errors, and a re-run always seems to resolve the issue. I suspect retries could help here, but in lieu of that, is there another solution available?
I noticed in https://github.com/moby/buildkit/pull/3430 there's an ignore-error=true option that applies (only?) to cache-to. But I'm not convinced this would fix the reported (and our) issue.
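For completeness, that option is set directly on the cache-to line, e.g. (registry and ref here are placeholders):

cache-to: type=registry,ref=ghcr.io/my-org/my-image:cache,mode=max,ignore-error=true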
Sample GHA workflow
Some auth-related steps are omitted, but otherwise this captures the relevant bits:
steps:
  - name: Set up Docker Buildx
    uses: docker/setup-buildx-action@v3
    with:
      driver-opts: |
        network=host
  - name: build
    uses: docker/build-push-action@v5
    with:
      cache-from: type=registry,ref=ghcr.io/org/${{ matrix.container }}:cache
      cache-to: type=registry,ref=ghcr.io/org/${{ matrix.container }}:cache,mode=max
      file: Dockerfile.${{ matrix.container }}
      push: true
      build-args: |
        DOCKER_ORG=${{ env.PRIVATE_REGISTRY_URL }}
        BASEBUILD_REVISION=${{ github.sha }}
        GIT_REVISION=${{ github.sha }}
        GIT_BRANCH=${{ github.ref_name }}
      tags: |
        ${{ env.PRIVATE_REGISTRY_URL }}/${{ matrix.container }}:latest
        ${{ env.PRIVATE_REGISTRY_URL }}/${{ matrix.container }}:${{ github.sha }}
      # With provenance: true, docker ends up pushing the image separately into
      # multiple files and manifests, which not all clients can read.
      provenance: false
And the errors we're seeing:
Dockerfile.foo:8
--------------------
7 |
8 | >>> RUN --mount=type=cache,target=/root/.cache/go-build \
9 | >>> --mount=type=cache,target=/go/pkg/mod \
10 | >>> CGO_ENABLED=0 \
11 | >>> go build ...
--------------------
ERROR: failed to solve: failed to compute cache key: failed to copy: httpReadSeeker: failed open: unexpected status code https://us-docker.pkg.dev/v2/private-registry-1/org/repo/blobs/sha256:7000ed6e4e7e4306dd5132f9372b65e712919de4653f1a97c530c6fac9ad2e1f: 500 Internal Server Error
Error: buildx failed with: ERROR: failed to solve: failed to compute cache key: failed to copy: httpReadSeeker: failed open: unexpected status code https://us-docker.pkg.dev/v2/private-registry-1/org/repo/blobs/sha256:7000ed6e4e7e4306dd5132f9372b65e712919de4653f1a97c530c6fac9ad2e1f: 500 Internal Server Error
But I'm not convinced this would fix the reported (and our) issue.
Does it not work on your side?
Edit: Oh, I think you mean when fetching the cache, right?
I haven't tried it (yet). What I meant was that, at a cursory glance, ignoring errors on cache-to seems like it may not be related, but I still don't quite understand the root cause of:
failed to compute cache key: failed to copy: httpReadSeeker: failed open: unexpected status ... <registry 500>
There was an issue with GCP Artifact Registry in the past 24 hours, so maybe my comment above is an isolated issue (although we noticed this error in the prior weeks).
~Anecdotally, every re-run resolves the issue, which suggests it's less an issue with the registry and more with the cache? If true, then the proposal for retries would help in these scenarios.~ EDIT: this statement isn't correct; we subsequently hit 3 failures in a row.
Ended up disabling the cache, but at this point it's looking more likely that it was a transient issue with GCP Artifact Registry.
(Apologies for the noise).
  uses: docker/build-push-action@v5
  with:
+   no-cache: true
-   cache-from: type=registry,ref=ghcr.io/org/${{ matrix.container }}:cache
-   cache-to: type=registry,ref=ghcr.io/org/${{ matrix.container }}:cache,mode=max,ignore-error=true
I marked my comments above as off-topic since it was an isolated incident related to GCP artifact registry.
However, I'd still +1 this feature request. During that incident, retries would have been extremely useful.
Seeing this regularly.
#12 [backend 5/7] COPY Website .
#12 ERROR: failed to copy: read tcp 172.17.0.2:59196->20.209.147.161:443: read: connection timed out
------
...
--------------------
ERROR: failed to solve: failed to compute cache key: failed to copy: read tcp 172.17.0.2:59196->20.209.147.161:443: read: connection timed out
+1 on this feature, I get push-related errors around twice a day and retrying usually fixes it:
buildx failed with: ERROR: failed to solve: failed to push europe-west3-docker.pkg.dev/my-project/my-project/my-project:main: failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://europe-west3-docker.pkg.dev/v2/token?scope=repository%3Amy-project%2Fmy-project%my-project%3Apull%2Cpush&service=europe-west3-docker.pkg.dev: 401 Unauthorized
Is this feature under development, or is it still being considered? 👀
+1, we also experience transient issues that could be handled by retries.
+1 for backoff retries
Just wanted to share my solution to this problem. It's the most naive approach: wait a minute and try again if the build failed.
- name: Build and push Docker image
  continue-on-error: true
  id: buildx1
  uses: docker/build-push-action@v5
  # ...and so on. Then:
- name: Wait to retry
  if: steps.buildx1.outcome != 'success'
  run: |
    sleep 60
- name: Build and push Docker image
  uses: docker/build-push-action@v5
  if: steps.buildx1.outcome != 'success'
  # ...and so on
This has reduced my random failures to almost zero, I must say.