GCP Cloud Build 1 hour timeout - failed to fetch oauth token: unexpected status: 401 Unauthorized
Background
At my company, I have a new Apple M1 MacBook Pro. Most of our build infrastructure uses amd64 images, which run very slowly and are flaky on the arm64 M1 laptop. I've been working on building multi-architecture images using docker buildx and have run into a problem automating these builds in GCP Cloud Build and publishing to/from GCP Artifact Registry.
Our regular amd64 image build normally takes about 15 minutes. Adding the arm64 platform shoots the build time up to an hour and a half. While this build works locally on my laptop, it times out on GCP Cloud Build if it takes longer than an hour.
Problem Synopsis
Running a cloud build that does docker buildx build --platform linux/amd64,linux/arm64 --push times out when it attempts to push items to Artifact Registry:
Step #1 - "long-build": > exporting to image:
Step #1 - "long-build": ------
Step #1 - "long-build": error: failed to solve: failed to fetch oauth token: unexpected status: 401 Unauthorized
Workaround
I discovered the workaround is to break the build into three steps:
- Build without --push or --cache-to
- Stop buildx builder
- Build normally
My guess is that stopping the builder forces oauth tokens to be re-fetched in step 3.
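The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual script from my repo: the function names, the IMAGE argument, and the DRY_RUN switch are made up for the example, and it assumes the builder keeps its cache across a stop so that the final build is mostly cache hits.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the three-step workaround.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
PLATFORMS="${PLATFORMS:-linux/amd64,linux/arm64}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

three_step_build() {
  local image="$1"
  local context="${2:-.}"
  # Step 1: the long build, with no --push or --cache-to, so nothing
  # talks to the registry at the end of the hour-plus run.
  run docker buildx build --platform "$PLATFORMS" -t "$image" "$context" || return 1
  # Step 2: stop the current builder (my guess: this forces fresh tokens).
  run docker buildx stop || return 1
  # Step 3: build normally with --push; assumed to reuse the cached layers.
  run docker buildx build --platform "$PLATFORMS" -t "$image" --push "$context"
}
```

Calling three_step_build us-central1-docker.pkg.dev/project/repo/image_name:image_tag with DRY_RUN=1 just prints the three commands; set DRY_RUN=0 to actually run them.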
Details
A full explanation with reproducible example can be found at my build-time repo.
Next Steps
It isn't clear where this problem lies (e.g., in buildx or with Google Cloud Build/Artifact Registry or some combination). I'm also going to raise this issue with Google through my company as a support ticket.
I haven't dug into the internals of buildx (and how it interacts with things like GCP Artifact Registry) or how Docker works in an environment like GCP Cloud Build and was hoping a buildx core contributor might have an idea of where to look or what might be happening. The workaround of stopping the builder is an interesting clue.
Possible solution?
Not sure if this will work, but maybe buildx can request the tokens on demand, rather than at startup?
This ZcashFoundation PR seems to fix the same issue in GitHub CI. I wonder if there is a way to set this in GCP Cloud Build?
# Some builds might take over an hour, and Google's default lifetime duration for
# an access token is 1 hour (3600s). We increase this to 3 hours (10800s)
# as some builds take over an hour.
access_token_lifetime: 10800s
Docs on this state:
access_token_lifetime: (Optional) Desired lifetime duration of the access token, in seconds. This must be specified as the number of seconds with a trailing "s" (e.g. 30s). The default value is 1 hour (3600s). The maximum value is 1 hour, unless the constraints/iam.allowServiceAccountCredentialLifetimeExtension organization policy is enabled, in which case the maximum value is 12 hours.
I time-boxed a dive into this at an hour, but didn't learn much (too many unknowns about how Docker runs inside of GCP).
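To make the arithmetic concrete, here is a tiny sketch comparing a build duration against a token lifetime written in the "trailing s" format from the docs quoted above. The function names are hypothetical, and this only illustrates the numbers; it does not configure anything in Cloud Build.

```shell
# Hypothetical helpers: compare a build duration (in seconds) with an
# access token lifetime given as "<seconds>s", e.g. "3600s".
lifetime_seconds() {
  # Strip the trailing "s" required by the access_token_lifetime format.
  echo "${1%s}"
}

token_outlives_build() {
  local build_seconds="$1"
  local lifetime="$2"
  # Succeeds only if the build finishes before the token would expire.
  [ "$build_seconds" -lt "$(lifetime_seconds "$lifetime")" ]
}
```

With our hour-and-a-half build (5400 seconds), token_outlives_build 5400 3600s fails, while token_outlives_build 5400 10800s succeeds, which matches what we see: the push at the end of the build happens after the default 1 hour token has expired.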
+1 @dougdonohoe seeing this issue as well. Workaround for me was to create a service account json key that I passed in base64 through env var and then performed a docker login with it prior to the buildx command. This is not ideal for security but working. Posting here to help out the next person to encounter this.
Here is implementation for anyone stuck and finding this through google:
- name: gcr.io/cloud-builders/docker
  entrypoint: 'bash'
  args:
    - -c
    - |
      # Write key to "/workspace"
      printf $_SERVICE_KEY | base64 --decode > /workspace/key.json
  id: base64 env secret key
- name: gcr.io/cloud-builders/docker
  entrypoint: 'bash'
  args:
    - -c
    - |
      # Read from "/workspace"
      docker logout https://gcr.io && cat /workspace/key.json | docker login -u _json_key --password-stdin https://gcr.io &&
      docker buildx build --platform $_DOCKER_BUILDX_PLATFORMS --build-arg GITHUB_TOKEN=$_GITHUB_TOKEN \
        --build-arg CRYPTO_KEY=$_CRYPTO_KEY --build-arg PROJECT_NAME=$_PROJECT_NAME \
        -t gcr.io/$PROJECT_ID/$_DEPLOYMENT_NAME:$BRANCH_NAME \
        -t gcr.io/$PROJECT_ID/$_DEPLOYMENT_NAME:${BRANCH_NAME}_${BUILD_ID} --push .
  id: build-multi-architecture-container-image
Thanks @parkerroan - out of curiosity, what permissions did you give the service account that key belongs to? Also, this sounds like a good use for Google Secret Manager. My article, noted below, has an example of using it for ssh keys.
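As a rough sketch of the Secret Manager idea (untested against this exact setup): instead of passing the base64 key through an env var, the key could be read from a secret and piped straight into docker login. The function name, the secret name, and the DRY_RUN switch are hypothetical, and it assumes the build step runs in an image that has both gcloud and docker available and that the build's service account is allowed to access the secret.

```shell
# Hypothetical sketch: log in to the registry with a service account JSON
# key stored in Secret Manager, without writing the key to /workspace.
# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

registry_login_from_secret() {
  local secret="$1"
  local registry="${2:-https://gcr.io}"
  if [ "$DRY_RUN" = "1" ]; then
    echo "gcloud secrets versions access latest --secret=$secret | docker login -u _json_key --password-stdin $registry"
    return 0
  fi
  # Read the latest secret version and hand it to docker login on stdin.
  gcloud secrets versions access latest --secret="$secret" \
    | docker login -u _json_key --password-stdin "$registry"
}
```

For example, registry_login_from_secret my-registry-key would replace the printf/base64 step and the cat /workspace/key.json pipe, where my-registry-key is a placeholder secret name.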
This also might be of interest: I just published a Medium post on how to use arm64 VMs to vastly speed up docker buildx builds in GCP Cloud Build. For us, this has reduced build times to under an hour.
Also mentioned in that article is my docker-buildx.sh script, which I use to break the docker buildx build into three steps to avoid this bug.
Hey @dougdonohoe, I just stumbled into this very issue. Did you find a way to solve this? Is there any way to extend the lifetime of the token provided by Google?
@Davidnet I haven't explored @parkerroan's solution yet. What we do is use this script to break the build into two parts, which has provided some success: https://github.com/dougdonohoe/multi-arch-docker/blob/main/docker-buildx.sh
Perfect, @dougdonohoe. I made it work by specifying the image in the images field instead of doing docker push.
I could see in the logs that Cloud Build got unexpected status: 401 Unauthorized, but I imagine that Cloud Build retried and was eventually successful, so I guess if you specify images, Cloud Build will push it by default.
Hope this can bring visibility to this
"options": {
  "diskSizeGb": "200",
  "machineType": "N1_HIGHCPU_8",
  "logging": "CLOUD_LOGGING_ONLY"
},
"timeout": "18000s",
"images": [
  "us-central1-docker.pkg.dev/project/repo/image_name:image_tag"
]