GCP Cloud Build 1 hour timeout - failed to fetch oauth token: unexpected status: 401 Unauthorized

Open dougdonohoe opened this issue 2 years ago • 1 comments

Background

At my company, I have a new Apple M1 MacBook Pro. Most of our build infrastructure uses amd64 images, which run very slowly and are flaky on the arm64 M1 laptop. I've been working on building multi-architecture images using docker buildx and have run into a problem automating these builds in GCP Cloud Build and publishing to/from GCP Artifact Registry.

Our regular amd64 image build normally takes about 15 minutes. Adding the arm64 platform shoots the build time to an hour and a half. While this build works locally on my laptop, it times out on GCP Cloud Build if it is longer than an hour.

Problem Synopsis

Running a cloud build that does docker buildx build --platform linux/amd64,linux/arm64 --push times out when it attempts to push items to Artifact Registry:

Step #1 - "long-build":  > exporting to image:
Step #1 - "long-build": ------
Step #1 - "long-build": error: failed to solve: failed to fetch oauth token: unexpected status: 401 Unauthorized

Workaround

I discovered the workaround is to break the build into three steps:

  1. Build without --push or --cache-to
  2. Stop buildx builder
  3. Build normally

My guess is that stopping the builder forces oauth tokens to be re-fetched in step 3.
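
The three steps above can be sketched as a shell function. This is a minimal sketch, not the actual script from the repo: the names `run` and `three_step_build` and the `DRY_RUN` switch are hypothetical, and it assumes the builder in use is the currently selected one, so `docker buildx stop` is called without a builder name.

```shell
# Sketch of the three-step workaround. All names here are illustrative assumptions.

# Run a command, or only print it when DRY_RUN=1 (handy for checking the plan).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: three_step_build IMAGE_TAG [PLATFORMS]
three_step_build() {
  local tag="$1"
  local platforms="${2:-linux/amd64,linux/arm64}"

  # Step 1: build without --push or --cache-to, so no registry write happens yet.
  run docker buildx build --platform "$platforms" -t "$tag" . || return 1

  # Step 2: stop the currently selected builder instance.
  run docker buildx stop || return 1

  # Step 3: build normally. The earlier build should be served from the
  # builder's cache, so this step is expected to be mostly the push.
  run docker buildx build --platform "$platforms" -t "$tag" --push .
}
```

Calling `DRY_RUN=1 three_step_build some/image:tag` prints the three commands instead of running them, which makes it easy to check the sequence before using it in a real build step.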

Details

A full explanation with a reproducible example can be found at my build-time repo.

Next Steps

It isn't clear where this problem lies (e.g., in buildx or with Google Cloud Build/Artifact Registry or some combination). I'm also going to raise this issue with Google through my company as a support ticket.

I haven't dug into the internals of buildx (and how it interacts with things like GCP Artifact Registry) or how Docker works in an environment like GCP Cloud Build and was hoping a buildx core contributor might have an idea of where to look or what might be happening. The workaround of stopping the builder is an interesting clue.

Possible solution?

Not sure if this will work, but maybe buildx can request the tokens on demand, rather than at startup?

dougdonohoe avatar Jul 11 '22 12:07 dougdonohoe

This ZcashFoundation PR seems to fix the same issue in GitHub CI. I wonder if there is a way to set this in GCP Cloud Build?

          # Some builds might take over an hour, and Google's default lifetime duration for
          # an access token is 1 hour (3600s). We increase this to 3 hours (10800s)
          # as some builds take over an hour.
          access_token_lifetime: 10800s

Docs on this state:

access_token_lifetime: (Optional) Desired lifetime duration of the access token, in seconds. This must be specified as the number of seconds with a trailing "s" (e.g. 30s). The default value is 1 hour (3600s). The maximum value is 1 hour, unless the constraints/iam.allowServiceAccountCredentialLifetimeExtension organization policy is enabled, in which case the maximum value is 12 hours.

I time-boxed a dive into this at an hour, but didn't learn much (too many unknowns about how Docker runs inside of GCP).

dougdonohoe avatar Jul 11 '22 12:07 dougdonohoe

+1 @dougdonohoe, I'm seeing this issue as well. The workaround for me was to create a service account JSON key, pass it in base64-encoded through an env var, and then perform a docker login with it prior to the buildx command. This is not ideal for security, but it works. Posting here to help out the next person who encounters this.

Here is the implementation for anyone who is stuck and finds this through Google:

  - name: gcr.io/cloud-builders/docker
    entrypoint: 'bash'
    args:
      - -c
      - |
        # Write key to "/workspace"
        printf '%s' "$_SERVICE_KEY" | base64 --decode > /workspace/key.json
    id: base64 env secret key
  - name: gcr.io/cloud-builders/docker
    entrypoint: 'bash'
    args:
      - -c
      - |
        # Read from "/workspace"
        docker logout https://gcr.io && cat /workspace/key.json | docker login -u _json_key --password-stdin https://gcr.io &&
        docker buildx build --platform $_DOCKER_BUILDX_PLATFORMS --build-arg GITHUB_TOKEN=$_GITHUB_TOKEN \
        --build-arg CRYPTO_KEY=$_CRYPTO_KEY --build-arg PROJECT_NAME=$_PROJECT_NAME \
        -t gcr.io/$PROJECT_ID/$_DEPLOYMENT_NAME:$BRANCH_NAME \
        -t gcr.io/$PROJECT_ID/$_DEPLOYMENT_NAME:${BRANCH_NAME}_${BUILD_ID} --push .
    id: build-multi-architecture-container-image

parkerroan avatar Jan 23 '23 20:01 parkerroan

Thanks @parkerroan - out of curiosity, what permissions did you give the service account that key belongs to? Also, this sounds like a good use for Google Secret Manager. My article, noted below, has an example of using it for ssh keys.
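
As a rough sketch of the Secret Manager idea, the key could be read at build time and piped straight into docker login, so it never sits in a substitution variable or on disk. This is an assumption-laden example rather than a tested setup: it assumes the step runs in an image that has both gcloud and docker available, that the build's service account is allowed to access the secret, and that the secret holds the raw JSON key. The function name `login_from_secret` and the secret name `registry-key` are hypothetical.

```shell
# Sketch only: log in to the registry with a service account key held in
# Secret Manager. Function and secret names are illustrative assumptions.

# Usage: login_from_secret SECRET_NAME [REGISTRY]
login_from_secret() {
  local secret="$1"
  local registry="${2:-https://gcr.io}"

  # Read the latest version of the secret and hand it to docker login on stdin.
  # If the read fails, docker login receives empty input and should fail too.
  gcloud secrets versions access latest --secret="$secret" \
    | docker login -u _json_key --password-stdin "$registry"
}
```

It would be called right before the buildx command, for example `login_from_secret registry-key https://gcr.io`.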

This also might be of interest: I just published a Medium post on how to use arm64 VMs to vastly speed up docker buildx builds in GCP Cloud Build. For us, this has reduced build times to under an hour.

Also mentioned in that article is my docker-buildx.sh script which I use to break the docker buildx into three steps to avoid this bug.

dougdonohoe avatar Jan 25 '23 13:01 dougdonohoe

Hey @dougdonohoe, I just stumbled into this very issue. Did you find a way to solve it? Is there any way to extend the lifetime of the token provided by Google?

Davidnet avatar May 02 '23 12:05 Davidnet

@Davidnet I haven't explored @parkerroan's solution yet. What we do is use this script to break the build into two parts, which has provided some success: https://github.com/dougdonohoe/multi-arch-docker/blob/main/docker-buildx.sh

dougdonohoe avatar May 02 '23 14:05 dougdonohoe

Perfect, @dougdonohoe. I made it work by specifying the image in the images field instead of doing docker push. I could see in the logs that Cloud Build got unexpected status: 401 Unauthorized, but I imagine that Cloud Build retried and was eventually successful. So I guess that if you specify images, Cloud Build will push them by default.

Hope this can bring visibility to this

    "options": {
        "diskSizeGb": "200",
        "machineType": "N1_HIGHCPU_8",
        "logging": "CLOUD_LOGGING_ONLY"
    },
    "timeout": "18000s",
    "images": [
        "us-central1-docker.pkg.dev/project/repo/image_name:image_tag"
    ]

Davidnet avatar May 02 '23 19:05 Davidnet