argocd-image-updater

AWS ECR: Could not set registry endpoint credentials ... failed timeout after 10s

Open • bcbrockway opened this issue 1 year ago • 7 comments

Describe the bug

We run the Image Updater on EKS clusters, using IRSA to link it to an IAM role that grants it permissions to our ECR registry. In addition, we have an auth script configured to run an awscli command to fetch a new token every 11 hours:

# configmap/argocd-image-updater-config
# ...
data:
  registries.conf: |
    registries:
    - api_url: https://000000000000.dkr.ecr.us-east-2.amazonaws.com
      credentials: ext:/scripts/ecr-login-us-east-2.sh
      credsexpire: 11h
      name: ECR
      prefix: 000000000000.dkr.ecr.us-east-2.amazonaws.com

# configmap/argocd-image-updater-authscripts
# ...
data:
  ecr-login-us-east-2.sh: |
    #!/bin/sh
    aws ecr --region 'us-east-2' get-authorization-token --cli-read-timeout 5 --cli-connect-timeout 5 --output text --query 'authorizationData[].authorizationToken' | base64 -d

This usually works on startup, and sometimes after credsexpire, but it also often fails with:

Could not set registry endpoint credentials: error executing /scripts/ecr-login-us-east-2.sh: /scripts/ecr-login-us-east-2.sh failed timeout after 10s

Sometimes this can take hours of retries to rectify and sometimes nothing short of killing the pod and starting a new one will fix it.

It's also odd that it seems to run this script once for each app in its update cycle (see logs below) rather than just once, seeing as we've configured it at the registry level.

To Reproduce

Set up as above. Unfortunately, this is intermittent.
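Since the logs show ten executions of the script starting within the same second, a rough way to probe this by hand is to launch several copies at once and count how many fail or exceed a time limit. The following is only a sketch: `run_concurrently` is a hypothetical helper name, and it assumes a `timeout` command (coreutils or BusyBox) is available wherever it is run.

```shell
#!/bin/sh
# Sketch: start N copies of a command in parallel and report how many
# failed or ran longer than LIMIT seconds.
# run_concurrently is a hypothetical helper, not part of argocd-image-updater.
# Usage: run_concurrently N LIMIT command [args...]

run_concurrently() {
  n="$1"
  limit="$2"
  shift 2
  results=$(mktemp)
  i=0
  while [ "$i" -lt "$n" ]; do
    (
      # timeout exits non-zero if the command fails or is killed at the limit
      if timeout "$limit" "$@" >/dev/null 2>&1; then
        echo ok >> "$results"
      else
        echo fail >> "$results"
      fi
    ) &
    i=$((i + 1))
  done
  wait  # block until every background run has finished
  failed=$(grep -c fail "$results" || true)
  rm -f "$results"
  echo "$failed of $n runs failed or exceeded ${limit}s"
}
```

For example, `run_concurrently 10 10 /scripts/ecr-login-us-east-2.sh` from a shell inside the pod would mimic the ten parallel invocations with the same 10s budget. It may help show whether parallel invocations alone are enough to push a run past the limit, but it does not reproduce the Image Updater's own execution path.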

Expected behavior

The script runs correctly (once) and stores the new token for all apps to use.
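Until the updater itself behaves this way, something similar can be approximated on the script side by caching the credentials in a file and reusing them while they are fresh. This is a minimal sketch, not a confirmed fix: `CACHE_FILE`, `MAX_AGE_SECONDS`, `get_creds` and the file name `ecr-login-cached.sh` are hypothetical, and it assumes the container has a writable `/tmp` and a `stat` that supports `-c %Y` (GNU or BusyBox).

```shell
#!/bin/sh
# Sketch: reuse ECR credentials from a cache file while they are fresh,
# so repeated invocations do not each call the AWS API.
# CACHE_FILE, MAX_AGE_SECONDS and get_creds are hypothetical names.

CACHE_FILE="${CACHE_FILE:-/tmp/ecr-creds-us-east-2}"
MAX_AGE_SECONDS="${MAX_AGE_SECONDS:-3600}"

# Same command as the original auth script; prints the decoded token.
fetch_creds() {
  aws ecr --region 'us-east-2' get-authorization-token \
    --cli-read-timeout 5 --cli-connect-timeout 5 \
    --output text --query 'authorizationData[].authorizationToken' | base64 -d
}

# Succeeds only if the cache file is non-empty and younger than MAX_AGE_SECONDS.
cache_is_fresh() {
  [ -s "$CACHE_FILE" ] || return 1
  now=$(date +%s)
  mtime=$(stat -c %Y "$CACHE_FILE" 2>/dev/null) || return 1
  [ $((now - mtime)) -lt "$MAX_AGE_SECONDS" ]
}

get_creds() {
  if cache_is_fresh; then
    cat "$CACHE_FILE"
    return 0
  fi
  tmp="$CACHE_FILE.$$"
  # Write to a private temp file first, then rename, so readers never
  # see a partially written cache.
  if (umask 077; fetch_creds > "$tmp") && [ -s "$tmp" ]; then
    mv "$tmp" "$CACHE_FILE"
    cat "$CACHE_FILE"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Run only when invoked under the (hypothetical) script name, not when sourced.
case "${0##*/}" in
  ecr-login-cached.sh) get_creds ;;
esac
```

Note the trade-offs: the credentials end up on disk (readable only by the owning user because of the `umask`), and invocations that all start with a cold or expired cache will still each call AWS, since nothing here serializes them.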

Additional context

N/A

Version

0.12.0

Logs

2023-12-20T14:11:38+00:00	time="2023-12-20T14:11:38Z" level=info msg="Processing results: applications=4 images_considered=3 images_skipped=1 images_updated=0 errors=1"
2023-12-20T14:11:38+00:00	time="2023-12-20T14:11:38Z" level=info msg="Starting image update cycle, considering 2 annotated application(s) for update"
2023-12-20T14:11:39+00:00	time="2023-12-20T14:11:39Z" level=info msg="Processing results: applications=2 images_considered=2 images_skipped=0 images_updated=0 errors=0"
2023-12-20T14:12:09+00:00	time="2023-12-20T14:12:09Z" level=info msg="Starting image update cycle, considering 2 annotated application(s) for update"
2023-12-20T14:12:10+00:00	time="2023-12-20T14:12:10Z" level=info msg="Processing results: applications=2 images_considered=2 images_skipped=0 images_updated=0 errors=0"
2023-12-20T14:12:30+00:00	{"log":"time=\"2023-12-20T14:12:30Z\" level=info msg=\"Starting image update cycle, considering 3 annotated application(s) for update\"\n","stream":"stdout","time":"2023-12-20T14:12:30.44525427Z"}
2023-12-20T14:12:31+00:00	{"log":"time=\"2023-12-20T14:12:31Z\" level=info msg=\"Processing results: applications=3 images_considered=2 images_skipped=2 images_updated=0 errors=0\"\n","stream":"stdout","time":"2023-12-20T14:12:31.748061081Z"}
2023-12-20T14:12:41+00:00	time="2023-12-20T14:12:41Z" level=info msg="Starting image update cycle, considering 19 annotated application(s) for update"
2023-12-20T14:12:41+00:00	time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=7dddb
2023-12-20T14:12:41+00:00	time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=33a72
2023-12-20T14:12:41+00:00	time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=2a859
2023-12-20T14:12:41+00:00	time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=e6515
2023-12-20T14:12:41+00:00	time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=dce93
2023-12-20T14:12:41+00:00	time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=d9146
2023-12-20T14:12:41+00:00	time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=508d5
2023-12-20T14:12:41+00:00	time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=68554
2023-12-20T14:12:41+00:00	time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=3c106
2023-12-20T14:12:41+00:00	time="2023-12-20T14:12:41Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=d7263
2023-12-20T14:12:51+00:00	time="2023-12-20T14:12:51Z" level=error msg="`/scripts/ecr-login-us-east-2.sh` failed timeout after 10s" execID=7dddb
2023-12-20T14:12:51+00:00	time="2023-12-20T14:12:51Z" level=error msg="Could not set registry endpoint credentials: error executing /scripts/ecr-login-us-east-2.sh: `/scripts/ecr-login-us-east-2.sh` failed timeout after 10s" alias=report-subscription-event-producer application=report-subscription-event-producer image_name=gitlab/mintel/core-services/report-subscription-event-producer image_tag=ebdfe4eccab090c0d5a60a3bd4aae4aa7b8c3ae2-test registry=000000000000.dkr.ecr.us-east-2.amazonaws.com
2023-12-20T14:12:51+00:00	time="2023-12-20T14:12:51Z" level=info msg=/scripts/ecr-login-us-east-2.sh dir= execID=b2204
2023-12-20T14:12:51+00:00	time="2023-12-20T14:12:51Z" level=error msg="`/scripts/ecr-login-us-east-2.sh` failed timeout after 10s" execID=2a859
2023-12-20T14:12:51+00:00	time="2023-12-20T14:12:51Z" level=error msg="Could not set registry endpoint credentials: error executing /scripts/ecr-login-us-east-2.sh: `/scripts/ecr-login-us-east-2.sh` failed timeout after 10s" alias=ataccama-event-bridge application=ataccama-event-bridge image_name=gitlab/mintel/data-warehouse/agents/reference-data/ataccama-event-bridge image_tag="sha256:80c37d6719f3f2fd3e24a5264e2e1fbf1e37cf06a308f379db88ca55639ae498" registry=000000000000.dkr.ecr.us-east-2.amazonaws.com

bcbrockway • Dec 22 '23 10:12

Setting --max-concurrency to 1 works for me, although I don't know exactly how this fixes the problem 😅 https://argocd-image-updater.readthedocs.io/en/stable/install/reference/#flags

extraArgs:
  - --max-concurrency
  - "1"

PuChenTW • Jan 02 '24 09:01

Some of our ArgoCD instances have a lot of apps so this would slow us down quite a bit :(

bcbrockway • Jan 05 '24 11:01

This still seems to happen even with --max-concurrency set to 1. Is this still happening to anyone else? Where is the 10s timeout being set and can it be extended?

Could it be that when one call fails, invalid token data gets cached for the lifetime of credsexpire?

tareks • Aug 16 '24 21:08