EKS connectivity during tag to digest resolution

Open dprotaso opened this issue 7 months ago • 10 comments

Hi, thanks for pushing this fix. I had this issue, and when I tried to upgrade to 1.18.0, I now see the following instead, which also appears to be a tag resolution issue:

    Message:                     Revision "donny-helloworld-00001" failed with message: Unable to fetch image "111111111111.ecr.us-east-1.amazonaws.com/testing:dg": failed to resolve image to digest: Get "https://111111111111.ecr.us-east-1.amazonaws.com/v2/": context deadline exceeded.

Originally posted by @dongreenberg in #15778

dprotaso avatar May 09 '25 01:05 dprotaso

@dongreenberg Following up on your comment

Given

Get "https://111111111111.ecr.us-east-1.amazonaws.com/v2/": context deadline exceeded.

Are other pods in your cluster able to connect to your registry?
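One way to check this from inside the cluster is to request the same `/v2/` endpoint that appears in the error, with an explicit timeout. The following is only a minimal sketch, not Knative's resolver: the helper names `registry_v2_url` and `probe_registry` are hypothetical, and the host-splitting rule assumes the image reference begins with a registry hostname, as it does in the error above. An HTTP error such as 401 is treated as "reachable", since it still shows the network path to the registry works.

```python
import urllib.error
import urllib.request


def registry_v2_url(image: str) -> str:
    """Build the /v2/ base URL for an image reference that starts with a registry host."""
    host = image.split("/", 1)[0]
    return f"https://{host}/v2/"


def probe_registry(image: str, timeout: float = 10.0) -> str:
    """Describe whether the registry answered within the timeout (hypothetical helper)."""
    url = registry_v2_url(image)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"reachable (HTTP {resp.status})"
    except urllib.error.HTTPError as err:
        # An HTTP error response (e.g. 401) still proves the network path works.
        return f"reachable (HTTP {err.code})"
    except OSError as err:
        # Covers DNS failures, refused connections, and timeouts.
        return f"unreachable: {err}"
```

Calling `probe_registry("111111111111.ecr.us-east-1.amazonaws.com/testing:dg")` from a pod in the same namespace as the Knative controller would be the closest comparison, since that is where the resolution request originates.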

dprotaso avatar May 09 '25 01:05 dprotaso

Indeed. If I create the pod manually with kubectl run it works fine, and if I create the ksvc with the exact sha256 it works as well. It fails only when I use a tag in the ksvc. When I try disabling resolution in the config-deployment configmap, it strangely still doesn't work and fails with the same error.

dongreenberg avatar May 09 '25 02:05 dongreenberg

I don't have an AWS account to test with. We've debugged scenarios like this before, e.g. with AKS and GitLab.

I dunno if you're able to modify this to help with debugging - https://github.com/knative/serving/issues/12761#issuecomment-1111218009

Alternatively, if you have a cluster I can poke at let me know in slack (CNCF Slack - #knative-serving)

When I try disabling resolution in the config-deployment configmap it still strangely doesn't work, with the same error.

My guess is that it's not skipping the tag to digest resolution. How are you configuring it, and what are you putting in the ksvc?

dprotaso avatar May 09 '25 14:05 dprotaso

For some reason, creating an ImagePullSecret and attaching it to the service account used in my ksvc fixes the issue. However, the node IAM role attached to every node already has full ECR access, as does the IAM role attached to the service account. Only adding the ImagePullSecret does the trick. Maybe this layering of permissions is creating issues, but thinking in terms of where the actual request is made to do the tag resolution, do you have a sense of why this would be? Which k8s service account is used to actually make the resolution request/s, and does it use the pull secrets to do it?

dongreenberg avatar May 19 '25 05:05 dongreenberg

Maybe this layering of permissions is creating issues, but thinking in terms of where the actual request is made to do the tag resolution, do you have a sense of why this would be? Which k8s service account is used to actually make the resolution request/s, and does it use the pull secrets to do it?

The Knative Controller that runs as a pod is doing the tag to digest resolution. So the pod would need access to that metadata server running on the node.

One alternative that I've tested in the past was creating a fine-grained policy and associating it with the service account: https://github.com/knative/serving/issues/9477#issuecomment-978768859

In the above there's a sample app you can run to help with debugging (I don't have access to an env). I'm guessing the metadata service isn't accessible by your Pod?
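A quick way to test that guess is to request an IMDSv2 session token from inside a pod. This is a minimal sketch under stated assumptions: it assumes the nodes expose the EC2 instance metadata service at the standard link-local address and use IMDSv2, and the helper names `build_imds_token_request` and `imds_reachable` are hypothetical. Whether the controller depends on the metadata service at all depends on how credentials are provided in the cluster.

```python
import urllib.request

# Standard EC2 instance metadata (IMDSv2) token endpoint.
IMDS_TOKEN_URL = "http://169.254.169.254/latest/api/token"


def build_imds_token_request(ttl_seconds: int = 60) -> urllib.request.Request:
    """Build the IMDSv2 session-token request (a PUT carrying a TTL header)."""
    return urllib.request.Request(
        IMDS_TOKEN_URL,
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def imds_reachable(timeout: float = 2.0) -> bool:
    """Return True if the metadata service hands out a token within the timeout."""
    try:
        with urllib.request.urlopen(build_imds_token_request(), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return False
```

If `imds_reachable()` returns `False` in a pod but `True` on the node itself, that would point at the pod's network path to the metadata service rather than at IAM permissions.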

dprotaso avatar May 19 '25 14:05 dprotaso

@dongreenberg how are you setting up the EKS cluster, registry, etc.?

dprotaso avatar May 20 '25 13:05 dprotaso

It hasn't been explicitly said in this PR how to disable the resolution.

Adding the following to your config-deployment is the workaround (and it works for me):

    registries-skipping-tag-resolving: public.ecr.aws,xxxxx.dkr.ecr.ca-west-1.amazonaws.com
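To make it clearer which images a value like this would cover, here is a rough sketch of matching an image against the comma-separated list. It is an illustration, not Knative's code: the function names are hypothetical, the match is assumed to be an exact comparison on the registry hostname, and the fallback to `index.docker.io` for references without a registry host is also an assumption.

```python
def parse_skip_list(value: str) -> set[str]:
    """Split the comma-separated config value into a set of registry hosts."""
    return {entry.strip() for entry in value.split(",") if entry.strip()}


def registry_of(image: str) -> str:
    """Return the registry host of an image reference (simplified rule)."""
    first, sep, _ = image.partition("/")
    # Treat the first path component as a host only if it looks like one.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    # Assumed default for references such as "nginx:latest".
    return "index.docker.io"


def skips_resolution(image: str, value: str) -> bool:
    """True if the image's registry appears in the skip list (exact host match)."""
    return registry_of(image) in parse_skip_list(value)
```

Under these assumptions the registry host has to appear in the list exactly, so an image from a different account or region would not be skipped.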

treyhyde avatar Jul 23 '25 20:07 treyhyde

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Oct 22 '25 01:10 github-actions[bot]

/lifecycle frozen

Sadly still waiting on an AWS account from CNCF :/

A contributor also included some docs about using Pod Identity here: https://github.com/knative/docs/pull/6369/files

dprotaso avatar Oct 22 '25 02:10 dprotaso

@dongreenberg would you mind trying out the Pod Identity approach as a workaround?

dprotaso avatar Oct 22 '25 02:10 dprotaso