EKS connectivity during tag-to-digest resolution
Hi, thanks for pushing this fix. I had this issue, and when I tried to upgrade to 1.18.0, I now see the following instead, which also appears to be a tag resolution issue:
Message: Revision "donny-helloworld-00001" failed with message: Unable to fetch image "111111111111.ecr.us-east-1.amazonaws.com/testing:dg": failed to resolve image to digest: Get "https://111111111111.ecr.us-east-1.amazonaws.com/v2/": context deadline exceeded.
Originally posted by @dongreenberg in #15778
@dongreenberg Following up on your comment
Given
Get "https://111111111111.ecr.us-east-1.amazonaws.com/v2/": context deadline exceeded.
Are other pods in your cluster able to connect to your registry?
Indeed. If I create the pod manually with kubectl run it works fine, and if I create the ksvc with the exact sha256 digest it works as well. It only fails when I use a tag in the ksvc. When I try disabling resolution in the config-deployment configmap it strangely still doesn't work, with the same error.
I don't have an AWS account to test with - we've debugged scenarios like this before, e.g. AKS & GitLab.
I don't know if you're able to modify this to help with debugging - https://github.com/knative/serving/issues/12761#issuecomment-1111218009
Alternatively, if you have a cluster I can poke at let me know in slack (CNCF Slack - #knative-serving)
When I try disabling resolution in the config-deployment configmap it still strangely doesn't work, with the same error.
My guess is it's not skipping the tag-to-digest resolution - how are you configuring it, and what are you putting in the ksvc?
For some reason creating an ImagePullSecret and attaching it to the service account used in my ksvc fixes the issue. However, the node IAM role attached to every node already has full ECR access, as does the IAM role attached to the service account. Only adding the ImagePullSecret does the trick. Maybe this layering of permissions is creating issues, but thinking in terms of where the actual request is made to do the tag resolution, do you have a sense of why this would be? Which k8s service account is used to actually make the resolution request(s), and does it use the pull secrets to do it?
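For anyone hitting the same thing, a sketch of the ImagePullSecret workaround described above. All names (secret, namespace, service account) are placeholders; the registry host is the one from the error message, and the ECR token is short-lived (roughly 12 hours), so it would need refreshing:

```shell
# Get a short-lived ECR token and store it as a docker-registry pull secret.
TOKEN=$(aws ecr get-login-password --region us-east-1)
kubectl create secret docker-registry ecr-pull-secret \
  --namespace default \
  --docker-server=111111111111.ecr.us-east-1.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$TOKEN"

# Attach the secret to the service account referenced by the ksvc
# (here the "default" service account in the "default" namespace).
kubectl patch serviceaccount default -n default \
  -p '{"imagePullSecrets": [{"name": "ecr-pull-secret"}]}'
```

This works because the tag-to-digest resolver falls back to the pull secrets on the ksvc's service account, independently of any node- or pod-level IAM.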
Maybe this layering of permissions is creating issues, but thinking in terms of where the actual request is made to do the tag resolution, do you have a sense of why this would be? Which k8s service account is used to actually make the resolution request/s, and does it use the pull secrets to do it?
The Knative controller, which runs as a pod, does the tag-to-digest resolution. So that pod would need access to the metadata server running on the node.
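One hedged way to test this from the controller's side: run an ephemeral debug container next to the controller pod (the controller image itself may not ship curl) and try to reach the registry directly. A common EKS gotcha is the IMDSv2 hop limit being 1, which blocks pods from reaching the instance metadata service at all:

```shell
# Find the Knative controller pod.
POD=$(kubectl -n knative-serving get pods -l app=controller \
  -o jsonpath='{.items[0].metadata.name}')

# Attach an ephemeral debug container (kubectl >= 1.23) and probe the
# registry endpoint that fails during tag resolution.
kubectl -n knative-serving debug "$POD" -it --image=curlimages/curl -- \
  curl -sv --max-time 5 https://111111111111.ecr.us-east-1.amazonaws.com/v2/
```

If this also times out, the problem is pod-level connectivity or credentials from the controller, not anything in the ksvc itself.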
One alternative that I've tested in the past was associating a fine-grained policy with the service account: https://github.com/knative/serving/issues/9477#issuecomment-978768859
In the above there's a sample app you can run to help with debugging (I don't have access to an env). I'm guessing the metadata service isn't accessible from your pod?
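For reference, the IRSA approach from the linked comment amounts to annotating the controller's service account with an IAM role that can read from ECR, then restarting the controller so it picks up the projected web identity token. The role ARN below is a placeholder; the role itself must exist and trust the cluster's OIDC provider:

```shell
# Annotate the Knative controller service account with the IAM role
# (IRSA). The role must grant ECR read access, e.g. AmazonEC2ContainerRegistryReadOnly.
kubectl -n knative-serving annotate serviceaccount controller \
  eks.amazonaws.com/role-arn=arn:aws:iam::111111111111:role/knative-ecr-read

# Restart so the webhook injects the web identity token volume.
kubectl -n knative-serving rollout restart deployment controller
```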
@dongreenberg how are you setting up the EKS cluster, registry etc
It hasn't been explicitly said in this PR how to disable the resolution.
Adding the following to your config-deployment ConfigMap is the workaround (and it works for me):
registries-skipping-tag-resolving: public.ecr.aws,xxxxx.dkr.ecr.ca-west-1.amazonaws.com
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/lifecycle frozen
Sadly still waiting on an AWS account from CNCF :/
A contributor also included some docs about using Pod Identity here: https://github.com/knative/docs/pull/6369/files
@dongreenberg do you mind trying out the Pod Identity approach as a workaround?