kube-image-keeper
Image caching fails due to a missing pull secret from a deleted namespace
kube-image-keeper: v1.10.0
Here is our setup:
- Deploy kube-image-keeper with latest v1.10.0
- Deploy a deployment using an image from a private gcr registry: gcr.io/myproject/test-image:0.0.1, while providing image pull secret "gcr-myproject-pullsecret" in namespace "test-1"
- After the image gcr-io-myproject-test-image-0.0.1 is fully cached, delete the deployment, and delete the namespace "test-1"
- Deploy a deployment using image: gcr.io/myproject/test-image:0.0.2, using the same image pull secret content in namespace "test-2".
- The pod fails to deploy, because the Repository.kuik.enix.io object is still referencing the non-existing "test-1/gcr-myproject-pullsecret" pull secret when caching the new image tag for the same repository.
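For illustration, the stale state on the cluster-scoped object might look roughly like this (the API version, object name, and field names are assumptions sketched from the Repository CRD, not copied from a real cluster):

```yaml
apiVersion: kuik.enix.io/v1alpha1
kind: Repository
metadata:
  name: gcr.io-myproject-test-image   # hypothetical generated name
spec:
  name: gcr.io/myproject/test-image
  pullSecretNames:
    - gcr-myproject-pullsecret
  pullSecretsNamespace: test-1        # namespace "test-1" no longer exists
```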
Our clusters are not hosted in the cloud, so we don't know if the recent PR #428 is going to help us. Question: is there a way to specify / inject a global pull secret for the entire cluster for repositories matching certain prefixes? We don't mind injecting it into the kuik deployment if that is allowed.
Hello,
#428 will not help you on this one. And indeed it's a bug; I think it will not be too hard to fix, I will try to work on it during this week.
Concerning your question, there is currently no way to do what you ask. You can still use pull secrets attached to a service account, and they will be used for pods using this service account, but I understand that it doesn't exactly achieve what you are trying to do. #385 asked for a similar feature, and while adding an option for a global ImagePullSecret is not something we intend to do, I find the idea of injecting pull secrets for repositories with a specific prefix interesting. But I'm still not sure it is in the scope of kuik. Maybe a Kyverno policy would be enough?
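As a sketch of the service-account workaround (names are placeholders), attaching the pull secret to a ServiceAccount makes Kubernetes add it to every pod that runs under that ServiceAccount:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: private-registry-sa        # placeholder name
  namespace: test-2
imagePullSecrets:
  - name: gcr-myproject-pullsecret # must exist in the same namespace
```

Pods would then opt in with `spec.serviceAccountName: private-registry-sa`; this still only covers pods in namespaces where the secret exists.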
For instance something like this (generated with ChatGPT):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-pull-secret-to-cachedimage
spec:
  rules:
    - name: add-pull-secret
      match:
        resources:
          kinds:
            - CachedImage
      preconditions:
        all:
          - key: "{{ request.object.metadata.labels['kuik.enix.io/repository'] }}"
            operator: In
            value: ["registry.k8s.io-kube-state-metrics-kube-state-metrics"]
      mutate:
        patchStrategicMerge:
          spec:
            imagePullSecrets:
              - name: my-pull-secret # Replace with the desired pull secret name
```
Hello, I cannot reproduce, or at least not in the way I thought.
Is your issue about the pull secret not being found and thus the program crashes or something nasty like this? Or is it about the pull of the image not being authenticated because the secret is missing?
If you delete a secret, the 2nd option is totally expected and there is nothing we can do about it: no secret => no authentication => no pull of the image. Otherwise (1st option) it is a bug, but I currently can't reproduce it; attaching related logs here could be helpful.
Hi @paullaffitte, I believe the 2nd scenario is what is happening here.
This is actually a failure scenario that the current mechanism does not cover: Repository.kuik.enix.io is a cluster-scoped object, but it refers to a namespace-scoped pull secret.
To the cluster admin, kuik is supposed to be transparent, so the admin will not be aware that once a repository from the private registry has been used, the pull secret in the original namespace cannot be removed. Otherwise, just like you said, even if the same pull secret is provided in a different namespace, the kuik operator will not be able to pull a different version of an image from the same private registry, because the Repository.kuik.enix.io object is still looking for the pull secret in the now-deleted namespace, not in the new namespace where the same (or a different) pull secret is stored.
As for my suggestion:
Is there a way to specify / inject a global pull secret for the entire cluster for repositories matching certain prefixes? We don't mind injecting it into the kuik deployment if that is allowed.
Instead of using the pull secret from the namespace where the Repository.kuik.enix.io object was first created, the pull secret would come from a namespace that is not going to be deleted unexpectedly, thus solving the problem posed by this failure scenario.
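A minimal sketch of the idea, assuming the Repository CRD exposes the secret reference via `pullSecretNames` / `pullSecretsNamespace` fields (an assumption; kuik does not currently let you override this): the secret would live in a stable, admin-managed namespace instead of a workload namespace.

```yaml
apiVersion: kuik.enix.io/v1alpha1
kind: Repository
metadata:
  name: gcr.io-myproject-test-image   # hypothetical generated name
spec:
  name: gcr.io/myproject/test-image
  pullSecretNames:
    - gcr-myproject-pullsecret
  pullSecretsNamespace: kuik-system   # stable namespace, never deleted with workloads
```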
I've tested on the latest version 1.11.0, and it seems the problem is not there anymore. I'll close it for now.
Hello,
I was writing an internal note about the recent crashes in kuik and how we handled them when I stumbled upon this issue. On a second reading I finally understood your issue, and I can confirm that it is unfortunately not fixed at all. Here are steps to reproduce:
- `kubectl create namespace namespace-a && kubectl create namespace namespace-b`
- create a pull secret `secret-a` in `namespace-a`
- create a pod with image `https://private-registry/image:version-a` in `namespace-a` with `secret-a`
- `kubectl delete ns namespace-a`
- create a pod with image `https://private-registry/image:version-b` in `namespace-b` without a secret
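For illustration, the two pods from these steps might look like this (all names and the registry host are placeholders; note that Kubernetes image references do not carry an `https://` scheme):

```yaml
# Pod in namespace-a pulling the private image with secret-a.
apiVersion: v1
kind: Pod
metadata:
  name: app-a
  namespace: namespace-a
spec:
  imagePullSecrets:
    - name: secret-a        # pull secret in the same namespace
  containers:
    - name: app
      image: private-registry/image:version-a
---
# Pod in namespace-b without any pull secret; caching version-b fails
# because the Repository still references secret-a from the deleted
# namespace-a.
apiVersion: v1
kind: Pod
metadata:
  name: app-b
  namespace: namespace-b
spec:
  containers:
    - name: app
      image: private-registry/image:version-b
```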
Here it fails because image:version-b is not cached yet and secret-a (which is referenced by the Repository CRD) has been deleted alongside namespace-a; starting a pod with image:version-a would have worked since that image has already been cached. However, creating a pod with image https://private-registry/image:version-b in namespace-b with an existing secret (say, secret-b from namespace-b) would work, since the secret reference in the Repository CRD would be updated with the new value.
The real issue here is the kuik controller log, which is very unclear:
"error": "GET https://index.docker.io/v2/library/redacted-image-repo-problem/manifests/tag: UNAUTHORIZED: authentication required; [map[Action:pull Class: Name:library/redacted-image-repo-problem Type:repository]]"
And in the CachedImage event
Failed to cache image REDACTED/plaffitt-dev/nginx:1.25-alpine-4, reason: GET https://REDACTED/v2/plaffitt-dev/nginx/manifests/1.25-alpine-4: UNAUTHORIZED: unauthorized to access repository: plaffitt-dev/nginx, action: pull: unauthorized to access repository: plaffitt-dev/nginx, action: pull
And maybe also the missing event in the Repository. All of that makes the situation very hard to read.