
Not replacing pruned images

Open ppapp92 opened this issue 1 year ago • 18 comments

We're currently using k8s-image-swapper to proxy all the images in our cluster into ECR. I have set up a lifecycle policy that deletes images 30 days after they are pushed.

However, once an image is pruned by ECR, k8s-image-swapper won't re-push it when the image is requested again, which causes a pull error because the image no longer exists in ECR. The only fix I've found is to restart the k8s-image-swapper pods, after which the image is pulled and pushed to ECR correctly.

ppapp92 avatar Nov 19 '24 20:11 ppapp92

@ppapp92 The reason for this behaviour is that k8s-image-swapper caches the existence of that image for a period of time (24h 🤔) to reduce strain on ECR:

https://github.com/estahn/k8s-image-swapper/blob/e817e04def613087fa02a3b0d46c47173f63aba0/pkg/registry/ecr.go#L234-L239

Restarting k8s-image-swapper purges the in-memory cache, so it re-checks the registry and discovers the image no longer exists.

ECR doesn't signal when its lifecycle policy runs, so as far as I know, polling for that information is the only option.

Possible options to resolve your issue:

  1. Change the ECR lifecycle policy to retain the last image, and have a separate process check whether each image is still in use and clean it up if not.
  2. Disable the in-memory cache - this feature would need to be added.
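For option 1, a lifecycle policy that keeps only the most recent image might look like the following (a sketch using ECR's documented policy JSON; adjust `countNumber` and add higher-priority rules for your own tagging scheme):

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire everything except the most recent image",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": 1
      },
      "action": { "type": "expire" }
    }
  ]
}
```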

estahn avatar Nov 19 '24 22:11 estahn

@estahn Hello!

Today we spent a couple of hours debugging an issue where one of the images in our Google Artifact Registry had been removed by our cleanup policy (which we have now made less aggressive). We had GKE nodes that were unable to start because of this exact same issue (the coincidence of this being the most recent issue when I looked gave us a good laugh). Having the ability to either disable the in-memory cache or tune its duration (maybe 1-2 hours?) would be awesome. Add us to the list of people wanting more options around this!

edit: I've given this a shot in #847, let me know what you think!

ahatzz11 avatar Nov 26 '24 02:11 ahatzz11

Just a perspective from someone that just started using this project:

It seems like the underlying challenge is that there's no bridge between the external purge management (ECR policies and whatnot) and the internal cache that is kept in memory.

I had a conversation with one of my teammates about this, and having built a few Kubernetes Operators in my lifetime, it seems that two things are missing to make this issue go away completely:

  • An internal goroutine, triggered periodically, that warms up the cache from scratch from the external registry
  • A way to detect images that have been swapped but are no longer present in the remote registry, by listening to pod events (more specifically, a pod state of "ImagePullBackOff")
    • By listening to these events, you could even evict an affected pod to ensure no long-term disruption occurs
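The second idea could be sketched roughly as follows, in plain Go with stand-in types (a real implementation would use client-go pod informers and the actual `k8s.io/api/core/v1` types rather than these hypothetical structs):

```go
package main

import "fmt"

// podEvent is a simplified stand-in for what a client-go pod informer
// would deliver; the real container waiting reason lives in the pod's
// status (ContainerStatuses[i].State.Waiting.Reason).
type podEvent struct {
	Image  string
	Reason string // e.g. "ImagePullBackOff"
}

// cache is an illustrative stand-in for the swapper's existence cache.
type cache map[string]bool

// invalidateOnPullFailure drops the cache entry for an image whose pod
// reports a pull failure, so the next webhook request re-checks the
// registry and re-pushes the image if needed. It reports whether an
// entry was actually evicted.
func invalidateOnPullFailure(c cache, ev podEvent) bool {
	switch ev.Reason {
	case "ImagePullBackOff", "ErrImagePull":
		if c[ev.Image] {
			delete(c, ev.Image)
			return true
		}
	}
	return false
}

func main() {
	c := cache{"123.dkr.ecr.us-east-1.amazonaws.com/app:v1": true}
	ev := podEvent{
		Image:  "123.dkr.ecr.us-east-1.amazonaws.com/app:v1",
		Reason: "ImagePullBackOff",
	}
	fmt.Println(invalidateOnPullFailure(c, ev)) // true (entry evicted)
	fmt.Println(invalidateOnPullFailure(c, ev)) // false (already gone)
}
```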

With these two processes in place, there's very little need for any kind of TTL that attempts to keep the two sources (the cache and the external OCI registry) in sync. On the other hand, this would bring the project surprisingly close to a fully fledged operator, as it would have to use libraries such as https://github.com/kubernetes/client-go/tree/master/informers and possibly https://github.com/kubernetes-sigs/controller-runtime.

This is a very interesting project, so I gotta say congrats for the initiative :smiley: .

migueleliasweb avatar Nov 26 '24 23:11 migueleliasweb

@ahatzz11 I do appreciate the initiative and time you've taken to contribute a PR. The PR itself looks great (code, docs, tests, etc.) and is something we could possibly merge. However, I think this is going in the wrong direction: whatever TTL the cache is set to, the aforementioned failure will eventually occur, because the "cleaning" process runs at an unpredictable time (at least for ECR).

The diagram below hopefully shows that deleting an image from the registry will always cause a failure due to the internal cache, regardless of the TTL.

@migueleliasweb Awesome! I really like the idea of using ImagePullBackOff as a signal to invalidate the cache for a particular image. This would also avoid further dependency on the image registry or possible cleaning policies and keeps everything within the "Kubernetes world".

```mermaid
sequenceDiagram
    Kubernetes API->>+k8s-image-swapper: 1. Webhook request
    alt Image exists according to cache?
    else Image exists according to image registry?
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists
    else
        k8s-image-swapper->>Image Registry: Push image to registry
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists
    end
    k8s-image-swapper->k8s-image-swapper: Replace image in webhook response
    k8s-image-swapper->>-Kubernetes API: 1. Webhook response
    Image Registry "Cleaner"->>Image Registry: Delete images
    Kubernetes API->>+k8s-image-swapper: 2. Webhook request
    alt Image exists according to cache?
    else Image exists according to image registry?
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists
    else
        k8s-image-swapper->>Image Registry: Push image to registry
        k8s-image-swapper->k8s-image-swapper: Update internal cache that image exists
    end
    k8s-image-swapper->k8s-image-swapper: Replace image in webhook response
    k8s-image-swapper->>-Kubernetes API: 2. Webhook response
```

estahn avatar Nov 27 '24 00:11 estahn

@estahn I love those flowcharts from MermaidJS :star_struck:

However, I think this is going in the wrong direction: whatever TTL the cache is set to, the aforementioned failure will eventually occur, because the "cleaning" process runs at an unpredictable time (at least for ECR).

The diagram below hopefully shows that deleting an image from the registry will always cause a failure due to the internal cache, regardless of the TTL.

That's precisely my point. Mitigations can be made, but there will always be a bit of delay/disruption if we go this direction.

Ps: Do you think that, by listening to pod events and periodically performing a full sync from the ECR registry, this problem would vanish? I personally think so.

Ps2: Is there a reason not to cache the whole ECR repo's tags (maybe with an option for a filter)? Even with many thousands of images/tags as strings, something like a sync.Map would handle it in under a hundred megs, I reckon. Maybe that's an option too?
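As a rough illustration of the Ps2 idea, holding the full tag set in a sync.Map might look like this (names are made up for the sketch; a real sync would page through ECR's ListImages API to build the tag list):

```go
package main

import (
	"fmt"
	"sync"
)

// tagCache holds every known "repo:tag" of the remote registry.
// sync.Map suits this read-heavy, rarely-rewritten access pattern.
type tagCache struct{ m sync.Map }

// Replace swaps in a freshly synced tag set: current tags are (re)stored
// and any key missing from the new set is deleted.
func (t *tagCache) Replace(tags []string) {
	current := make(map[string]struct{}, len(tags))
	for _, tag := range tags {
		current[tag] = struct{}{}
		t.m.Store(tag, struct{}{})
	}
	t.m.Range(func(k, _ any) bool {
		if _, ok := current[k.(string)]; !ok {
			t.m.Delete(k)
		}
		return true
	})
}

// Has reports whether the tag was present at the last full sync.
func (t *tagCache) Has(tag string) bool {
	_, ok := t.m.Load(tag)
	return ok
}

func main() {
	var c tagCache
	c.Replace([]string{"app:v1", "app:v2"})
	fmt.Println(c.Has("app:v1")) // true
	c.Replace([]string{"app:v2"}) // v1 was pruned by the lifecycle policy
	fmt.Println(c.Has("app:v1")) // false
}
```

With a periodic full sync driving Replace, a lifecycle-policy deletion is reflected in the cache at the next sync instead of lingering for a 24h TTL.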

migueleliasweb avatar Nov 27 '24 00:11 migueleliasweb

Ps: Do you think that, by listening to pod events and periodically performing a full sync from the ECR registry, this problem would vanish? I personally think so.

I think listening to the pod event as you described and either purging the entire cache or a specific cache entry solves the issue.

I would argue it's unlikely you need all the images in the registry. The usual scenario is moving from one image version to the next, so I don't see much benefit in holding all of this in the cache. If there is a good argument to be made, I'm not against it.

Ps2: Is there a reason not to cache the whole ECR repo's tags (maybe with an option for a filter)? Even with many thousands of images/tags as strings, something like a sync.Map would handle it in under a hundred megs, I reckon. Maybe that's an option too?

There wasn't a need to, as far as I'm concerned. It also keeps things lean and efficient :)

estahn avatar Nov 27 '24 01:11 estahn

@estahn

However I think this is going in the wrong direction.

I agree 😄 Having the cache at all introduces the opportunity for it to be wrong. I figured that being able to set it to something like 30 minutes, much smaller than 24 hours, would at least make the window where it's wrong a lot shorter, maybe even unnoticeable; in our case it took us a couple of hours just to figure out what was wrong. Listening to pod events would be spectacular, but it's much more of a lift than I have time to put into it. If this is something you do, we would love to test it out!

ahatzz11 avatar Dec 03 '24 01:12 ahatzz11

A TTL is probably the easiest fix, but a more comprehensive solution makes sense long term. For our use case right now, even if it had to check ECR every time a new pod was deployed, the number of API calls would be pretty low.

ppapp92 avatar Dec 05 '24 20:12 ppapp92

@estahn What are your thoughts on offering this caching configuration as a short- to medium-term option, with the long-term plan being more tied to k8s events and truly dynamic? It would definitely help our organization.

ahatzz11 avatar Dec 09 '24 19:12 ahatzz11

@estahn Happy new year! Now that the holidays are over, I just wanted to check in on your thoughts about the cache configuration options as a short- to medium-term option before proper k8s events are in. #847 is ready to go!

ahatzz11 avatar Jan 18 '25 00:01 ahatzz11

@estahn: just checking if there's any progress on approving that PR, as it would help us too?

jgournet avatar Feb 26 '25 23:02 jgournet

Also bumping this again. It would be really useful to have this automated. Currently I need to monitor for ImagePullBackOff errors and manually restart the pods when they start appearing.

ppapp92 avatar Feb 27 '25 16:02 ppapp92

@estahn Happy new year! Now that the holidays are over, I just wanted to check in on your thoughts about the cache configuration options as a short- to medium-term option before proper k8s events are in. #847 is ready to go!

+1 for this !

We are using a version compiled from @ahatzz11's PR and it works like a charm!

tcyran avatar Apr 01 '25 12:04 tcyran

Bumping this again. It would be really nice to have this functionality in the app without needing to compile locally.

ppapp92 avatar Apr 10 '25 15:04 ppapp92

Bumping again.

ppapp92 avatar May 05 '25 20:05 ppapp92

It's a recurring problem for us as well. ECR doesn't offer a good way to design a lifecycle policy that preserves the images we want, so it currently falls to a custom hook: we pull the image from a public registry, retag it, and put a lifecycle policy on top of those tagged images to ignore them. If the PR https://github.com/estahn/k8s-image-swapper/pull/847 could be merged, and the Helm chart exposed that feature so the TTL could be adjusted as needed, it would be really helpful for everyone. @estahn, we appreciate this project, but this issue keeps recurring and is causing problems for a lot of users.

ayushgaurpseudo avatar Jun 28 '25 11:06 ayushgaurpseudo

Bumping again

ppapp92 avatar Aug 11 '25 13:08 ppapp92

@estahn It's been almost a year with this issue, and there's a PR, #847, that could resolve it.

ppapp92 avatar Oct 16 '25 14:10 ppapp92