image-automation-controller

Commits happening too frequently after v1alpha2

Open nomeelnoj opened this issue 3 years ago • 5 comments

I upgraded to 0.13 and, as such, migrated our ImageUpdateAutomation object to v1alpha2. However, I am seeing some strange behavior.

Before this update, the controller would respect the interval I set of 7m30s and only create a single git commit with all the images that were pushed. I set the time period this long because we push up to 25 images and it can take a bit of time for flux to see them all, and I wanted a single git commit.

With the new version, it seems that Flux only runs the reconciler every 7.5 min as expected, but I'm getting multiple commits in git. Did the process by which commits are made change? We run our ImageRepository objects every 1m; is that now somehow responsible for the commits? Seeing Flux commit up to 6 times during a single release just pollutes the git history.

Any help would be much appreciated!

nomeelnoj avatar May 25 '21 23:05 nomeelnoj

Hello, thank you for trying out the latest release, and taking time to report this bug.

I think what you might be seeing is the result of ddd0a8d8ed1606f20de2fe402768fc1acb84d790, which makes an automation run every time there's an update to an image policy. This is the "correct" way for it to work, in the sense that it fits the reconciliation model straightforwardly ("every time the observed world changes, make a compensating action to restore the desired state"). But I can see why you prefer the prior situation, in which changes were naturally batched.

I can also see running to a schedule might not work perfectly -- perhaps the 7.5 minute deadline falls in the middle of updating a bunch of images, and you only get some of them in a commit. Is there a better way to make sure the batching works? Maybe if you could trigger it to run only when you've pushed all your images. WDYT?
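To make the difference concrete, here is a minimal sketch of the two behaviours discussed above. It is a toy model, not the controller's actual logic: it assumes that every distinct moment at which an image policy changes triggers one automation run, and that a scheduled run commits whatever is pending at each tick. The function names and the sample timings (25 images detected over four one-minute scans) are made up for illustration.

```python
import math


def commits_event_driven(update_times):
    """Toy model: one commit per distinct moment a policy changes."""
    return len(set(update_times))


def commits_interval_driven(update_times, interval, offset=0):
    """Toy model: one commit per schedule tick with pending updates.

    Ticks fall at offset + k * interval; each update is committed at the
    first tick at or after it.
    """
    ticks = {math.ceil((t - offset) / interval) for t in update_times}
    return len(ticks)


# Hypothetical release: 25 policies pick up new tags across four scans
# (times in seconds since the first push).
updates = [60] * 5 + [120] * 8 + [180] * 7 + [240] * 5

print(commits_event_driven(updates))                          # 4
print(commits_interval_driven(updates, interval=450))         # 1
print(commits_interval_driven(updates, 450, offset=150))      # 2
```

The last call shows the scheduling caveat: when a 7.5 minute tick happens to land at 150 s, in the middle of the release, the batch is split across two commits.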

squaremo avatar May 27 '21 11:05 squaremo

Yes, 7.5 min does not always capture it, and two commits once in a while is not the end of the world. However, every single release throwing 3-5 commits creates a lot of cruft and makes it hard to track changes.

Seeing as we have so many images, can you think of any way to get them all to update at the same time? If the ImagePolicy object could refer to more than a single image repository, this issue would be resolved for us because we use the same tagging scheme across all our images.

I also don't see increasing the reconcile interval across all our image policies as a way to fix this, because each image policy will have its own reconcile loop.

When you say "trigger after we have pushed all our images", that is possible from our CI system. Are you referring to notification rules in Flux to schedule that reconciliation?

nomeelnoj avatar Jun 04 '21 22:06 nomeelnoj

Flux can push all the image updates to a different branch than the one used to sync the cluster. After all images have been updated, you could merge (with squash) into the main branch.

stefanprodan avatar Jun 05 '21 09:06 stefanprodan

I think we will try a combination of increasing the reconcile interval on the ImagePolicy objects as well as using a Notification to trigger a reconciliation of all our various policy objects. I'd like to just set the ImagePolicy objects to reconcile every 7.5 min, but since we can't have a single policy tied to more than 1 image, that could cause the same issue since they each reconcile on their own loop. Thanks for the suggestions--I'll give it a shot!

nomeelnoj avatar Jun 08 '21 18:06 nomeelnoj

Is this issue still affecting Flux users on v1beta1 image APIs?

(There were some major upgrades in Flux 0.16; users would have had to upgrade their Image resources from v1alpha1 or v1alpha2 to v1beta1.)

If you have already upgraded and are still experiencing these issues, please comment here.

I wonder if ImageUpdateAutomation resources subscribe or are notified (or if they can be) when ImageRepo changes happen, or when ImagePolicy resources have a new image for downstreams? It seems the issue would be worse when all webhooks are enabled and events are transmitted immediately, because mostly ImageRepo will be updated one at a time by CI. It almost would be desirable to "un-hook" these so there is no subscription, so as long as the images are published within a few minutes of each other, when ImageUpdateAutomation runs on a reconciliation cycle of once every 15 or 30 minutes, it would almost always capture the whole batch as a single update.
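As a rough way to quantify "almost always", the sketch below estimates how often a purely scheduled run would split a batch. It assumes, hypothetically, that the pushes of one release span a fixed window (3 minutes here, a made-up figure) and that the start of the release is uniformly random relative to the schedule; under those assumptions the chance that a tick lands inside the window is simply the window divided by the interval. Both function names are illustrative.

```python
import random


def split_probability(spread, interval):
    """Chance that a tick lands inside a push window of length `spread`,
    assuming the window starts at a uniformly random point in the cycle."""
    return min(1.0, spread / interval)


def simulate(spread, interval, trials=100_000, seed=0):
    """Monte Carlo check of split_probability under the same assumption."""
    rng = random.Random(seed)
    split = 0
    for _ in range(trials):
        start = rng.uniform(0, interval)  # time since the previous tick
        if start + spread > interval:     # next tick falls mid-release
            split += 1
    return split / trials


for interval in (7.5, 15, 30):  # minutes
    print(interval, split_probability(3, interval), simulate(3, interval))
```

Under these assumptions a 3 minute release is split about 40% of the time at a 7.5 minute interval, 20% at 15 minutes, and 10% at 30 minutes, so a longer interval helps but does not eliminate split commits.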

I don't think this can be totally solved without something like Helm that really captures the whole collection of app images in a unit, like the values file, and versions them together as in Chart.yaml with a chart version. That way you can control which images are released together, and they only get upgraded as a group together when you say a new chart version is released.

kingdonb avatar Aug 19 '21 17:08 kingdonb

Extensive changes have taken place in the controller since this was last reported. Due to lack of activity, I will be closing this issue.

pjbgf avatar Dec 14 '22 13:12 pjbgf