image-reflector-controller
image-reflector-controller copied to clipboard
Refactor reconcilers and introduce v1beta2 API
This change refactors image-reflector-controller to implement the new unified standards of flux controllers. It uses the result finalizer from https://github.com/fluxcd/pkg/pull/329 for computing the status conditions of ImageRepository and ImagePolicy objects such that they are kstatus compliant.
Following are some highlights of the changes:
- Remove v1alpha API CRDs.
- Introduce v1beta2 API which removes the old status condition readiness API and replaces them with the new
conditions.Setter
API from https://pkg.go.dev/github.com/fluxcd/pkg/[email protected]/conditions#Setter, along with other minor validation checks in the API fields. - Cleanup the
controllers/
directory by moving the code that are not directly/strictly related to reconciliation operation into theinternal/
directory, along with their tests. - Refactor the ImageRepository and ImagePolicy to use the helpers from https://pkg.go.dev/github.com/fluxcd/pkg/runtime.
- The logs, events and notifications have been standardized for useful information to be helpful in debugging.
- Introduce latest tags in ImageRepository status to list the 10 latest scanned tags.
- Introduce observed previous image in ImagePolicy status to keep track of the previous image so that it can provide update path information.
- The reconciliation business logic of both the reconcilers have been divided into smaller functions that are tested separately, out of the reconciliation loop.
- The existing controller e2e tests weren't changed much in order to use them to ensure the behavior remains the same as much as possible. They are slightly modified to check the kstatus conditions after every run.
- Cloud provider contextual login is configured per object with
.spec.provider
in ImageRepository.
The new status conditions under various scenarios can be seen in:
- ImageRepository - https://gist.github.com/darkowlzz/19b0ef234f1c68baa42365726db1028e
- ImagePolicy - https://gist.github.com/darkowlzz/bb017ba22a932fd4e7e2a4f421dbcbd2
Instructions for testing
To test this, checkout this branch and build the container image with make docker-build IMG=<image-name>
. Before deploying the container image, update/install the CRDs to v1beta2 API with make install
or apply the CRD definition manifests in config/crd/bases
. Create image objects and test.
Fix: fluxcd/image-reflector-controller#259 Fix: fluxcd/image-reflector-controller#166
@darkowlzz please delete the v1alpha CRDs, we haven't kept them around for the others controllers, so it's time to remove them here too.
For ImageRepository
In propose that we add the latest ten tags fetched from the registry. The tag list would help users debug polices by looking up .status.lastScanResult.latestTags
e.g.:
status:
canonicalImageName: ghcr.io/stefanprodan/podinfo
conditions:
- lastTransitionTime: "2022-09-12T11:59:49Z"
message: successful scan, found 34 tags
observedGeneration: 1
reason: Succeeded
status: "True"
type: Ready
lastScanResult:
scanTime: "2022-09-12T11:59:49Z"
tagCount: 34
latestTags: '6.2.0, 6.1.8, 6.1.7, 6.1.6, 6.1.5, 6.1.4, 6.1.3, 6.1.2, 6.1.1, 6.1.0, ...'
observedExclusionList:
- ^.*\.sig$
observedGeneration: 1
@darkowlzz does this PR solves https://github.com/fluxcd/image-reflector-controller/issues/166?
does this PR solves fluxcd/image-reflector-controller#166?
@stefanprodan yes. In https://github.com/fluxcd/image-reflector-controller/blob/5b6015b03766c99309a3418838c0df07b7aa3800/controllers/imagerepository_controller.go#L185-L188 it tries to only send notification when the result changes. I'll do some more testing to make sure it's handled well.
Removed alpha APIs and fixed the fuzz test failure along with some other minor changes to reduce the verbosity of the test logs from badger and test registry server.
Added a rough implementation of latest tags in ImageRepository status:
status:
canonicalImageName: ghcr.io/stefanprodan/podinfo
conditions:
- lastTransitionTime: "2022-09-13T23:23:46Z"
message: successful scan, found 34 tags
observedGeneration: 1
reason: Succeeded
status: "True"
type: Ready
lastScanResult:
latestTags: latest, 6.2.0, 6.1.8, 6.1.7, 6.1.6, 6.1.5, 6.1.4, 6.1.3, 6.1.2, 6.1.1
...
scanTime: "2022-09-13T23:48:52Z"
tagCount: 34
observedExclusionList:
- ^.*\.sig$
observedGeneration: 1
Looks like we may have to remove latest
from it.
Also added previous tag in the new ImagePolicy version events to address fluxcd/image-automation-controller#437 by including previous tag in the update event. When it's first policy apply, it results in event:
Latest image tag for 'ghcr.io/stefanprodan/podinfo' resolved to: 5.2.0
When there's an update, it results in event:
Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0 to 5.2.1
This requires introducing a new field in ImagePolicy, observedPreviousImage
to keep track of the previous image in order to prevent duplicate events on subsequent reconciliation without any actual update.
status:
conditions:
- lastTransitionTime: "2022-09-13T23:03:29Z"
message: Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0
to 5.2.1
observedGeneration: 1
reason: Succeeded
status: "True"
type: Ready
latestImage: ghcr.io/stefanprodan/podinfo:5.2.1
observedGeneration: 1
observedPreviousImage: ghcr.io/stefanprodan/podinfo:5.2.0
These are implemented in the last commit. Need some feedback and help in improving the messaging. I still need to test them more and write some more unit tests.
Looks like we may have to remove latest from it.
No we don't, we just trim the list to max 10 items without altering them.
Need some feedback and help in improving the messaging.
I suggest we do:
Latest image tag for 'ghcr.io/stefanprodan/podinfo' resolved to 5.2.0
Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0 to 5.2.1
Fix: fluxcd/image-automation-controller#437
I don't see how this would enhance the data available through https://pkg.go.dev/github.com/fluxcd/image-automation-controller/pkg/update? But, I could very well be overlooking something given my very limited Go knowledge :wink:
@tun0 you would setup alerting for ImagePolicy
objects, and those will tell you "Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0 to 5.2.1", then the ImageAutomation
alert will tell you "Image 'ghcr.io/stefanprodan/podinfo' updated from 5.2.1".
The ImagePolicy
stuff doesn't have any workload specific information though. The policy could've gone from 5.2.0 to 5.2.1, which won't say anything about a workload that was still on 5.1.0 for some reason. Perhaps it's a new resource that was just added to git with an outdated tag. Ideally I'd be able to see detailed version transitions per workload, not per policy. In the end, the people looking these notifications, in our case at least, care about workloads, not image policies.
OK so that issue is wrongly placed here, @darkowlzz please move it or create a new one in IAC.
Okay, my bad. I moved it to image-reflector yesterday because of the misunderstanding. I'll move it to IAC again and add some more context around it. Thanks @tun0 for the clarification.
Did some polishing of the latest tags and previous - new tag implementation with tests. Everything looked fine on some testing. I'll do some more testing while writing spec docs next.
I need some more help with the emitted events for ImageRepository. A new ImageRepository which is reconciled for the first time emits two events:
29s Normal Succeeded imagerepository/podinfo successful scan, next scan in 5m0s
29s Normal Succeeded imagerepository/podinfo successful scan, found 33 tags
The first event is only a K8s event and it's also logged to provide user a signal that reconciliation took place and when the next scan would take place. Since the scanning is done only at specific periods, I think it's useful to log when it'll actually scan whenever a reconciliation runs. The second event is a K8s event and a notification controller alert. This is used to notify when there's an update in the scan result.
The first part of both the events look redundant if one checks the k8s events. Maybe we can change one of them to be different?
Also, when ImageRepository is triggered to reconcile before the scan time, maybe due to an update in the interval value or some other configuration, it does a no-op reconciliation and emits K8s event and logs:
31m Normal Succeeded imagerepository/podinfo no change in repository configuration since last scan, next scan in 2m41.993960864s
Just highlighting this behavior to get any feedback on this.
no change in repository configuration since last scan, next scan in 2m
This is confusing to me, what configuration are we referring to? If this is a no-op because the tags in the cache match the ones from upstream, then let’s log only:
no new tags found since last scan, next scan in 2m
This is confusing to me, what configuration are we referring to?
It's referring to the image URL and the exclusion list at the moment. When they are changed, it performs a scan even if it's not time for a scan based on previous results and resets the scan timer.
Regarding the no-op reconciliation, saying no new tags found since last scan
doesn't seem right because the no-op is because it's not scan time, not because of no new tags.
successful scan, found 33 tags
What happens at the next scan if the same tags are fetched, do we still issue an event or do we log only?
What happens at the next scan if the same tags are fetched, do we still issue an event or do we log only?
Then it logs and emits k8s only event: successful scan, next scan in 5m0s
.
Then it logs and emits k8s only event: successful scan, next scan in 5m0s .
Let's make that "no new tags found, next scan in 5m"
The code should be ready for review now while I'm still working on the spec docs. https://github.com/fluxcd/pkg/pull/329 should be ready in its current state. I'll address the remaining comments before marking the PR ready.
Regarding the ImageRepository events discussed above, now it only emits single event per reconciliation with different messages. When a new repository is created and it scans successfully:
successful scan, found 211 tags
This is also sent to the notification controller.
On subsequent reconciliation at the scan interval with no new tags:
no new tags found, next scan in 1h0m0s
This is not sent to the notification controller. It's only logged and recorded as k8s event.
On reconciliation triggered before the scan interval without any configuration change that affect the scan result:
no change in repository configuration since last scan, next scan in 58m39.516114621s
This is also not sent to the notification controller. It's only logged and recorded as k8s event.
I've added the validation for v1.Duration
in the API and imagerepository.spec.provider
for contextual login.
At present, I've removed the autologin flags as well. I'm not sure if we should allow existing users to experience failure due to unknown flags or we should keep the flags, not do anything with it, and remove them later.
The last thing left is spec docs. Working on that next.
At present, I've removed the autologin flags as well. I'm not sure if we should allow existing users to experience failure due to unknown flags or we should keep the flags, not do anything with it, and remove them later.
After thinking more about it, I think we need to fail as this change requires updating the existing ImageRepository objects and not failing to run image-reflector-controller would eventually result in image scan to fail, so errors and user intervention are inevitable.
@darkowlzz when we decommissioned the feature flag for libgit2 managed transport we did a hard deprecation and removed the option on a minor release. I would argue that's a valid approach here too, as we are still pre-GA.
This is ready along with the spec docs. I've also updated the gists linked in the PR description that shows the status conditions in various scenarios. While writing the spec docs, I realized that the Reconciling conditions can provide better information about the state when Ready=False, like what's the reason for a scan, new image, new exclusion list, scan interval, etc. Added some detailed Reconciling conditions in both ImageRepository and ImagePolicy for the same and updated the spec docs accordingly.
While discussing about the changes here with Hidde, he suggested that status.lastScanResult.latestTags
could be an actual yaml list/array of 10 items instead of a string of comma separated values with ...
at the end. I think it's a good idea. Anybody has any opinion about it?
I'll update the implementation once we have a discussed agreement.
was 10 chosen for a specific reason or is it just an arbitrary number?
was 10 chosen for a specific reason or is it just an arbitrary number?
@aryan9600 that's from stefan's initial proposal https://github.com/fluxcd/image-reflector-controller/pull/311#issuecomment-1245032179 , to show latest 10 tags. I don't think there's any specific reason behind it.
We had a discussion last week where we decided to convert status.lastScanResult.latestTags
to an actual list/array.
Now it looks like:
lastScanResult:
latestTags:
- latest
- 6.3.3
- 6.3.2
- 6.3.1
- 6.3.0
- 6.2.3
- 6.2.2
- 6.2.1
- 6.2.0
- 6.1.8
scanTime: "2023-02-06T19:25:41Z"
tagCount: 41
Since tagCount
makes it clear that there are more tags, I decided to not add ...
at the end of the list. But I'm ready to change that if we still need ...
for visual aid.
Also, added the autologin flags back so that it doesn't cause a failure but log an error:
{"level":"error","ts":"2023-02-07T02:00:29.082+0530","logger":"setup","msg":"autologin flags have been deprecated and have no effect. These flags will be removed in a future release. Please update the respective ImageRepository objects with .spec.provider field.","error":"use of deprecated flags"}
I've updated and rebased the whole PR. This is ready for reviews and testing.
I added a getter for the Exclusion List that's in the ImageRepository API so that we avoid the issues we had with notification-controller new API fields with defaults and did some upgrade testing to see if anything breaks. Sharing the details below so that we can observe and discuss about it.
Manual upgrade test
With the current latest version of the controller and API, I created an ImageRepository and an ImagePolicy as below:
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
metadata:
name: podinfo
namespace: default
spec:
image: ghcr.io/stefanprodan/podinfo
interval: 10m
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
metadata:
name: podinfo
namespace: default
spec:
imageRepositoryRef:
name: podinfo
policy:
semver:
range: 5.0.x
This resulted in
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
...
spec:
image: ghcr.io/stefanprodan/podinfo
interval: 10m
status:
canonicalImageName: ghcr.io/stefanprodan/podinfo
conditions:
- lastTransitionTime: "2023-02-07T18:15:54Z"
message: successful scan, found 41 tags
reason: ReconciliationSucceeded
status: "True"
type: Ready
lastScanResult:
scanTime: "2023-02-07T18:15:54Z"
tagCount: 41
observedGeneration: 1
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
...
spec:
imageRepositoryRef:
name: podinfo
policy:
semver:
range: 5.0.x
status:
conditions:
- lastTransitionTime: "2023-02-07T18:16:38Z"
message: 'Latest image tag for ''ghcr.io/stefanprodan/podinfo'' resolved to: 5.0.3'
reason: ReconciliationSucceeded
status: "True"
type: Ready
latestImage: ghcr.io/stefanprodan/podinfo:5.0.3
observedGeneration: 1
Updated the API and the controller, not the object manifests themselves, the objects became:
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
...
spec:
exclusionList:
- ^.*\.sig$
image: ghcr.io/stefanprodan/podinfo
interval: 10m
provider: generic
status:
canonicalImageName: ghcr.io/stefanprodan/podinfo
conditions:
- lastTransitionTime: "2023-02-07T19:18:01Z"
message: 'successful scan: found 41 tags'
observedGeneration: 1
reason: Succeeded
status: "True"
type: Ready
lastScanResult:
latestTags:
- latest
- 6.3.3
- 6.3.2
- 6.3.1
- 6.3.0
- 6.2.3
- 6.2.2
- 6.2.1
- 6.2.0
- 6.1.8
scanTime: "2023-02-07T19:18:01Z"
tagCount: 41
observedExclusionList:
- ^.*\.sig$
observedGeneration: 1
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
...
spec:
imageRepositoryRef:
name: podinfo
policy:
semver:
range: 5.0.x
status:
conditions:
- lastTransitionTime: "2023-02-07T19:18:01Z"
message: Latest image tag for 'ghcr.io/stefanprodan/podinfo' resolved to 5.0.3
observedGeneration: 1
reason: Succeeded
status: "True"
type: Ready
latestImage: ghcr.io/stefanprodan/podinfo:5.0.3
observedGeneration: 1
NOTE: Both the objects got converted to v1beta2 and some new default fields were added. In case of autologin, the user will have to make sure that the automatically populated .spec.provider
in ImageRepository has the appropriate value for their provider.
Updated the ImagePolicy semver range to see some new status fields:
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
...
spec:
imageRepositoryRef:
name: podinfo
policy:
semver:
range: 6.x
status:
conditions:
- lastTransitionTime: "2023-02-07T19:22:15Z"
message: Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.0.3
to 6.3.3
observedGeneration: 2
reason: Succeeded
status: "True"
type: Ready
latestImage: ghcr.io/stefanprodan/podinfo:6.3.3
observedGeneration: 2
observedPreviousImage: ghcr.io/stefanprodan/podinfo:5.0.3
Previous image is being tracked in the status now which is used to construct update event message that shows the progression of the versions.
- Introduce observed previous image in ImagePolicy status to keep track of the previous image so that it can provide update path information.
The way I read this, is that this is about latest from the point-of-view of the policy. Our existing Flux v1 notifications show the "actual" previous image, as in: what was actually configured in git for the specific workload.
The difference is subtle, but especially for new workloads it could be confusing:
- previousTag: foo
- newTag: bar
- create new workload X with "ancient" tag: qux This would trigger a notification for X mentioning updating from foo to bar, whereas it should actually say from qux to bar.
I understand this is a bit of a corner-case, but on the other hand I also kinda treat this as feature parity with Flux v1.
- create new workload X with "ancient" tag: qux This would trigger a notification for X mentioning updating from foo to bar, whereas it should actually say from qux to bar.
@tun0 Sounds like a confusion we had before, earlier in the thread. This notification is only for the image policy, independent of the workload. We'll add similar notification in image-automation-controller that'll be specific to the workload and would show the previously used tag that was in git and the new tag.
- create new workload X with "ancient" tag: qux This would trigger a notification for X mentioning updating from foo to bar, whereas it should actually say from qux to bar.
@tun0 Sounds like a confusion we had before, earlier in the thread. This notification is only for the image policy, independent of the workload. We'll add similar notification in image-automation-controller that'll be specific to the workload and would show the previously used tag that was in git and the new tag.
Ah, right. Makes sense!