image-reflector-controller icon indicating copy to clipboard operation
image-reflector-controller copied to clipboard

Refactor reconcilers and introduce v1beta2 API

Open darkowlzz opened this issue 2 years ago • 22 comments

This change refactors image-reflector-controller to implement the new unified standards of flux controllers. It uses the result finalizer from https://github.com/fluxcd/pkg/pull/329 for computing the status conditions of ImageRepository and ImagePolicy objects such that they are kstatus compliant.

Following are some highlights of the changes:

  • Remove v1alpha API CRDs.
  • Introduce v1beta2 API which removes the old status condition readiness API and replaces them with the new conditions.Setter API from https://pkg.go.dev/github.com/fluxcd/pkg/[email protected]/conditions#Setter, along with other minor validation checks in the API fields.
  • Cleanup the controllers/ directory by moving the code that are not directly/strictly related to reconciliation operation into the internal/ directory, along with their tests.
  • Refactor the ImageRepository and ImagePolicy to use the helpers from https://pkg.go.dev/github.com/fluxcd/pkg/runtime.
  • The logs, events and notifications have been standardized for useful information to be helpful in debugging.
  • Introduce latest tags in ImageRepository status to list the 10 latest scanned tags.
  • Introduce observed previous image in ImagePolicy status to keep track of the previous image so that it can provide update path information.
  • The reconciliation business logic of both the reconcilers have been divided into smaller functions that are tested separately, out of the reconciliation loop.
  • The existing controller e2e tests weren't changed much in order to use them to ensure the behavior remains the same as much as possible. They are slightly modified to check the kstatus conditions after every run.
  • Cloud provider contextual login is configured per object with .spec.provider in ImageRepository.

The new status conditions under various scenarios can be seen in:

  • ImageRepository - https://gist.github.com/darkowlzz/19b0ef234f1c68baa42365726db1028e
  • ImagePolicy - https://gist.github.com/darkowlzz/bb017ba22a932fd4e7e2a4f421dbcbd2

Instructions for testing

To test this, checkout this branch and build the container image with make docker-build IMG=<image-name>. Before deploying the container image, update/install the CRDs to v1beta2 API with make install or apply the CRD definition manifests in config/crd/bases. Create image objects and test.

Fix: fluxcd/image-reflector-controller#259 Fix: fluxcd/image-reflector-controller#166

darkowlzz avatar Sep 13 '22 01:09 darkowlzz

@darkowlzz please delete the v1alpha CRDs, we haven't kept them around for the others controllers, so it's time to remove them here too.

stefanprodan avatar Sep 13 '22 07:09 stefanprodan

For ImageRepository In propose that we add the latest ten tags fetched from the registry. The tag list would help users debug polices by looking up .status.lastScanResult.latestTags e.g.:

status:
  canonicalImageName: ghcr.io/stefanprodan/podinfo
  conditions:
  - lastTransitionTime: "2022-09-12T11:59:49Z"
    message: successful scan, found 34 tags
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  lastScanResult:
    scanTime: "2022-09-12T11:59:49Z"
    tagCount: 34
    latestTags: '6.2.0, 6.1.8, 6.1.7, 6.1.6, 6.1.5, 6.1.4, 6.1.3, 6.1.2, 6.1.1, 6.1.0, ...'
  observedExclusionList:
  - ^.*\.sig$
  observedGeneration: 1

stefanprodan avatar Sep 13 '22 07:09 stefanprodan

@darkowlzz does this PR solves https://github.com/fluxcd/image-reflector-controller/issues/166?

stefanprodan avatar Sep 13 '22 08:09 stefanprodan

does this PR solves fluxcd/image-reflector-controller#166?

@stefanprodan yes. In https://github.com/fluxcd/image-reflector-controller/blob/5b6015b03766c99309a3418838c0df07b7aa3800/controllers/imagerepository_controller.go#L185-L188 it tries to only send notification when the result changes. I'll do some more testing to make sure it's handled well.

darkowlzz avatar Sep 13 '22 09:09 darkowlzz

Removed alpha APIs and fixed the fuzz test failure along with some other minor changes to reduce the verbosity of the test logs from badger and test registry server.

Added a rough implementation of latest tags in ImageRepository status:

status:
  canonicalImageName: ghcr.io/stefanprodan/podinfo
  conditions:
  - lastTransitionTime: "2022-09-13T23:23:46Z"
    message: successful scan, found 34 tags
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  lastScanResult:
    latestTags: latest, 6.2.0, 6.1.8, 6.1.7, 6.1.6, 6.1.5, 6.1.4, 6.1.3, 6.1.2, 6.1.1
      ...
    scanTime: "2022-09-13T23:48:52Z"
    tagCount: 34
  observedExclusionList:
  - ^.*\.sig$
  observedGeneration: 1

Looks like we may have to remove latest from it.

Also added previous tag in the new ImagePolicy version events to address fluxcd/image-automation-controller#437 by including previous tag in the update event. When it's first policy apply, it results in event:

Latest image tag for 'ghcr.io/stefanprodan/podinfo' resolved to: 5.2.0

When there's an update, it results in event:

Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0 to 5.2.1

This requires introducing a new field in ImagePolicy, observedPreviousImage to keep track of the previous image in order to prevent duplicate events on subsequent reconciliation without any actual update.

status:
  conditions:
  - lastTransitionTime: "2022-09-13T23:03:29Z"
    message: Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0
      to 5.2.1
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  latestImage: ghcr.io/stefanprodan/podinfo:5.2.1
  observedGeneration: 1
  observedPreviousImage: ghcr.io/stefanprodan/podinfo:5.2.0

These are implemented in the last commit. Need some feedback and help in improving the messaging. I still need to test them more and write some more unit tests.

darkowlzz avatar Sep 13 '22 23:09 darkowlzz

Looks like we may have to remove latest from it.

No we don't, we just trim the list to max 10 items without altering them.

Need some feedback and help in improving the messaging.

I suggest we do:

Latest image tag for 'ghcr.io/stefanprodan/podinfo' resolved to 5.2.0

Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0 to 5.2.1

stefanprodan avatar Sep 14 '22 08:09 stefanprodan

Fix: fluxcd/image-automation-controller#437

I don't see how this would enhance the data available through https://pkg.go.dev/github.com/fluxcd/image-automation-controller/pkg/update? But, I could very well be overlooking something given my very limited Go knowledge :wink:

tun0 avatar Sep 14 '22 11:09 tun0

@tun0 you would setup alerting for ImagePolicy objects, and those will tell you "Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.2.0 to 5.2.1", then the ImageAutomation alert will tell you "Image 'ghcr.io/stefanprodan/podinfo' updated from 5.2.1".

stefanprodan avatar Sep 14 '22 12:09 stefanprodan

The ImagePolicy stuff doesn't have any workload specific information though. The policy could've gone from 5.2.0 to 5.2.1, which won't say anything about a workload that was still on 5.1.0 for some reason. Perhaps it's a new resource that was just added to git with an outdated tag. Ideally I'd be able to see detailed version transitions per workload, not per policy. In the end, the people looking these notifications, in our case at least, care about workloads, not image policies.

tun0 avatar Sep 14 '22 12:09 tun0

OK so that issue is wrongly placed here, @darkowlzz please move it or create a new one in IAC.

stefanprodan avatar Sep 14 '22 12:09 stefanprodan

Okay, my bad. I moved it to image-reflector yesterday because of the misunderstanding. I'll move it to IAC again and add some more context around it. Thanks @tun0 for the clarification.

darkowlzz avatar Sep 14 '22 13:09 darkowlzz

Did some polishing of the latest tags and previous - new tag implementation with tests. Everything looked fine on some testing. I'll do some more testing while writing spec docs next.

I need some more help with the emitted events for ImageRepository. A new ImageRepository which is reconciled for the first time emits two events:

29s         Normal   Succeeded   imagerepository/podinfo   successful scan, next scan in 5m0s
29s         Normal   Succeeded   imagerepository/podinfo   successful scan, found 33 tags

The first event is only a K8s event and it's also logged to provide user a signal that reconciliation took place and when the next scan would take place. Since the scanning is done only at specific periods, I think it's useful to log when it'll actually scan whenever a reconciliation runs. The second event is a K8s event and a notification controller alert. This is used to notify when there's an update in the scan result.

The first part of both the events look redundant if one checks the k8s events. Maybe we can change one of them to be different?

Also, when ImageRepository is triggered to reconcile before the scan time, maybe due to an update in the interval value or some other configuration, it does a no-op reconciliation and emits K8s event and logs:

31m         Normal   Succeeded   imagerepository/podinfo   no change in repository configuration since last scan, next scan in 2m41.993960864s

Just highlighting this behavior to get any feedback on this.

darkowlzz avatar Sep 14 '22 21:09 darkowlzz

no change in repository configuration since last scan, next scan in 2m

This is confusing to me, what configuration are we referring to? If this is a no-op because the tags in the cache match the ones from upstream, then let’s log only:

no new tags found since last scan, next scan in 2m

stefanprodan avatar Sep 15 '22 09:09 stefanprodan

This is confusing to me, what configuration are we referring to?

It's referring to the image URL and the exclusion list at the moment. When they are changed, it performs a scan even if it's not time for a scan based on previous results and resets the scan timer.

Regarding the no-op reconciliation, saying no new tags found since last scan doesn't seem right because the no-op is because it's not scan time, not because of no new tags.

darkowlzz avatar Sep 15 '22 09:09 darkowlzz

successful scan, found 33 tags

What happens at the next scan if the same tags are fetched, do we still issue an event or do we log only?

stefanprodan avatar Sep 15 '22 09:09 stefanprodan

What happens at the next scan if the same tags are fetched, do we still issue an event or do we log only?

Then it logs and emits k8s only event: successful scan, next scan in 5m0s .

darkowlzz avatar Sep 15 '22 09:09 darkowlzz

Then it logs and emits k8s only event: successful scan, next scan in 5m0s .

Let's make that "no new tags found, next scan in 5m"

stefanprodan avatar Sep 15 '22 09:09 stefanprodan

The code should be ready for review now while I'm still working on the spec docs. https://github.com/fluxcd/pkg/pull/329 should be ready in its current state. I'll address the remaining comments before marking the PR ready.

Regarding the ImageRepository events discussed above, now it only emits single event per reconciliation with different messages. When a new repository is created and it scans successfully:

successful scan, found 211 tags

This is also sent to the notification controller.

On subsequent reconciliation at the scan interval with no new tags:

no new tags found, next scan in 1h0m0s

This is not sent to the notification controller. It's only logged and recorded as k8s event.

On reconciliation triggered before the scan interval without any configuration change that affect the scan result:

no change in repository configuration since last scan, next scan in 58m39.516114621s

This is also not sent to the notification controller. It's only logged and recorded as k8s event.

darkowlzz avatar Sep 15 '22 23:09 darkowlzz

I've added the validation for v1.Duration in the API and imagerepository.spec.provider for contextual login. At present, I've removed the autologin flags as well. I'm not sure if we should allow existing users to experience failure due to unknown flags or we should keep the flags, not do anything with it, and remove them later.

The last thing left is spec docs. Working on that next.

darkowlzz avatar Oct 05 '22 17:10 darkowlzz

At present, I've removed the autologin flags as well. I'm not sure if we should allow existing users to experience failure due to unknown flags or we should keep the flags, not do anything with it, and remove them later.

After thinking more about it, I think we need to fail as this change requires updating the existing ImageRepository objects and not failing to run image-reflector-controller would eventually result in image scan to fail, so errors and user intervention are inevitable.

darkowlzz avatar Oct 06 '22 02:10 darkowlzz

@darkowlzz when we decommissioned the feature flag for libgit2 managed transport we did a hard deprecation and removed the option on a minor release. I would argue that's a valid approach here too, as we are still pre-GA.

pjbgf avatar Oct 06 '22 09:10 pjbgf

This is ready along with the spec docs. I've also updated the gists linked in the PR description that shows the status conditions in various scenarios. While writing the spec docs, I realized that the Reconciling conditions can provide better information about the state when Ready=False, like what's the reason for a scan, new image, new exclusion list, scan interval, etc. Added some detailed Reconciling conditions in both ImageRepository and ImagePolicy for the same and updated the spec docs accordingly.

darkowlzz avatar Oct 07 '22 12:10 darkowlzz

While discussing about the changes here with Hidde, he suggested that status.lastScanResult.latestTags could be an actual yaml list/array of 10 items instead of a string of comma separated values with ... at the end. I think it's a good idea. Anybody has any opinion about it? I'll update the implementation once we have a discussed agreement.

darkowlzz avatar Jan 17 '23 22:01 darkowlzz

was 10 chosen for a specific reason or is it just an arbitrary number?

aryan9600 avatar Jan 18 '23 06:01 aryan9600

was 10 chosen for a specific reason or is it just an arbitrary number?

@aryan9600 that's from stefan's initial proposal https://github.com/fluxcd/image-reflector-controller/pull/311#issuecomment-1245032179 , to show latest 10 tags. I don't think there's any specific reason behind it.

darkowlzz avatar Jan 18 '23 19:01 darkowlzz

We had a discussion last week where we decided to convert status.lastScanResult.latestTags to an actual list/array. Now it looks like:

  lastScanResult:
    latestTags:
    - latest
    - 6.3.3
    - 6.3.2
    - 6.3.1
    - 6.3.0
    - 6.2.3
    - 6.2.2
    - 6.2.1
    - 6.2.0
    - 6.1.8
    scanTime: "2023-02-06T19:25:41Z"
    tagCount: 41

Since tagCount makes it clear that there are more tags, I decided to not add ... at the end of the list. But I'm ready to change that if we still need ... for visual aid.

Also, added the autologin flags back so that it doesn't cause a failure but log an error:

{"level":"error","ts":"2023-02-07T02:00:29.082+0530","logger":"setup","msg":"autologin flags have been deprecated and have no effect. These flags will be removed in a future release. Please update the respective ImageRepository objects with .spec.provider field.","error":"use of deprecated flags"}

I've updated and rebased the whole PR. This is ready for reviews and testing.

darkowlzz avatar Feb 06 '23 20:02 darkowlzz

I added a getter for the Exclusion List that's in the ImageRepository API so that we avoid the issues we had with notification-controller new API fields with defaults and did some upgrade testing to see if anything breaks. Sharing the details below so that we can observe and discuss about it.

Manual upgrade test

With the current latest version of the controller and API, I created an ImageRepository and an ImagePolicy as below:

---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
metadata:
  name: podinfo
  namespace: default
spec:
  image: ghcr.io/stefanprodan/podinfo
  interval: 10m
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
metadata:
  name: podinfo
  namespace: default
spec:
  imageRepositoryRef:
    name: podinfo
  policy:
    semver:
      range: 5.0.x

This resulted in

apiVersion: image.toolkit.fluxcd.io/v1beta1 
kind: ImageRepository                                                                 
...
spec:                                                                                                                                                                        
  image: ghcr.io/stefanprodan/podinfo
  interval: 10m                                                                                                                                                              
status:                                                                               
  canonicalImageName: ghcr.io/stefanprodan/podinfo 
  conditions:
  - lastTransitionTime: "2023-02-07T18:15:54Z"                                                                                                                               
    message: successful scan, found 41 tags                 
    reason: ReconciliationSucceeded
    status: "True"                                                                    
    type: Ready
  lastScanResult:
    scanTime: "2023-02-07T18:15:54Z"
    tagCount: 41
  observedGeneration: 1
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
...
spec:
  imageRepositoryRef:
    name: podinfo
  policy:
    semver:
      range: 5.0.x
status:
  conditions:
  - lastTransitionTime: "2023-02-07T18:16:38Z"
    message: 'Latest image tag for ''ghcr.io/stefanprodan/podinfo'' resolved to: 5.0.3'
    reason: ReconciliationSucceeded
    status: "True"
    type: Ready
  latestImage: ghcr.io/stefanprodan/podinfo:5.0.3
  observedGeneration: 1

Updated the API and the controller, not the object manifests themselves, the objects became:

apiVersion: image.toolkit.fluxcd.io/v1beta2                                                    
kind: ImageRepository                                                                 
...
spec:                
  exclusionList:  
  - ^.*\.sig$    
  image: ghcr.io/stefanprodan/podinfo
  interval: 10m
  provider: generic
status:    
  canonicalImageName: ghcr.io/stefanprodan/podinfo                                                                     
  conditions:
  - lastTransitionTime: "2023-02-07T19:18:01Z"                                                                         
    message: 'successful scan: found 41 tags'                                                                          
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready                                
  lastScanResult:                              
    latestTags:                                
    - latest                                   
    - 6.3.3                                    
    - 6.3.2                                                
    - 6.3.1                                                                                                                                                                                   
    - 6.3.0                                    
    - 6.2.3                                                
    - 6.2.2                                                
    - 6.2.1                                                
    - 6.2.0                                                
    - 6.1.8                                                
    scanTime: "2023-02-07T19:18:01Z"                       
    tagCount: 41                                           
  observedExclusionList:                                   
  - ^.*\.sig$                                              
  observedGeneration: 1
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
...
spec:
  imageRepositoryRef:
    name: podinfo
  policy:
    semver:
      range: 5.0.x
status:
  conditions:
  - lastTransitionTime: "2023-02-07T19:18:01Z"
    message: Latest image tag for 'ghcr.io/stefanprodan/podinfo' resolved to 5.0.3
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  latestImage: ghcr.io/stefanprodan/podinfo:5.0.3
  observedGeneration: 1

NOTE: Both the objects got converted to v1beta2 and some new default fields were added. In case of autologin, the user will have to make sure that the automatically populated .spec.provider in ImageRepository has the appropriate value for their provider.

Updated the ImagePolicy semver range to see some new status fields:

apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
...
spec:
  imageRepositoryRef:
    name: podinfo
  policy:
    semver:
      range: 6.x
status:
  conditions:
  - lastTransitionTime: "2023-02-07T19:22:15Z"
    message: Latest image tag for 'ghcr.io/stefanprodan/podinfo' updated from 5.0.3
      to 6.3.3
    observedGeneration: 2
    reason: Succeeded
    status: "True"
    type: Ready
  latestImage: ghcr.io/stefanprodan/podinfo:6.3.3
  observedGeneration: 2
  observedPreviousImage: ghcr.io/stefanprodan/podinfo:5.0.3

Previous image is being tracked in the status now which is used to construct update event message that shows the progression of the versions.

darkowlzz avatar Feb 07 '23 19:02 darkowlzz

  • Introduce observed previous image in ImagePolicy status to keep track of the previous image so that it can provide update path information.

The way I read this, is that this is about latest from the point-of-view of the policy. Our existing Flux v1 notifications show the "actual" previous image, as in: what was actually configured in git for the specific workload.

The difference is subtle, but especially for new workloads it could be confusing:

  • previousTag: foo
  • newTag: bar
  • create new workload X with "ancient" tag: qux This would trigger a notification for X mentioning updating from foo to bar, whereas it should actually say from qux to bar.

I understand this is a bit of a corner-case, but on the other hand I also kinda treat this as feature parity with Flux v1.

tun0 avatar Feb 09 '23 12:02 tun0

  • create new workload X with "ancient" tag: qux This would trigger a notification for X mentioning updating from foo to bar, whereas it should actually say from qux to bar.

@tun0 Sounds like a confusion we had before, earlier in the thread. This notification is only for the image policy, independent of the workload. We'll add similar notification in image-automation-controller that'll be specific to the workload and would show the previously used tag that was in git and the new tag.

darkowlzz avatar Feb 09 '23 12:02 darkowlzz

  • create new workload X with "ancient" tag: qux This would trigger a notification for X mentioning updating from foo to bar, whereas it should actually say from qux to bar.

@tun0 Sounds like a confusion we had before, earlier in the thread. This notification is only for the image policy, independent of the workload. We'll add similar notification in image-automation-controller that'll be specific to the workload and would show the previously used tag that was in git and the new tag.

Ah, right. Makes sense!

tun0 avatar Feb 09 '23 12:02 tun0