image-automation-controller
Image-Automation-Controller doesn't correctly set status of removed resources to "deleted" in exposed metrics
Describe the bug
Our team removed all resources for a project called platform-toy. The imageupdateautomation object and the associated gitrepository objects were removed both from the cluster and from the Git repository that Flux is connected to.
After removing all resources related to the project, we began receiving alerts that the imageupdateautomation object was unable to reconcile, even though the object no longer existed.
Further investigation revealed that, after the object was removed from the cluster, the metric gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="False",type="Ready"} exposed by the image-automation-controller was still set to 1, while the metric gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="Deleted",type="Ready"} remained at 0.
Our alerting is configured to fire when gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="False",type="Ready"} is set to 1.
It seems that after an imageupdateautomation object is removed from the cluster, the image-automation-controller does not correctly identify that the imageupdateautomation object has been deleted and does not correctly update its metrics.
Upon restarting the image-automation-controller, the platform-toy imageupdateautomation object is not seen by the controller anymore and the alerting stops since the metric is no longer advertised.
Steps to reproduce
- Create the target namespace: platform-toy
- Deploy an imageupdateautomation object with name platform-toy
- Delete the imageupdateautomation object and namespace from the cluster
- Check the "/metrics" endpoint of the image-automation-controller to see what metrics are being exposed
- The image-automation-controller will expose the metric: "gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="False",type="Ready"}" 1
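
The check in the last two steps can be scripted. The following is a minimal sketch, not part of Flux: it assumes the text of the "/metrics" endpoint has already been fetched (for example through a port-forward), and the helper name active_statuses is hypothetical. It reports, for each object, which status series currently has the value 1.

```python
import re

# One sample line of the Prometheus text exposition format, e.g.
#   gotk_reconcile_condition{kind="X",name="y",status="False",type="Ready"} 1
# An optional trailing timestamp is tolerated.
SAMPLE_RE = re.compile(
    r'^(?P<metric>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def active_statuses(metrics_text, metric="gotk_reconcile_condition"):
    """Return {(kind, namespace, name): [statuses whose sample value is 1]}.

    Hypothetical helper for illustration; it only reads text that was
    scraped from the controller's /metrics endpoint.
    """
    result = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("metric") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("kind"), labels.get("namespace"), labels.get("name"))
        result.setdefault(key, [])
        if float(match.group("value")) == 1:
            result[key].append(labels.get("status"))
    return result


if __name__ == "__main__":
    # Sample lines taken from the output reported in this issue.
    prefix = ('gotk_reconcile_condition{kind="ImageUpdateAutomation",'
              'name="platform-toy",namespace="platform-toy",')
    sample = "\n".join([
        prefix + 'status="Deleted",type="Ready"} 0',
        prefix + 'status="False",type="Ready"} 1',
        prefix + 'status="True",type="Ready"} 0',
        prefix + 'status="Unknown",type="Ready"} 0',
    ])
    print(active_statuses(sample))
```

If the object has been deleted and the result still lists "False" for it, the stale series described in this issue is present.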
Expected behavior
The image-automation-controller should expose the following metric:
gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="Deleted",type="Ready"} 1
Or the image-automation-controller should remove the resource from its metrics endpoint entirely.
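
To make the two acceptable outcomes concrete, here is a toy model of the bookkeeping. It is only a sketch under stated assumptions: the class ConditionGauges and its methods are hypothetical and are not the controller's actual code or the Prometheus client API. It shows either flipping the "Deleted" series to 1 or dropping every series for the object.

```python
# The four status label values seen in the metrics reported in this issue.
STATUSES = ("True", "False", "Unknown", "Deleted")


class ConditionGauges:
    """Toy stand-in for a labelled gauge (hypothetical, for illustration)."""

    def __init__(self):
        # (kind, namespace, name, status) -> 0.0 or 1.0
        self.samples = {}

    def record(self, kind, namespace, name, status):
        """Set the series for `status` to 1 and its sibling series to 0."""
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        for s in STATUSES:
            self.samples[(kind, namespace, name, s)] = 1.0 if s == status else 0.0

    def forget(self, kind, namespace, name):
        """Drop every series for the object so nothing stale is exported."""
        for s in STATUSES:
            self.samples.pop((kind, namespace, name, s), None)


if __name__ == "__main__":
    obj = ("ImageUpdateAutomation", "platform-toy", "platform-toy")
    gauges = ConditionGauges()
    gauges.record(*obj, "False")    # last state before deletion

    gauges.record(*obj, "Deleted")  # outcome 1: Deleted becomes 1
    print(gauges.samples[obj + ("Deleted",)], gauges.samples[obj + ("False",)])

    gauges.forget(*obj)             # outcome 2: series removed entirely
    print(len(gauges.samples))
```

With either outcome, an alert on status="False" equal to 1 would stop firing once the object is gone.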
Screenshots and recordings
The following is the console output before the imageupdateautomation object is removed from the cluster:
$ kubectl get imageupdateautomation -n platform-toy
NAME           LAST RUN
platform-toy
The metrics exposed by the image-automation-controller are the following:
gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="Deleted",type="Ready"} 0
gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="False",type="Ready"} 1
gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="True",type="Ready"} 0
gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="Unknown",type="Ready"} 0
The following console output is after the resources have been deleted:
$ kubectl get imageupdateautomation -n platform-toy
No resources found in platform-toy namespace.
The metrics exposed by the image-automation-controller are the following after deleting the resource:
gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="Deleted",type="Ready"} 0
gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="False",type="Ready"} 1
gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="True",type="Ready"} 0
gotk_reconcile_condition{kind="ImageUpdateAutomation",name="platform-toy",namespace="platform-toy",status="Unknown",type="Ready"} 0
OS / Distro
VMware Photon OS/Linux
Flux version
flux version 0.38.2
Flux check
N/A
Git provider
GitLab
Container Registry provider
Artifactory
Additional context
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct