Bubbled-up errors should cause the chart-assignment-controller to crash
I encountered an issue where a Helm chart I deployed caused an error from the k8s client-go library to bubble up to the chart-assignment-controller.
The controller logged the error but did not crash or restart; instead it froze and stopped deploying all new ChartAssignments.
In my case, the log was:
"couldn't get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1"
I'm not sure where in the controller's code the error is caught, but I think it should result in a crash so the breakage isn't hidden. Alternatively, the controller could ignore that error gracefully, but unfortunately I'm not sure where the error is even bubbling up.
I have an inkling that it's related to the cached discovery client? https://github.com/googlecloudrobotics/core/blob/main/src/go/pkg/synk/synk.go#L80-L94
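To illustrate the idea (a minimal sketch only, not code from synk.go; the helper name discoverResources and the tolerate-partial-failure behaviour are my assumptions): if the stale entry does come from the cached discovery client, one option would be to invalidate the cache and retry once when discovery reports a partial failure, and only surface the error if it persists.

```go
package discoveryutil

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/discovery"
)

// discoverResources wraps ServerGroupsAndResources and invalidates the cached
// discovery client on a partial discovery failure (e.g. a stale or empty
// aggregated API such as external.metrics.k8s.io/v1beta1), so one broken
// APIService doesn't poison the cache indefinitely.
func discoverResources(dc discovery.CachedDiscoveryInterface) ([]*metav1.APIResourceList, error) {
	_, resources, err := dc.ServerGroupsAndResources()
	if err == nil {
		return resources, nil
	}
	// ErrGroupDiscoveryFailed means some groups failed while others succeeded.
	// Drop the cache and retry once before giving up.
	if discovery.IsGroupDiscoveryFailedError(err) {
		dc.Invalidate()
		_, resources, err = dc.ServerGroupsAndResources()
		if err == nil || discovery.IsGroupDiscoveryFailedError(err) {
			// Whether remaining partial failures should be tolerated here or
			// bubbled up is exactly the open question in this issue.
			return resources, nil
		}
	}
	return nil, fmt.Errorf("discover server resources: %w", err)
}
```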
Another similar error:
{"timestamp":"2024-06-25T10:42:55.638819686Z","severity":"WARN","source":{"function":"github.com/googlecloudrobotics/core/src/go/pkg/controller/chartassignment.(*release).setFailed","file":"src/go/pkg/controller/chartassignment/release.go","line":206},"message":"chart failed (retrying)","phase":"Updating",
"Error":"set default namespaces: discover server resources: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:
stale GroupVersion discovery: metrics.k8s.io/v1beta1"}
It feels like overkill to crash the process on any transient error from "set default namespaces" or "discover server resources", but if the exact form/wording of the error keeps changing I'm not sure what else to suggest. Maybe a liveness check that fails if "discover server resources" fails consistently for 10+ seconds?
Tracking it across a few retries might be the best way forward for now; see the sketch below.
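A rough sketch of what that liveness check could look like, assuming the controller exposes a healthz endpoint (names like discoveryHealth and ReportFailure are made up for illustration): record when a run of consecutive discovery failures started and report unhealthy only once the run has lasted longer than a grace period.

```go
package discoveryutil

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// discoveryHealth tracks consecutive "discover server resources" failures and
// only reports unhealthy once they have persisted past a grace period.
type discoveryHealth struct {
	mu          sync.Mutex
	firstFail   time.Time // zero value means no ongoing failure streak
	gracePeriod time.Duration
}

func newDiscoveryHealth(grace time.Duration) *discoveryHealth {
	return &discoveryHealth{gracePeriod: grace}
}

// ReportSuccess resets the failure streak after a successful discovery call.
func (h *discoveryHealth) ReportSuccess() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.firstFail = time.Time{}
}

// ReportFailure records a failed discovery call, keeping the timestamp of the
// first failure in the current streak.
func (h *discoveryHealth) ReportFailure() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.firstFail.IsZero() {
		h.firstFail = time.Now()
	}
}

// Check has the controller-runtime healthz.Checker signature; it fails only if
// discovery has been failing continuously for longer than the grace period.
func (h *discoveryHealth) Check(_ *http.Request) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if !h.firstFail.IsZero() && time.Since(h.firstFail) > h.gracePeriod {
		return fmt.Errorf("discover server resources failing since %s", h.firstFail.Format(time.RFC3339))
	}
	return nil
}
```

If the controller is built on controller-runtime (I haven't checked), this could be registered with mgr.AddHealthzCheck("discovery", h.Check) and pointed at by the deployment's livenessProbe, so the kubelet restarts the pod instead of relying on the process crashing itself.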