Bubbled-up errors should cause the chart-assignment-controller to crash
I encountered an issue where a Helm chart I deployed caused an error from the k8s client-go library to bubble up to the chart-assignment-controller.
The controller logged the error but did not crash or restart; instead it froze and stopped deploying all new ChartAssignments.
In my case, the log was:
"couldn't get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1"
I'm not sure where in the controller's code the error is caught, but I think it should result in a crash so the breakage isn't hidden. Alternatively, the controller could ignore that error gracefully, but unfortunately I'm not sure where the error is even bubbling up.
I have an inkling that it's related to the cached discovery client? https://github.com/googlecloudrobotics/core/blob/main/src/go/pkg/synk/synk.go#L80-L94
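To illustrate the idea (a minimal sketch only, not code from synk.go; the helper name discoverResources and the tolerate-partial-failure behaviour are my assumptions): if the stale entry does come from the cached discovery client, one option would be to invalidate the cache and retry once when discovery reports a partial failure, and only surface the error if it persists.

```go
package discoveryutil

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/discovery"
)

// discoverResources wraps ServerGroupsAndResources and invalidates the cached
// discovery client on a partial discovery failure (e.g. a stale or empty
// aggregated API such as external.metrics.k8s.io/v1beta1), so one broken
// APIService doesn't poison the cache indefinitely.
func discoverResources(dc discovery.CachedDiscoveryInterface) ([]*metav1.APIResourceList, error) {
	_, resources, err := dc.ServerGroupsAndResources()
	if err == nil {
		return resources, nil
	}
	// ErrGroupDiscoveryFailed means some groups failed while others succeeded.
	// Drop the cache and retry once before giving up.
	if discovery.IsGroupDiscoveryFailedError(err) {
		dc.Invalidate()
		_, resources, err = dc.ServerGroupsAndResources()
		if err == nil || discovery.IsGroupDiscoveryFailedError(err) {
			// Whether remaining partial failures should be tolerated here or
			// bubbled up is exactly the open question in this issue.
			return resources, nil
		}
	}
	return nil, fmt.Errorf("discover server resources: %w", err)
}
```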
Another similar error:
{"timestamp":"2024-06-25T10:42:55.638819686Z","severity":"WARN","source":{"function":"github.com/googlecloudrobotics/core/src/go/pkg/controller/chartassignment.(*release).setFailed","file":"src/go/pkg/controller/chartassignment/release.go","line":206},"message":"chart failed (retrying)","phase":"Updating",
"Error":"set default namespaces: discover server resources: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:
stale GroupVersion discovery: metrics.k8s.io/v1beta1"}
It feels like overkill to crash the process on any transient error from "set default namespaces" or "discover server resources", but if the exact form/wording of the error keeps changing I'm not sure what else to suggest. Maybe a liveness check that fails if "discover server resources" fails consistently for 10+ seconds?
Tracking it across a few retries might be the best way forward for now; see the sketch below.
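A rough sketch of what that liveness check could look like, assuming the controller exposes a healthz endpoint (names like discoveryHealth and ReportFailure are made up for illustration): record when a run of consecutive discovery failures started and report unhealthy only once the run has lasted longer than a grace period.

```go
package discoveryutil

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// discoveryHealth tracks consecutive "discover server resources" failures and
// only reports unhealthy once they have persisted past a grace period.
type discoveryHealth struct {
	mu          sync.Mutex
	firstFail   time.Time // zero value means no ongoing failure streak
	gracePeriod time.Duration
}

func newDiscoveryHealth(grace time.Duration) *discoveryHealth {
	return &discoveryHealth{gracePeriod: grace}
}

// ReportSuccess resets the failure streak after a successful discovery call.
func (h *discoveryHealth) ReportSuccess() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.firstFail = time.Time{}
}

// ReportFailure records a failed discovery call, keeping the timestamp of the
// first failure in the current streak.
func (h *discoveryHealth) ReportFailure() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.firstFail.IsZero() {
		h.firstFail = time.Now()
	}
}

// Check has the controller-runtime healthz.Checker signature; it fails only if
// discovery has been failing continuously for longer than the grace period.
func (h *discoveryHealth) Check(_ *http.Request) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if !h.firstFail.IsZero() && time.Since(h.firstFail) > h.gracePeriod {
		return fmt.Errorf("discover server resources failing since %s", h.firstFail.Format(time.RFC3339))
	}
	return nil
}
```

If the controller is built on controller-runtime (I haven't checked), this could be registered with mgr.AddHealthzCheck("discovery", h.Check) and pointed at by the deployment's livenessProbe, so the kubelet restarts the pod instead of relying on the process crashing itself.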