Distinguish permanent API errors from transient ones

Open hiddeco opened this issue 5 months ago • 0 comments

At present, we do not distinguish "not found" errors (permanent) from, e.g., "the Kubernetes API server temporarily cannot be reached" (transient). Because of this, a Stage's verification process may fail prematurely even though the controller could, in theory, recover automatically if given the time.

Manually recovering from this is cumbersome for the user and potentially wastes the computing power already used by the AnalysisRun. I think we can do a better job of distinguishing these types of errors, and avoid giving up on transient ones by, e.g., requeueing and not erasing AnalysisRun references.

xref: https://github.com/akuity/kargo/pull/1611#discussion_r1525229572


Note: While I have only observed this to happen for a Stage's verification process, this may actually apply to more areas of Kargo.

hiddeco, Mar 15 '24 23:03