gateway icon indicating copy to clipboard operation
gateway copied to clipboard

Errors during reconcile causes a creation of partial xDS configuraion

Open shahar-h opened this issue 5 months ago • 6 comments

Description: After upgrading from v1.3.2 to v1.4.0 we started to see filter_chain_not_found and cluster_not_found errors in envoy access logs. The errors disappeared after few hours. The errors started with a specific xDS update and stopped after another update. We discovered a single error log exactly on the time the errors started (filter chain not found):

"ERROR  provider  kubernetes/controller.go:237  failed processGateways for gatewayClass eg, skipping it  {"runner": "provider", "error": "failed to list ***, Kind=***: etcdserver: leader changed"}"

We noticed #5953 which replaced all return reconcile.Result{}, err statements with continue statements in Reconcile function. This explains the behavior we observed since gateway resources are stored here regardless of errors during reconcile process which can lead to a storage of a partial state like in our case.

Furthermore, reconcile.Result{}, nil is always returned from Reconcile function regardless of any errors in the process, which means controller-runtime backoff mechanism will not be triggered. In case of a transient error(e.g. etcd error like in our case) we'll have to wait for some watch event so Reconcile will be called again.

Repro steps: Produce some transient error in the middle of Reconcile process.

Environment: EG v1.4.0

shahar-h avatar Jun 10 '25 09:06 shahar-h

cc @arkodg

zirain avatar Jun 10 '25 09:06 zirain

cc @arkodg . This was discussed in yesterday's community meeting. Main points were:

  • Explore handling transient errors differently from other (validation, ...) errors, e.g. retry at least in case of transient errors
  • Check if we can store configs only for successfully GWCs without wiping out existing config for errored GWCs

guydc avatar Jun 11 '25 17:06 guydc

yeah, we can semi revert the logic, to return early with an err from the Reconcile if we get any Kube client errors not tied to bad config ( like a SP Ext proc config pointing to an invalid backend )

arkodg avatar Jun 11 '25 18:06 arkodg

@arkodg @guydc would it be sufficient if controller checked err against set of https://pkg.go.dev/k8s.io/apimachinery/pkg/api/errors errors? like e.g.:

func isTransientError(err error) bool {
	return kerrors.IsServerTimeout(err) ||
		kerrors.IsTimeout(err) ||
		kerrors.IsTooManyRequests(err) ||
		kerrors.IsServiceUnavailable(err) ||
		kerrors.IsStoreReadError(err) ||
		kerrors.IsUnexpectedServerError(err)
}

func (r *gatewayAPIReconciler) Reconcile(ctx context.Context, _ reconcile.Request) (reconcile.Result, error) {
		// Add all Gateways, their associated Routes, and referenced resources to the resourceTree
		if err = r.processGateways(ctx, managedGC, resourceMappings, gwcResource); err != nil {
			if isTransientError(err) {
				return reconcile.Result{}, err
			}
			r.log.Error(err, fmt.Sprintf("failed processGateways for gatewayClass %s, skipping it", managedGC.Name))
			continue
		}
...

patrostkowski avatar Jun 12 '25 17:06 patrostkowski

That's a solid approach @patrostkowski, +1

arkodg avatar Jun 12 '25 17:06 arkodg

/assign

patrostkowski avatar Jun 12 '25 18:06 patrostkowski