gateway
gateway copied to clipboard
Errors during reconcile causes a creation of partial xDS configuraion
Description:
After upgrading from v1.3.2 to v1.4.0 we started to see filter_chain_not_found and cluster_not_found errors in envoy access logs.
The errors disappeared after few hours.
The errors started with a specific xDS update and stopped after another update.
We discovered a single error log exactly on the time the errors started (filter chain not found):
"ERROR provider kubernetes/controller.go:237 failed processGateways for gatewayClass eg, skipping it {"runner": "provider", "error": "failed to list ***, Kind=***: etcdserver: leader changed"}"
We noticed #5953 which replaced all return reconcile.Result{}, err statements with continue statements in Reconcile function.
This explains the behavior we observed since gateway resources are stored here regardless of errors during reconcile process which can lead to a storage of a partial state like in our case.
Furthermore, reconcile.Result{}, nil is always returned from Reconcile function regardless of any errors in the process, which means controller-runtime backoff mechanism will not be triggered. In case of a transient error(e.g. etcd error like in our case) we'll have to wait for some watch event so Reconcile will be called again.
Repro steps: Produce some transient error in the middle of Reconcile process.
Environment: EG v1.4.0
cc @arkodg
cc @arkodg . This was discussed in yesterday's community meeting. Main points were:
- Explore handling transient errors differently from other (validation, ...) errors, e.g. retry at least in case of transient errors
- Check if we can store configs only for successfully GWCs without wiping out existing config for errored GWCs
yeah, we can semi revert the logic, to return early with an err from the Reconcile if we get any Kube client errors not tied to bad config ( like a SP Ext proc config pointing to an invalid backend )
@arkodg @guydc would it be sufficient if controller checked err against set of https://pkg.go.dev/k8s.io/apimachinery/pkg/api/errors errors? like e.g.:
func isTransientError(err error) bool {
return kerrors.IsServerTimeout(err) ||
kerrors.IsTimeout(err) ||
kerrors.IsTooManyRequests(err) ||
kerrors.IsServiceUnavailable(err) ||
kerrors.IsStoreReadError(err) ||
kerrors.IsUnexpectedServerError(err)
}
func (r *gatewayAPIReconciler) Reconcile(ctx context.Context, _ reconcile.Request) (reconcile.Result, error) {
// Add all Gateways, their associated Routes, and referenced resources to the resourceTree
if err = r.processGateways(ctx, managedGC, resourceMappings, gwcResource); err != nil {
if isTransientError(err) {
return reconcile.Result{}, err
}
r.log.Error(err, fmt.Sprintf("failed processGateways for gatewayClass %s, skipping it", managedGC.Name))
continue
}
...
That's a solid approach @patrostkowski, +1
/assign