cli-utils Bug: Inventory updates should tolerate drift (and overwrite it)

Right now, inventory updates can fail with a conflict error from Kubernetes when another client has modified the object since it was last read. The inventory client should detect this (apierrors.IsConflict(err)) and retry with a fresh Get (to pick up the latest ResourceVersion) followed by an Update.

Example retry code:

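import (
	"context"
	"fmt"
	"time"
)

// retriable performs one attempt and reports whether the caller should retry.
type retriable func(ctx context.Context) (retry bool, err error)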

func retryWithBackoff(ctx context.Context, timeout time.Duration, fn retriable) error {
	var err error
	var retry bool
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	delay := 1 * time.Second
	for {
		// attempt to update status
		retry, err = fn(ctx)
		if !retry {
			return err
		}

		// wait until delay or timeout
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return fmt.Errorf("timed out after retrying for %v: %w", timeout, err)
		case <-timer.C:
			// continue
		}
		// retry backoff
		delay = delay * 2
	}
}

Example usage:

	// attempt to update status until timeout
	ctx := context.TODO()
	timeout := 1 * time.Minute
	return retryWithBackoff(ctx, timeout, func(ctx context.Context) (retry bool, err error) {
		// Get the object to get the latest ResourceVersion.
		latestObj, err := resource.Get(ctx, obj.GetName(), metav1.GetOptions{TypeMeta: meta})
		if err != nil {
			return false, fmt.Errorf("failed to get inventory status from cluster: %w", err)
		}
		// Ignore any status changes made remotely.
		// This update will replace them.
		obj.SetResourceVersion(latestObj.GetResourceVersion())

		_, err = resource.UpdateStatus(ctx, obj, metav1.UpdateOptions{TypeMeta: meta})
		if err != nil {
			// retry if conflict
			return apierrors.IsConflict(err), fmt.Errorf("failed to write updated inventory status to cluster: %w", err)
		}
		return false, nil
	})

Another option is to use https://github.com/flowchartsman/retry which is nice and generic. gcloud and client-go also have retry libs.

Mar 02 '22 04:03 karlkfi

The main client causing drift right now is the Config Sync resource-group-controller, which updates the ResourceGroup (inventory) status.

Mar 02 '22 04:03 karlkfi

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

May 31 '22 05:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Jun 30 '22 06:06 k8s-triage-robot

/remove-lifecycle rotten
/lifecycle frozen

Jul 25 '22 21:07 karlkfi
