Builders do not recover from temporary issues.

phil9909 opened this issue 4 years ago • 7 comments • Status: Open

  1. I created a CustomBuilder object, and at creation time the specified registry was down.
  2. As expected, the Ready condition was set to false.
  3. The registry became available shortly after.

Expected behavior: the Ready condition eventually transitions to true (assuming no further errors, of course).
Observed behavior: Ready stays false forever.

I think this is caused by unconditionally wrapping any error with NewPermanentError: https://github.com/pivotal/kpack/blob/0836eeb5791ce1961dbd68bcb6f2802d1034035c/pkg/reconciler/builder/builder.go#L88-L98

According to the knative documentation:

Users can wrap an error as permanentError with this in reconcile when they do not expect the key to get re-queued.

IMHO this should not be used for errors from external dependencies, so that kpack is more resilient against short outages.
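
For reference, here is a minimal sketch of the knative behavior in question (assuming knative.dev/pkg/controller; the error value is illustrative):

package main

import (
	"errors"
	"fmt"

	"knative.dev/pkg/controller"
)

func main() {
	err := errors.New("registry unreachable") // a transient failure

	// Wrapping marks the failure as final: knative's base reconciler
	// "forgets" the key instead of requeueing it with backoff.
	perm := controller.NewPermanentError(err)

	fmt.Println(controller.IsPermanentError(perm)) // true  -> never retried
	fmt.Println(controller.IsPermanentError(err))  // false -> requeued with backoff
}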

phil9909 · Nov 17 '20 14:11

+1 for having this in place! Does anyone have any updates on when a potential solution for this could land? Thx!

BRONSOLO · Feb 18 '22 15:02

Hi, if no one else is working on this, I'd like to give it a shot!

jjbustamante · Aug 02 '22 15:08

OK, after thinking about this issue a bit, here are my first thoughts.

kpack implements a custom Client to interact with the registry. This client is built on the GGCR (go-containerregistry) library, which already has a retry mechanism for handling network issues, but I think the default configuration is not enough for us. Maybe we can discuss what a healthy Backoff configuration for kpack would look like.

Here is the code I am talking about:

func (t *Client) Fetch(keychain authn.Keychain, repoName string) (v1.Image, string, error) {
	// some code...

	// Current code; I think this falls back to ggcr's defaultRetryBackoff:
	// image, err := remote.Image(reference, remote.WithAuthFromKeychain(keychain))

	// Proposal: pass an explicit Backoff configuration (remote.Backoff{} is a
	// placeholder here; the actual values are what we would need to discuss):
	image, err := remote.Image(reference, remote.WithAuthFromKeychain(keychain), remote.WithRetryBackoff(remote.Backoff{}))

	if err != nil {
		return nil, "", err
	}

	// more code ...
	return image, identifier, nil
}

The defaultRetryBackoff is configured as follows:

var defaultRetryBackoff = Backoff{
	Duration: 1.0 * time.Second,
	Factor:   3.0,
	Jitter:   0.1,
	Steps:    3,
}

From the code comments, this means "Try this three times, waiting 1s after first failure, 3s after second". Maybe if we just configure different values we can fix this issue without changing the internal behavior of using NewPermanentError. I am not saying we shouldn't change that too, but I think the retry logic for image fetching covers the most common case. My one concern: the pod is going to run longer before the Ready state is set to False.
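
For illustration, a more generous configuration might look like the sketch below. The values are hypothetical, purely to show the shape of the change; with Factor 4.0 this waits roughly 2s, 8s, 32s, and 128s (plus jitter) between five attempts:

// Hypothetical tuning, not a concrete proposal.
var kpackRetryBackoff = remote.Backoff{
	Duration: 2 * time.Second,
	Factor:   4.0,
	Jitter:   0.1,
	Steps:    5,
}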

jjbustamante · Aug 02 '22 16:08

I think updating the ggcr retry logic to better match our needs makes sense! However, I think sometimes the backing registry is completely down and it may be more appropriate to retry the reconcile many minutes later as well.

matthewmcnew · Aug 02 '22 17:08

Thanks for the feedback @matthewmcnew. Based on that, the other thing we can do is the following:

  1. In the same kpack client code, we can wrap the error with a sentinel value that tells the parent logic the failure is retryable, something like:
// Sentinel shared with the reconcilers; wrapping only a message string, as in
// errors.Wrap(err, "retry-able"), would not be matchable with errors.Is.
var ErrRetryable = errors.New("retryable")

func (t *Client) Fetch(keychain authn.Keychain, repoName string) (v1.Image, string, error) {
	// some code...

	image, err := remote.Image(reference, remote.WithAuthFromKeychain(keychain), remote.WithRetryBackoff(remote.Backoff{}))
	if err != nil {
		// Wrap the sentinel so callers can detect it with errors.Is.
		return nil, "", fmt.Errorf("%w: %v", ErrRetryable, err)
	}

	// more code ...
	return image, identifier, nil
}
  2. In the Builder reconciler, we can use this sentinel to re-queue the key later:
func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
	// some code

	builderRecord, creationError := c.reconcileBuilder(ctx, builder)
	if creationError != nil {
		builder.Status.ErrorCreate(creationError)

		err := c.updateStatus(ctx, builder)
		if err != nil {
			return err
		}

		// New: requeue retryable failures instead of giving up permanently.
		// registry.ErrRetryable is the sentinel defined in the client above.
		if errors.Is(creationError, registry.ErrRetryable) {
			return controller.NewRequeueAfter(time.Minute * 30)
		}
		return controller.NewPermanentError(creationError)
	}

	builder.Status.BuilderRecord(builderRecord)
	return c.updateStatus(ctx, builder)
}
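
As a sanity check on the sentinel wiring (a hypothetical test, not part of any PR), errors.Is should match through the %w wrap:

func TestRetryableErrorDetection(t *testing.T) {
	cause := errors.New("registry unreachable")
	wrapped := fmt.Errorf("%w: %v", registry.ErrRetryable, cause)

	if !errors.Is(wrapped, registry.ErrRetryable) {
		t.Fatal("expected wrapped error to match registry.ErrRetryable")
	}
}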

jjbustamante · Aug 02 '22 20:08

Note on deliverable:

  • We should ensure that ClusterStores and ClusterStacks are resilient to transient network errors as well

tylerphelan · Aug 10 '22 19:08

Hi @tylerphelan, thanks for the feedback.

I just refactored the PR a bit to add support for the following reconcilers:

  • Builder
  • ClusterBuilder
  • ClusterStore
  • ClusterStack

It would be great if you could take a look, at least to validate the refactoring.

jjbustamante · Aug 16 '22 23:08

Is exponential backoff a good idea here?

tylerphelan · Aug 17 '22 15:08

Is exponential backoff a good idea here?

I would imagine that it would be ideal 🙂

matthewmcnew · Aug 17 '22 18:08

After addressing some feedback from @matthewmcnew, I think the exponential backoff is already in place! Take a look at the following log from the controller, captured after I disconnected my network card to simulate a network issue right after creating a CustomBuilder.

[Screenshot: controller log ("Screen Shot 2022-08-17 at 4 23 44 PM")]
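
For context, this requeue-with-growing-delay behavior comes mostly for free once the error is no longer marked permanent. Below is a sketch of the per-key exponential backoff provided by client-go's default-style workqueue rate limiter, which knative's controller machinery builds on; the values shown are client-go defaults, not kpack configuration:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff (5ms doubling up to 1000s), combined
	// with an overall token bucket for the whole queue.
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)

	// Each consecutive failure of the same key doubles its requeue delay.
	key := "some-namespace/some-builder"
	for i := 0; i < 5; i++ {
		fmt.Println(limiter.When(key)) // 5ms, 10ms, 20ms, 40ms, 80ms
	}
}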

jjbustamante · Aug 17 '22 21:08