kpack
Builders do not recover from temporary issues.
- I created a CustomBuilder object and, at creation time, the specified registry was down.
- As expected, the Ready condition is set to false.
- The registry became available shortly after.
Expected behavior: Eventually the Ready condition will transition to true (if there are no further errors, of course).
Observed behavior: Ready stays false forever.
I think this is caused by unconditionally wrapping any error with NewPermanentError:
https://github.com/pivotal/kpack/blob/0836eeb5791ce1961dbd68bcb6f2802d1034035c/pkg/reconciler/builder/builder.go#L88-L98
According to the knative documentation:

> Users can wrap an error as permanentError with this in reconcile when they do not expect the key to get re-queued.

IMHO this should not be used for errors coming from external dependencies, so that kpack stays resilient against short unavailabilities.
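To make the distinction concrete, here is a minimal sketch of the knative re-queue semantics (the Reconciler, talkToRegistry, and isTransient names are made up for illustration; only the kind of error returned matters): a plain error gets re-queued with the work queue's rate-limited backoff, while a permanent error is never retried.

```go
package example

import (
    "context"
    "strings"

    "knative.dev/pkg/controller"
)

// Reconciler, talkToRegistry, and isTransient are hypothetical stand-ins;
// only the kind of error returned from Reconcile matters here.
type Reconciler struct{}

func (c *Reconciler) talkToRegistry(ctx context.Context) error { return nil }

func isTransient(err error) bool {
    // Naive illustration of classifying an error as temporary.
    return strings.Contains(err.Error(), "connection refused")
}

func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
    if err := c.talkToRegistry(ctx); err != nil {
        if isTransient(err) {
            // A plain error is re-queued by knative with rate-limited backoff.
            return err
        }
        // A permanent error marks the key as done; it is never re-queued.
        return controller.NewPermanentError(err)
    }
    return nil
}
```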
+1 for having this in place! Does anyone have any updates on when a potential solution for this could land? Thx!
Hi guys, if no one else is working on this I would like to give it a shot!
OK, after thinking a bit about this issue, here are my first thoughts.
Kpack implements a custom Client to interact with the registry. This client is based on the GGCR (go-containerregistry) library, which already has a retry mechanism for handling network issues, but I think the default configuration is not enough for us. Maybe we can discuss what a healthy Backoff configuration for kpack would be.
Here is the code I am talking about:

```go
func (t *Client) Fetch(keychain authn.Keychain, repoName string) (v1.Image, string, error) {
    // some code...

    // Current code; I think it uses the defaultRetryBackoff:
    // image, err := remote.Image(reference, remote.WithAuthFromKeychain(keychain))

    // Proposal: let's use a different Backoff configuration:
    image, err := remote.Image(reference, remote.WithAuthFromKeychain(keychain), remote.WithRetryBackoff(remote.Backoff{}))
    if err != nil {
        return nil, "", err
    }

    // more code ...
    return image, identifier, nil
}
```
The defaultRetryBackoff is configured as follows:

```go
var defaultRetryBackoff = Backoff{
    Duration: 1.0 * time.Second,
    Factor:   3.0,
    Jitter:   0.1,
    Steps:    3,
}
```
According to the code comments, this means "Try this three times, waiting 1s after first failure, 3s after second". Maybe if we just configure different values we can fix this issue without changing the internal behavior of using NewPermanentError. I am not saying we shouldn't change that as well, but I think the retry logic for image fetching covers the most common case. My question here would be: is it acceptable that the Pod is going to run longer before the Ready state is set to False?
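For example, something along these lines could work. The values below are only illustrative, not a recommendation, and the variable name is made up; remote.WithRetryBackoff accepts the same Backoff struct shown above.

```go
package registry // sketch: wherever kpack's registry client lives

import (
    "time"

    "github.com/google/go-containerregistry/pkg/v1/remote"
)

// Illustrative only: a more patient backoff than ggcr's default
// (Duration 1s, Factor 3, Steps 3). The concrete values would need
// to be agreed on.
var kpackRetryBackoff = remote.Backoff{
    Duration: 2 * time.Second,
    Factor:   4.0,
    Jitter:   0.1,
    Steps:    5, // waits of roughly 2s, 8s, 32s, 128s between attempts
}

// Then inside Fetch:
//   image, err := remote.Image(reference,
//       remote.WithAuthFromKeychain(keychain),
//       remote.WithRetryBackoff(kpackRetryBackoff))
```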
I think updating the ggcr retry logic to better match our needs makes sense! However, I think sometimes the backing registry is completely down and it may be more appropriate to retry the reconcile many minutes later as well.
Thanks for the feedback @matthewmcnew. Based on that, the other thing we can do is the following:
- In the same Kpack client code, we can mark the error as retry-able so the parent logic knows it can retry later, something like the snippet below (why the sentinel is wrapped with %w is explained in the aside after the snippets):
```go
// ErrRetryable is a sentinel the reconciler can detect with errors.Is.
var ErrRetryable = errors.New("retry-able")

func (t *Client) Fetch(keychain authn.Keychain, repoName string) (v1.Image, string, error) {
    // some code...

    // Proposal: use a different Backoff configuration
    image, err := remote.Image(reference, remote.WithAuthFromKeychain(keychain), remote.WithRetryBackoff(remote.Backoff{}))
    if err != nil {
        // Wrap with %w so ErrRetryable stays in the error chain.
        return nil, "", fmt.Errorf("%w: %v", ErrRetryable, err)
    }

    // more code ...
    return image, identifier, nil
}
```
- In the Builder reconciler, we can use this sentinel to re-queue the key later:
```go
func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
    // some code
    builderRecord, creationError := c.reconcileBuilder(ctx, builder)
    if creationError != nil {
        builder.Status.ErrorCreate(creationError)
        err := c.updateStatus(ctx, builder)
        if err != nil {
            return err
        }
        // We can add something like this, assuming ErrRetryable is exported
        // from the registry client package shown above:
        if errors.Is(creationError, registry.ErrRetryable) {
            return controller.NewRequeueAfter(time.Minute * 30)
        }
        return controller.NewPermanentError(creationError)
    }
    builder.Status.BuilderRecord(builderRecord)
    return c.updateStatus(ctx, builder)
}
```
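Quick aside on the wrapping in the first snippet: the sentinel has to be wrapped with %w rather than just reusing the "retry-able" message text, because errors.Is only matches errors that are actually in the chain. A small standalone sketch:

```go
package main

import (
    "errors"
    "fmt"
)

var ErrRetryable = errors.New("retry-able")

func main() {
    cause := errors.New("registry unreachable")

    // Reusing only the message text does NOT put the sentinel in the chain.
    byMessage := fmt.Errorf("retry-able: %v", cause)
    fmt.Println(errors.Is(byMessage, ErrRetryable)) // false

    // Wrapping the sentinel with %w does.
    bySentinel := fmt.Errorf("%w: %v", ErrRetryable, cause)
    fmt.Println(errors.Is(bySentinel, ErrRetryable)) // true
}
```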
Note on deliverable:
- We should ensure that ClusterStores and ClusterStacks are resilient to transient network errors as well (one possible shape for sharing this logic is sketched below).
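A shared helper could keep the behavior consistent across reconcilers. This is purely a sketch under my own assumptions: the package name, the MapReconcileError function, and the location of the ErrRetryable sentinel are all hypothetical, not existing kpack API.

```go
// Hypothetical shared helper; nothing here is existing kpack API.
package reconcileerr

import (
    "errors"
    "time"

    "knative.dev/pkg/controller"

    // Assumption: the ErrRetryable sentinel is exported by the registry client package.
    "github.com/pivotal/kpack/pkg/registry"
)

// MapReconcileError is what each reconciler (Builder, ClusterBuilder,
// ClusterStore, ClusterStack) could return after recording the failure
// in the resource status.
func MapReconcileError(err error) error {
    if errors.Is(err, registry.ErrRetryable) {
        // Transient registry problem: try again later instead of giving up.
        return controller.NewRequeueAfter(30 * time.Minute)
    }
    // Everything else keeps the current behavior: a permanent error.
    return controller.NewPermanentError(err)
}
```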
Hi @tylerphelan, thanks for the feedback.
I just refactored the PR a bit to add support for the following Reconcilers:
- Builder
- ClusterBuilder
- ClusterStore
- ClusterStack
If you can take a look, that would be great, at least to validate the refactoring.
Is exponential backoff a good idea here?
> Is exponential backoff a good idea here?
I would imagine that it would be ideal 🙂
After addressing some feedback from @matthewmcnew, I think the exponential backoff is already in place! Take a look at the following log from the controller, taken when I disconnected my network card to create a network issue after creating a CustomBuilder.