Make kapp-controller more resilient to network failures while talking to the registry

Open joaopapereira opened this issue 3 years ago • 1 comments

Describe the problem/challenge you have

With this feature, we want to try and address some of the following problems

  • When an application is being reconciled, the lack of access to a registry, in some cases, should not make the reconciliation fail
  • When an application is being reconciled, in some cases, it should not need to retrieve information from the registry in order to complete the reconciliation

kapp-controller tries to bring all running applications to a consistent state every 10 minutes. This means that every 10 minutes kapp-controller needs to retrieve the configuration, template it, and apply it to the cluster. With 10 clusters running 100 applications each, that adds up to 1000 requests to a registry every 10 minutes.
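The load estimate above can be written out as a small calculation. This is only a sketch: the function name is illustrative, and it assumes exactly one registry request per application per sync, which is a simplification.

```python
SYNC_PERIOD_MINUTES = 10  # reconciliation interval described above


def registry_requests_per_sync(clusters: int, apps_per_cluster: int) -> int:
    """Registry requests in one sync period, assuming one request per application."""
    return clusters * apps_per_cluster


per_sync = registry_requests_per_sync(clusters=10, apps_per_cluster=100)
per_hour = per_sync * (60 // SYNC_PERIOD_MINUTES)
print(per_sync, per_hour)  # 1000 6000
```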

This behavior can overwhelm the registry with bursts of requests, causing it to randomly fail to reply. It also forces kapp-controller to rely on the network to ensure eventual consistency of the resources in the cluster. There are some cases where kapp-controller knows that the configuration cannot change, for example when a Package CR points to an OCI image that is referenced by its SHA.
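As a rough illustration of "referenced by its SHA", the check below treats a reference as immutable when it ends in an `@sha256:` digest. The helper name and the example registry host are hypothetical, and this is not how kapp-controller itself parses references.

```python
import re

# A digest reference ends in "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True when the reference names content by digest rather than by tag."""
    return _DIGEST_RE.search(image_ref) is not None


pinned_ref = "registry.example.com/app/config@sha256:" + "a" * 64
tagged_ref = "registry.example.com/app/config:v1.0.0"
print(is_digest_pinned(pinned_ref), is_digest_pinned(tagged_ref))  # True False
```

A tag such as `v1.0.0` can be moved to different content later, so only the digest form is safe to treat as unchanging.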

Describe the solution you'd like

A possible solution would be to implement features in kapp-controller, or in the tools it uses, that detect whether a particular piece of configuration can be reused after the initial retrieval and, if so, keep it, so that kapp-controller does not need to reach out to the registry to get it again.

For an MVP of this feature, we should consider that our target is configuration stored in OCI images. Cases where kapp-controller should keep the configuration:

  • PackageRepository Custom Resource is configured to retrieve the Packages from an OCI image while using the SHA of the image
  • Package Custom Resource is configured to retrieve the configuration from an OCI image while using the SHA of the image
  • App Custom Resource is configured to retrieve the configuration from an OCI image while using the SHA of the image
  • PackageRepository Custom Resource is configured to retrieve the Packages from an imgpkg bundle while using the SHA of the image. We should only consider fully relocated bundles.
  • Package Custom Resource is configured to retrieve the configuration from an imgpkg bundle while using the SHA of the image. We should only consider fully relocated bundles.
  • App Custom Resource is configured to retrieve the configuration from an imgpkg bundle while using the SHA of the image. We should only consider fully relocated bundles.
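The cases listed above share one rule: keep what was fetched when the reference is pinned by SHA, and go back to the registry otherwise. A minimal sketch of that rule follows; `ConfigCache` and `fake_fetch` are hypothetical names used for illustration and do not correspond to kapp-controller or vendir code.

```python
import re
from typing import Callable, Dict, List

_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


class ConfigCache:
    """Keeps fetched configuration, but only for digest-pinned references."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch  # hypothetical function that talks to the registry
        self._store: Dict[str, bytes] = {}

    def get(self, image_ref: str) -> bytes:
        if not _DIGEST_RE.search(image_ref):
            # Tags can move, so always ask the registry.
            return self._fetch(image_ref)
        if image_ref not in self._store:
            self._store[image_ref] = self._fetch(image_ref)
        return self._store[image_ref]


calls: List[str] = []


def fake_fetch(ref: str) -> bytes:
    """Stand-in for a registry request; records every call."""
    calls.append(ref)
    return b"config for " + ref.encode()


cache = ConfigCache(fake_fetch)
pinned = "registry.example.com/app/config@sha256:" + "0" * 64
tagged = "registry.example.com/app/config:latest"

for _ in range(2):
    cache.get(pinned)  # second call is served from the cache
    cache.get(tagged)  # always reaches the "registry"

print(len(calls))  # 3
```

An in-memory dictionary is used only to keep the sketch short; where the configuration would actually be kept (memory, disk, or elsewhere) is left open by this issue.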

The rationale for only considering fully relocated bundles is an imgpkg feature: when imgpkg knows that all images are present in a particular repository, it updates the ImagesLock file on disk to point to the images in that repository. This allows Kubernetes to retrieve the images from a nearer location.
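One way to picture "fully relocated" is as a check that every image listed in the ImagesLock lives in the same repository as the bundle. The sketch below assumes plain `repository@sha256:...` references with no tag, and takes the image list as already parsed from the ImagesLock file; both function names are hypothetical and this is not imgpkg's own logic.

```python
from typing import Iterable


def repository_of(image_ref: str) -> str:
    """Return the part of a digest reference before '@'."""
    repo, sep, _digest = image_ref.partition("@")
    if not sep:
        raise ValueError(f"not a digest reference: {image_ref}")
    return repo


def is_fully_relocated(bundle_ref: str, lock_images: Iterable[str]) -> bool:
    """True when every locked image points at the bundle's own repository."""
    bundle_repo = repository_of(bundle_ref)
    return all(repository_of(image) == bundle_repo for image in lock_images)


digest = "@sha256:" + "b" * 64
bundle = "registry.example.com/team/bundle" + digest
relocated = ["registry.example.com/team/bundle" + digest]
mixed = relocated + ["other.example.org/upstream/image" + digest]

print(is_fully_relocated(bundle, relocated), is_fully_relocated(bundle, mixed))  # True False
```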

Anything else you would like to add:

This is the base issue for tracking the feature, since the work might span multiple tools.

Status of the feature:

  • [x] https://github.com/vmware-tanzu/carvel-imgpkg/issues/390
  • [x] https://github.com/vmware-tanzu/carvel-vendir/issues/160
  • [ ] https://github.com/vmware-tanzu/carvel-kapp-controller/issues/688
  • [ ] https://github.com/vmware-tanzu/carvel-kapp-controller/issues/689

Vote on this request

This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.

👍 "I would like to see this addressed as soon as possible" 👎 "There are other more important things to focus on right now"

We are also happy to receive and review Pull Requests if you want to help work on this issue.

joaopapereira avatar May 05 '22 17:05 joaopapereira

This issue is being marked as stale due to a long period of inactivity and will be closed in 5 days if there is no response.

github-actions[bot] avatar Jun 23 '22 00:06 github-actions[bot]