enhancements icon indicating copy to clipboard operation
enhancements copied to clipboard

KEP-5237: watch-based route controller reconciliation using informers

Open lukasmetzner opened this issue 7 months ago • 17 comments

  • One-line PR description: Introduce a feature gate to enable informer-based reconciliation in the routes' controller of cloud-controller-manager, reducing API calls and improving efficiency.
  • Issue link: https://github.com/kubernetes/enhancements/issues/5237
  • Other comments:

lukasmetzner avatar May 08 '25 08:05 lukasmetzner

Welcome @lukasmetzner!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot avatar May 08 '25 08:05 k8s-ci-robot

Hi @lukasmetzner. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar May 08 '25 08:05 k8s-ci-robot

/cc @elmiko @joelspeed

lukasmetzner avatar May 08 '25 08:05 lukasmetzner

/ok-to-test

apricote avatar May 08 '25 14:05 apricote

@elmiko If possible, I’d love to work through the open questions here in the PR so we can keep things moving, and then have a discussion in the next SIG meeting. Otherwise, we might end up losing a two of weeks of time. What do you think?

Open Questions:

  • What should be the default frequency for the periodic full reconciliation?
    • Input from @JoelSpeed: We should do it similar to other controllers and choose a random time between 12h and 24h.
    • As we use the same shared informers' factory as in other controllers, this should already be implemented.
  • Are there other Node fields besides node.status.addresses and PodCIDRs that should trigger a route update?
  • How should we set the interval for the periodic reconcile? Options:
    • Adjust --route-reconcile-period when feature gate enabled
    • Use --min-resync-period; currently defaults to 12h
    • Introduce a new flag
    • If we use the 12h-24h option we can probably reuse --min-resync-period.

lukasmetzner avatar May 09 '25 10:05 lukasmetzner

i'm fine to continue the discussions here.

Input from @JoelSpeed: We should do it similar to other controllers and choose a random time between 12h and 24h.

i think 12h sounds fine to me.

Are there other Node fields besides node.status.addresses and PodCIDRs that should trigger a route update?

i'll have to think about this a little more, i have a feeling that those are good to start with.

How should we set the interval for the periodic reconcile?

i like the idea of adjusting the default for the --route-reconcile-period, but i don't want users to get confused about this.

my only issue with using --min-resync-period is that it sounds much more general and we are just focusing on the route controller.

elmiko avatar May 09 '25 18:05 elmiko

i like the idea of adjusting the default for the --route-reconcile-period, but i don't want users to get confused about this. my only issue with using --min-resync-period is that it sounds much more general and we are just focusing on the route controller

To my understanding, both the service and node controllers are already watch-based and should utilize the --min-resync-period flag. Adopting this approach would bring consistency across the CCM components. In this context, the --route-reconcile-period flag could be considered for deprecation.

lukasmetzner avatar May 21 '25 13:05 lukasmetzner

ack, thank you @lukasmetzner that makes sense to me. it seems we should focus on --min-resync period then.

elmiko avatar May 27 '25 17:05 elmiko

/auto-cc

Just a friendly reminder, as the 1.34 release window is closing :D

lukasmetzner avatar Jun 10 '25 08:06 lukasmetzner

@lukasmetzner do we need another round of reviews here?

elmiko avatar Jun 10 '25 15:06 elmiko

@elmiko From my point of view, I am happy with the content, and we could get this merged or move forward.

This is my first contribution, have I missed adding some labels? 😅

lukasmetzner avatar Jun 10 '25 16:06 lukasmetzner

This is my first contribution, have I missed adding some labels? 😅

no, i don't think so. just wanted to make sure i didn't miss anything. i will give another review before end of week.

elmiko avatar Jun 10 '25 17:06 elmiko

This is my first contribution, have I missed adding some labels? 😅

no, i don't think so. just wanted to make sure i didn't miss anything. i will give another review before end of week.

Sounds good. With the PRR freeze tomorrow, I guess we won't make it into the v1.34 milestone. I will bump the milestone targets then.

lukasmetzner avatar Jun 11 '25 10:06 lukasmetzner

Sounds good. With the PRR freeze tomorrow, I guess we won't make it into the v1.34 milestone. I will bump the milestone targets then.

ahh, apologies. i missed that the freeze is today. sorry for the inconvenience and delay.

elmiko avatar Jun 11 '25 16:06 elmiko

thanks for the updates @lukasmetzner , i probably won't get a chance to review today but i will review again early next week.

elmiko avatar Jun 13 '25 16:06 elmiko

@elmiko I forgot to update the milestones to now target v1.35 for the alpha.

lukasmetzner avatar Jul 02 '25 14:07 lukasmetzner

@JoelSpeed Just wondering if you’ve had a chance to review the PR again, or if you might be able to soon?

lukasmetzner avatar Jul 08 '25 08:07 lukasmetzner

/lgtm

JoelSpeed avatar Sep 17 '25 16:09 JoelSpeed

@elmiko @JoelSpeed As I have got lgtms from both of you with a small discussion remaining, shall I fill out the production readiness questionnaire? The next PRR freeze for v1.35 is on the 9th of October. If possible, I would like to target v1.35.

lukasmetzner avatar Sep 24 '25 10:09 lukasmetzner

yeah, that sounds good to me @lukasmetzner

elmiko avatar Sep 24 '25 14:09 elmiko

I saw that we already had the questionnaire filled out, so I moved it towards prod-readiness.

IMO we could also add a metric for A/B testing in beta (cc @JoelSpeed)? I would like to avoid missing the PRR freeze because of that :D

I already got lgtms on the KEP in general and have moved the state to implementable.

@elmiko Do you know if the lead-opted-in also needs to be applied to the issue, or only the PR? I used /label lead-opted-in, but I am unsure if this is something I should do, or a sig-cloud-provider lead.

Since I’m quite new here, I hope I’ve followed the process correctly. Apologies in advance if I’ve missed anything.

/assign @elmiko /assign @wojtek-t /label lead-opted-in

lukasmetzner avatar Sep 26 '25 09:09 lukasmetzner

@lukasmetzner: Can not set label lead-opted-in: Must be member in one of these teams: [release-team-enhancements release-team-leads sig-api-machinery-leads sig-apps-leads sig-architecture-leads sig-auth-leads sig-autoscaling-leads sig-cli-leads sig-cloud-provider-leads sig-cluster-lifecycle-leads sig-contributor-experience-leads sig-docs-leads sig-instrumentation-leads sig-k8s-infra-leads sig-multicluster-leads sig-network-leads sig-node-leads sig-release-leads sig-scalability-leads sig-scheduling-leads sig-security-leads sig-storage-leads sig-testing-leads sig-windows-leads]

In response to this:

I saw that we already had the questionnaire filled out, so I moved it towards prod-readiness.

IMO we could also add a metric for A/B testing in beta (cc @JoelSpeed)? I would like to avoid missing the PRR freeze because of that :D

I already got lgtms on the KEP in general and have moved the state to implementable.

@elmiko Do you know if the lead-opted-in also needs to be applied to the issue, or only the PR?

Since I’m quite new here, I hope I’ve followed the process correctly. Apologies in advance if I’ve missed anything.

/assign @elmiko /assign @wojtek-t /label lead-opted-in

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Sep 26 '25 09:09 k8s-ci-robot

@elmiko Do you know if the lead-opted-in also needs to be applied to the issue, or only the PR? I used /label lead-opted-in, but I am unsure if this is something I should do, or a sig-cloud-provider lead.

good question, i will investigate early next week.

elmiko avatar Sep 26 '25 18:09 elmiko

IMO we could also add a metric for A/B testing in beta (cc @JoelSpeed)? I would like to avoid missing the PRR freeze because of that :D

I'm happy with that as long as we can leverage the metric in the beta stage with the feature both enabled and disabled to make a comparison between the two implementations to determine the performance impact

JoelSpeed avatar Oct 08 '25 16:10 JoelSpeed

PRR lgtm. Thank you for addressing metrics early, that always makes things easier for making alpha to beta decisions.

/approve

deads2k avatar Oct 13 '25 14:10 deads2k

@elmiko @JoelSpeed

Do you have a sig-cloud-provider owner to look at this?

kannon92 avatar Oct 13 '25 17:10 kannon92

/approve

elmiko avatar Oct 14 '25 14:10 elmiko

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, elmiko, lukasmetzner

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Oct 14 '25 14:10 k8s-ci-robot

/lgtm

elmiko avatar Oct 14 '25 14:10 elmiko

/assign @wojtek-t

Just for posterity - this proposal LGTM too.

wojtek-t avatar Oct 15 '25 13:10 wojtek-t