karmada icon indicating copy to clipboard operation
karmada copied to clipboard

Optimize binding controllers with fine-grained event filtering to reduce redundant reconciliations

Open rohan-019 opened this issue 2 months ago • 17 comments

Optimize binding controllers with custom event predicate

Summary

Introduces a custom predicate for binding controllers to filter events and reduce unnecessary reconciliations:

  • Ignores delete events (cleanup handled by finalizers)
  • Always processes create events
  • Processes updates only for meaningful changes (generation, DeletionTimestamp, key .spec fields)

Result: Better performance with no behavior changes.

Motivation

Current controllers watch resources with broad predicates, causing redundant reconciliations. Filtering irrelevant events reduces CPU/memory usage, API calls, and cache churn while preserving finalizer cleanup.

Changes

  • Added: ResourceBindingPredicate
  • Updated: Binding controllers to use the new predicate
  • Added: Unit tests

Predicate Logic

  • Create: Always processed
  • Delete: Ignored
  • ⚠️ Update: Only if generation changes, DeletionTimestamp is set, or critical .spec fields change (requiredBy, clusters, resource, gracefulEvictionTasks, suspension, preserveResourcesOnDeletion, conflictResolution)

Event Behavior

Event Type Before After Reason
Create ✅ Process ✅ Process Always needed
Delete ✅ Process ❌ Ignore Finalizers handle cleanup
Update (generation) ✅ Process ✅ Process Meaningful change
Update (status only) ✅ Process ❌ Ignore No reconciliation needed

Files Changed

Added:

  • pkg/controllers/binding/binding_predicate.go
  • pkg/controllers/binding/binding_predicate_test.go

Modified:

  • pkg/controllers/binding/binding_controller.go
  • pkg/controllers/binding/cluster_resource_binding_controller.go

Testing

  • Unit tests cover all event types
  • Delete events ignored as expected
  • Critical .spec changes trigger reconciliation
  • Finalizer cleanup verified via DeletionTimestamp

Backward Compatibility

  • No API changes
  • Finalizer cleanup preserved
  • Override policy watch behavior unchanged

Performance Impact

  • Fewer reconciliations
  • Reduced API calls and cache churn
  • Lower CPU/memory usage

Linked Issue

Part of: #6780

Checklist

  • [x] Unit tests added/passing
  • [x] No breaking changes
  • [x] Predicate applied to both RB and CRB controllers
  • [x] Finalizer handling verified via DeletionTimestamp path

rohan-019 avatar Sep 25 '25 06:09 rohan-019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign zach593 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

karmada-bot avatar Sep 25 '25 06:09 karmada-bot

Summary of Changes

Hello @rohan-019, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the performance of ResourceBinding and ClusterResourceBinding controllers by implementing a custom event predicate. This predicate selectively processes events, preventing redundant reconciliation loops for changes that do not impact the core logic, thereby improving resource efficiency and responsiveness across the system.

Highlights

  • Custom Event Predicate: Introduces a new ResourceBindingPredicate to intelligently filter events for ResourceBinding and ClusterResourceBinding controllers.
  • Optimized Event Handling: Delete events are now ignored (cleanup handled by finalizers), create events are always processed, and update events are processed only for meaningful changes (generation, DeletionTimestamp, or critical .spec fields).
  • Performance Improvement: The fine-grained event filtering significantly reduces unnecessary reconciliations, leading to lower CPU/memory usage, fewer API calls, and reduced cache churn.
  • No Behavior Change: The optimization is achieved without altering the existing functional behavior or backward compatibility of the controllers.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with :thumbsup: and :thumbsdown: on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist[bot] avatar Sep 25 '25 06:09 gemini-code-assist[bot]

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

:x: Patch coverage is 83.33333% with 8 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 45.87%. Comparing base (70411ab) to head (62be891). :warning: Report is 19 commits behind head on master.

Files with missing lines Patch % Lines
pkg/controllers/binding/binding_predicate.go 86.95% 6 Missing :warning:
pkg/controllers/binding/binding_controller.go 0.00% 1 Missing :warning:
...ers/binding/cluster_resource_binding_controller.go 0.00% 1 Missing :warning:
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6784      +/-   ##
==========================================
+ Coverage   45.74%   45.87%   +0.13%     
==========================================
  Files         689      691       +2     
  Lines       57161    57346     +185     
==========================================
+ Hits        26149    26310     +161     
- Misses      29383    29406      +23     
- Partials     1629     1630       +1     
Flag Coverage Δ
unittests 45.87% <83.33%> (+0.13%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov-commenter avatar Sep 25 '25 06:09 codecov-commenter

Hey, @XiShanYongYe-Chang Please take a look at this!

rohan-019 avatar Sep 25 '25 06:09 rohan-019

Thanks a lot @rohan-019

XiShanYongYe-Chang avatar Sep 25 '25 13:09 XiShanYongYe-Chang

Thanks~ Can you help fix the lint error?

I tried several times to fix lint errors using golangci-lint run --fix pkg/controllers/binding/binding_predicate.go pkg/controllers/binding/cluster_resource_binding_controller.go and also attempted a manual fix, but ended up with lint errors. Could you help me figure out what’s going wrong?

rohan-019 avatar Sep 26 '25 05:09 rohan-019

Hi @rohan-019, since this PR is the first one in our series, I have quite a few review comments. ^-^

XiShanYongYe-Chang avatar Sep 28 '25 02:09 XiShanYongYe-Chang

Hi @rohan-019, since this PR is the first one in our series, I have quite a few review comments. ^-^

Hi @XiShanYongYe-Chang , I really appreciate your feedback, especially since this is the first PR in the series. I’ll go through all your comments carefully and make the necessary updates.

Thanks again for taking the time to review.

rohan-019 avatar Sep 28 '25 21:09 rohan-019

Hi @rohan-019, the newly added ut has failed.

XiShanYongYe-Chang avatar Sep 29 '25 01:09 XiShanYongYe-Chang

the newly added ut has failed.

Hey @XiShanYongYe-Chang , I initially removed the generation check based on your comment that the global event filter already handles it. However, after testing We found that both the new unit test (TestResourceBindingPredicate_Update/generation_change) and the related e2e test failed once this check was removed.

The reason is that once we override UpdateFunc with custom logic, the default GenerationChangedPredicate is no longer applied automatically. This means spec-only changes are not triggering reconciliation unless we explicitly handle the generation change inside our predicate.

To keep the behavior correct and align with the current tests, I’ve restored the != generation check in UpdateFunc.

rohan-019 avatar Sep 29 '25 03:09 rohan-019

We found that both the new unit test (TestResourceBindingPredicate_Update/generation_change) and the related e2e test failed once this check was removed.

This test case is problematic because it is unlikely that only the generation will change while other fields remain unchanged. Therefore, it needs to be redesigned or removed.

XiShanYongYe-Chang avatar Sep 29 '25 04:09 XiShanYongYe-Chang

To keep the behavior correct and align with the current tests, I’ve restored the != generation check in UpdateFunc.

There's no need to do this.

XiShanYongYe-Chang avatar Sep 29 '25 04:09 XiShanYongYe-Chang

We found that both the new unit test (TestResourceBindingPredicate_Update/generation_change) and the related e2e test failed once this check was removed.

This test case is problematic because it is unlikely that only the generation will change while other fields remain unchanged. Therefore, it needs to be redesigned or removed.

That makes sense, as the Generation changes are handled globally, so this predicate doesn't need to check for them. But instead of removing the generation check, I redesigned so that this predicate now focuses on catching edge cases where: -Generation hasn't changed BUT -Specific critical spec fields have changed (like RequiredBy, Clusters, etc.)

rohan-019 avatar Sep 29 '25 05:09 rohan-019

Hey @XiShanYongYe-Chang, Don't mind, I accidentally closed PR!!

rohan-019 avatar Sep 29 '25 06:09 rohan-019

Hi @XiShanYongYe-Chang I tried simplifying the deletion check from e.ObjectOld.GetDeletionTimestamp() == nil && e.ObjectNew.GetDeletionTimestamp() != nil to e.ObjectOld.GetDeletionTimestamp() != e.ObjectNew.GetDeletionTimestamp()

as suggested, but this change caused failures in the e2e tests (e.g., migration_and_rollback_test.go timing out).

The reason is that the simplified form triggers reconciliation not only when deletion starts (nil → non-nil), but also if the deletion timestamp is updated while the object is already marked for deletion. That extra trigger led to unexpected reconcile behavior and race conditions, which is why the e2e tests failed.

So it seems we need to keep the original, more restrictive check to ensure reconciliation only fires once at the beginning of deletion.

rohan-019 avatar Sep 29 '25 08:09 rohan-019

The reason is that the simplified form triggers reconciliation not only when deletion starts (nil → non-nil), but also if the deletion timestamp is updated while the object is already marked for deletion. That extra trigger led to unexpected reconcile behavior and race conditions, which is why the e2e tests failed.

Once the timestamp is set, it will not be modified again.

The E2E error should be just an occasional issue and likely not introduced by the current changes. Let's ignore it for now.

Can you help fix the lint errors and compress the commits into one?

XiShanYongYe-Chang avatar Sep 29 '25 09:09 XiShanYongYe-Chang

I guess this PR is part of #6780, so I updated the Fixes to Part of in the PR description.

RainbowMango avatar Oct 09 '25 08:10 RainbowMango