netbird icon indicating copy to clipboard operation
netbird copied to clipboard

[client] add reset for management backoff

Open gamerslouis opened this issue 2 months ago • 2 comments

Describe your changes

Reset client management grpc client backoff after successful connected to management API.

Current Situation: If the connection duration exceeds MaxElapsedTime, when the connection is interrupted, the backoff fails immediately due to timeout and does not actually perform a retry.

Issue ticket number and link

Stack

Checklist

  • [x] Is it a bug fix
  • [ ] Is a typo/documentation fix
  • [ ] Is a feature enhancement
  • [ ] It is a refactor
  • [ ] Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • [ ] I added/updated documentation for this change
  • [ ] Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Enhanced client connection stability and resilience by improving the consistency of retry backoff behavior across all reconnection attempts and stream operations. The backoff strategy state is now properly maintained and reset during successful stream establishment, ensuring more predictable and efficient connection recovery during network disruptions and service interruptions.

✏️ Tip: You can customize this high-level summary in your review settings.

gamerslouis avatar Dec 11 '25 10:12 gamerslouis

Walkthrough

Introduces a shared backoff instance in GrpcClient.Sync, passes it into handleStream, updates handleStream signature, and resets the backoff on successful stream setup and after receiving events to maintain consistent retry state across stream retries.

Changes

Cohort / File(s) Summary
Grpc client backoff propagation
shared/management/client/grpc.go
Adds a local backOff in GrpcClient.Sync; uses it for the initial retry loop instead of creating fresh instances; passes the backoff into handleStream (signature changed to accept backoff.BackOff); resets the backoff on successful stream establishment and after processing events; replaces prior Retry usage to leverage the shared backoff.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Single-file change with focused propagation of a backoff instance.
  • Pay attention to: updated handleStream signature usages, correct backoff resets timing (on stream setup and after event handling), and unchanged error/control-flow semantics.

Poem

🐇 I carried one backoff down the stream,
Not born anew with every retry's gleam.
We hop, we wait, then hop again—so sly,
A rabbit's rhythm beneath the sky. ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding a backoff reset mechanism for the management gRPC client to fix timeout issues.
Description check ✅ Passed The description includes the required change explanation and bug fix checklist, but omits issue ticket number and documentation decision sections.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
  • [ ] 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1299d5603c5a00a0eeaf5ea26d7e9e3ee614cc03 and b6500b0bb6441b898b47cf03d74186595bd1efb6.

📒 Files selected for processing (1)
  • shared/management/client/grpc.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
shared/management/client/grpc.go (1)
util/retry.go (1)
  • Retry (10-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Relay / Unit (amd64, -race)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: JS / Lint
  • GitHub Check: Darwin
  • GitHub Check: Linux
  • GitHub Check: Windows
  • GitHub Check: release_ui
  • GitHub Check: release
  • GitHub Check: release_ui_darwin
  • GitHub Check: Android / Build
  • GitHub Check: Client / Unit
  • GitHub Check: iOS / Build
  • GitHub Check: Client / Unit
🔇 Additional comments (6)
shared/management/client/grpc.go (6)

114-115: LGTM! Clean backoff initialization.

Creating and resetting the backoff instance here ensures consistent retry state across stream retries within a single Sync call. This is the foundation for preventing immediate timeout when connection duration exceeds MaxElapsedTime.


134-134: LGTM! Correct propagation of shared backoff.

Passing the backoff instance to handleStream enables it to reset the retry state after successful operations, which is essential for the fix.


137-137: LGTM! Consistent use of shared backoff.

Using the shared backoff instance in the retry loop ensures consistent retry behavior across attempts.


145-146: Signature change is appropriate and non-breaking.

The handleStream method signature now includes the backOff backoff.BackOff parameter to receive the shared backoff instance. Since this is a private method, and the only caller at line 134 has been correctly updated to pass the new parameter, this change poses no breaking changes.


160-160: Verify intended backoff reset after successful stream connection.

Line 160 is marked as changed but appears blank. According to the AI summary and PR objectives, the backoff should be reset after successful stream establishment. While the reset at line 165 handles the retry case, a reset here would provide clearer semantics: "successful connection = reset retry state."

If a reset is intended here, apply this diff:

 	log.Infof("connected to the Management Service stream")
+	backOff.Reset()

 	c.notifyConnected()

Likely an incorrect or invalid review comment.


164-165: LGTM! This is the key fix for the MaxElapsedTime issue.

Resetting the backoff after receiveEvents returns ensures that even if the stream was active for longer than MaxElapsedTime (3 months), subsequent connection failures will start with a fresh retry budget rather than timing out immediately. The placement before error checking is correct since:

  • Successful reception followed by any error should reset retry state
  • Permanent errors will exit the retry loop regardless
  • Retryable errors benefit from fresh backoff

The Reset() method clears the elapsed-time timer, restarting the retry window from zero on each new connection attempt.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Dec 11 '25 10:12 coderabbitai[bot]