azure-container-networking icon indicating copy to clipboard operation
azure-container-networking copied to clipboard

Fix controller-manager failures causing CNS to remain incorrectly healthy

Open Copilot opened this issue 6 months ago • 5 comments

Problem

Currently, CNS will repeatedly try to start controller-manager even if it encounters terminal errors, keeping CNS in an incorrectly healthy state. This affects both controller-manager in CRD mode and multiTenantController in MultiTenantCRD mode.

Solution

This PR adds timeouts for controller startup and proper error handling to ensure CNS fails appropriately when controllers can't be started. Key changes:

  1. Added timeout contexts for both controller-manager and multiTenantController with a 5-minute default
  2. Added new metrics to track timeout occurrences:
    • cns_ctrlmanager_start_timeouts_total
    • cns_multitenantcontroller_start_timeouts_total
  3. Made CNS handle terminal errors more gracefully by propagating errors through channels instead of calling os.Exit directly from goroutines
  4. Simplified goroutine structure for starting controllers and improved error reporting
  5. Extracted common functionality to reusable functions for better maintainability

Before

  • CNS would indefinitely retry controller-manager.Start on failures
  • No timeout was set for controller startup attempts
  • Terminal errors wouldn't cause CNS to fail appropriately

After

  • Controller startup attempts have a reasonable timeout
  • Terminal errors are properly propagated and handled
  • CNS will report unhealthy status when controllers fail to start

These changes ensure CNS will not remain in an incorrectly healthy state when controllers fail to initialize properly.

Fixes #3687.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot avatar May 27 '25 19:05 Copilot

@copilot Please make only the minimum changes necessary to guard manager.Start on L1615 in cns/service/main.go. This is the method call in question that must be fixed. Change it so the context passed into this has a reasonable timeout. If manager.Start on this line produces an error, or the context expires, the entire process should terminate.

timraymond avatar Jun 12 '25 14:06 timraymond

@copilot Please make only the minimum changes necessary to guard manager.Start on L1615 in cns/service/main.go. This is the method call in question that must be fixed. Change it so the context p...

Made minimal changes as requested in commit c9b5c00. Added 5-minute timeout context to manager.Start() and process terminates with os.Exit(1) if it fails or times out. Removed the retry loop to keep changes surgical.

Copilot avatar Jun 12 '25 14:06 Copilot

@timraymond Unfortunately I hit an unexpected error while processing your comment. I've automatically reported this to GitHub.

You can ask me to try again later by mentioning me in a new comment.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 054e6f75-7569-4569-ad2a-71faf0af21ad

Sorry for the inconvenience!

Copilot avatar Jun 12 '25 14:06 Copilot

@copilot fix lints

  cns/service/main.go:1619: File is not `gci`-ed with --skip-generated -s standard -s default (gci)
  		
  cns/service/main.go:1629: File is not `gci`-ed with --skip-generated -s standard -s default (gci)
  	
  cns/service/main.go:1642: File is not `gci`-ed with --skip-generated -s standard -s default (gci)

rbtr avatar Jun 13 '25 18:06 rbtr

@copilot fix lints

  cns/service/main.go:1619: File is not `gci`-ed with --skip-generated -s standard -s default (gci)

...

Fixed the gci linting issues by removing trailing whitespace from lines 1619, 1629, and 1642 in commit 9ee0904.

Copilot avatar Jun 13 '25 18:06 Copilot

This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days

github-actions[bot] avatar Jun 28 '25 00:06 github-actions[bot]

Pull request closed due to inactivity.

github-actions[bot] avatar Jul 05 '25 00:07 github-actions[bot]