azure-container-networking
azure-container-networking copied to clipboard
Fix controller-manager failures causing CNS to remain incorrectly healthy
Problem
Currently, CNS will repeatedly try to start controller-manager even if it encounters terminal errors, keeping CNS in an incorrectly healthy state. This affects both controller-manager in CRD mode and multiTenantController in MultiTenantCRD mode.
Solution
This PR adds timeouts for controller startup and proper error handling to ensure CNS fails appropriately when controllers can't be started. Key changes:
- Added timeout contexts for both controller-manager and multiTenantController with a 5-minute default
- Added new metrics to track timeout occurrences:
cns_ctrlmanager_start_timeouts_totalcns_multitenantcontroller_start_timeouts_total
- Made CNS handle terminal errors more gracefully by propagating errors through channels instead of calling
os.Exitdirectly from goroutines - Simplified goroutine structure for starting controllers and improved error reporting
- Extracted common functionality to reusable functions for better maintainability
Before
- CNS would indefinitely retry controller-manager.Start on failures
- No timeout was set for controller startup attempts
- Terminal errors wouldn't cause CNS to fail appropriately
After
- Controller startup attempts have a reasonable timeout
- Terminal errors are properly propagated and handled
- CNS will report unhealthy status when controllers fail to start
These changes ensure CNS will not remain in an incorrectly healthy state when controllers fail to initialize properly.
Fixes #3687.
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.
@copilot Please make only the minimum changes necessary to guard manager.Start on L1615 in cns/service/main.go. This is the method call in question that must be fixed. Change it so the context passed into this has a reasonable timeout. If manager.Start on this line produces an error, or the context expires, the entire process should terminate.
@copilot Please make only the minimum changes necessary to guard
manager.Starton L1615 incns/service/main.go. This is the method call in question that must be fixed. Change it so the context p...
Made minimal changes as requested in commit c9b5c00. Added 5-minute timeout context to manager.Start() and process terminates with os.Exit(1) if it fails or times out. Removed the retry loop to keep changes surgical.
@timraymond Unfortunately I hit an unexpected error while processing your comment. I've automatically reported this to GitHub.
You can ask me to try again later by mentioning me in a new comment.
If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 054e6f75-7569-4569-ad2a-71faf0af21ad
Sorry for the inconvenience!
@copilot fix lints
cns/service/main.go:1619: File is not `gci`-ed with --skip-generated -s standard -s default (gci)
cns/service/main.go:1629: File is not `gci`-ed with --skip-generated -s standard -s default (gci)
cns/service/main.go:1642: File is not `gci`-ed with --skip-generated -s standard -s default (gci)
@copilot fix lints
cns/service/main.go:1619: File is not `gci`-ed with --skip-generated -s standard -s default (gci)
...
Fixed the gci linting issues by removing trailing whitespace from lines 1619, 1629, and 1642 in commit 9ee0904.
This pull request is stale because it has been open for 2 weeks with no activity. Remove stale label or comment or this will be closed in 7 days
Pull request closed due to inactivity.