shell-operator
Fix/stop informer
Overview
Implement graceful shutdown for client-go informers and add structured shutdown logs across shell-operator components.
What this PR does / why we need it
- Introduces proper lifecycle tracking and waiting for informer goroutines to exit:
  - Adds a `done` channel to `Factory` and starts `informer.Run(...)` exactly once per factory; closes `done` when the goroutine exits.
  - `FactoryStore.Stop(...)` now, upon removing the last handler, cancels the shared context and waits for `<-done` before deleting the factory entry.
  - Adds `FactoryStore.WaitStopped(index)` to block until the factory for a given `FactoryIndex` has fully stopped.
- Adds completion signaling to informers:
  - `resourceInformer.wait()` waits for shared informer termination via `FactoryStore.WaitStopped`.
  - `namespaceInformer` wraps `Run()` with a `done` channel and exposes `wait()`.
- Adds upper-level waits:
  - `Monitor` now has `Wait()` to wait for all resource/namespace informers to stop.
  - `KubeEventsManager` now has `Wait()` to aggregate `Monitor.Wait()` calls without holding locks while blocking.
- Makes operator shutdown truly graceful with logs:
  - `Shutdown()` now logs each stage, cancels `KubeEventsManager`, waits for completion (`Wait()`), then stops the queues and waits for them.
Special notes for your reviewer
- Concurrency:
  - `KubeEventsManager.Wait()` snapshots monitors under `RLock` and releases the lock before waiting, avoiding deadlocks and map iteration races.
  - `FactoryStore.Stop(...)` waits outside the lock for the `done` channel, then reacquires the lock to delete the factory and broadcast to waiters.
- Logging:
  - Added clear shutdown logs to `operator.Shutdown()`; feel free to request additional log details or levels.
- Risk/mitigation:
  - If an informer fails to exit after `cancel()`, waits will block. Current behavior intentionally favors correctness. If desired, we can add timeouts to `Wait()` paths in a follow-up.
- How to validate:
  - Start shell-operator; send SIGTERM/SIGINT; observe logs:
    - “shutdown: begin” → “schedule manager stopped” → “kube events manager canceled, waiting for informers” → “kube events manager done” → “task queues stop signaled, waiting” → “task queues stopped”.