shell-operator icon indicating copy to clipboard operation
shell-operator copied to clipboard

Fix/stop informer

Open timmilesdw opened this issue 3 months ago • 0 comments

Overview

Implement graceful shutdown for client-go informers and add structured shutdown logs across shell-operator components.

What this PR does / why we need it

  • Introduces proper lifecycle tracking and waiting for informer goroutines to exit:
    • Adds a done channel to Factory and starts informer.Run(...) exactly once per factory; closes done when the goroutine exits.
    • FactoryStore.Stop(...) now, upon removing the last handler, cancels the shared context and waits for <-done before deleting the factory entry.
    • Adds FactoryStore.WaitStopped(index) to block until a factory for a given FactoryIndex fully stops.
  • Adds completion signaling to informers:
    • resourceInformer.wait() waits for shared informer termination via FactoryStore.WaitStopped.
    • namespaceInformer wraps Run() with a done channel and exposes wait().
  • Adds upper-level waits:
    • Monitor now has Wait() to wait for all resource/namespace informers to stop.
    • KubeEventsManager now has Wait() to aggregate Monitor.Wait() calls without holding locks while blocking.
  • Makes operator shutdown truly graceful with logs:
    • Shutdown() now logs each stage, cancels KubeEventsManager, then waits for completion (Wait()), then stops and waits for queues.

Special notes for your reviewer

  • Concurrency:
    • KubeEventsManager.Wait() snapshots monitors under RLock and releases the lock before waiting, avoiding deadlocks and map iteration races.
    • FactoryStore.Stop(...) waits outside the lock for the done channel, then reacquires the lock to delete the factory and broadcast to waiters.
  • Logging:
    • Added clear shutdown logs to operator.Shutdown(); feel free to request additional log details or levels.
  • Risk/mitigation:
    • If an informer fails to exit after cancel(), waits will block. Current behavior intentionally favors correctness. If desired, we can add timeouts to Wait() paths in a follow-up.
  • How to validate:
    • Start shell-operator; send SIGTERM/SIGINT; observe logs:
      • “shutdown: begin” → “schedule manager stopped” → “kube events manager canceled, waiting for informers” → “kube events manager done” → “task queues stop signaled, waiting” → “task queues stopped”.

timmilesdw avatar Sep 04 '25 10:09 timmilesdw