deadlock in startup for large clusters

Open sudheerv opened this issue 6 months ago • 1 comments

The channel event queue is a sized buffer (default=5000), and sends to it block once more events are pending than the buffer can hold. This can happen on a large cluster with 5K nodes. Flannel sets up the informers during startup, and while the informer callbacks run asynchronously, NewSubnetManager uses wait.PollUntilContextTimeout() to check for informer sync completion (with a timeout of 10 min). The problem is that AddEvent blocks when the channel buffer is full while still holding the DeltaFIFO lock. That same lock is needed by the cache controller's HasSynced(), which is called from the condition callback passed to wait.PollUntilContextTimeout(). The result is a deadlock: the HasSynced() call never returns, so the main thread stays blocked indefinitely instead of hitting the 10 min timeout.
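
The lock-plus-full-channel pattern described above can be reproduced in isolation. The following is only a minimal sketch, not flannel or client-go code: the `queue` type, `addEvent`, `hasSynced`, `syncedWithin`, and `demo` are hypothetical stand-ins, the buffer size is shrunk from 5000 to 8, and it assumes nothing drains the channel while the sync check is running.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// queue is a hypothetical stand-in for the DeltaFIFO plus the event channel:
// a single mutex guards both event delivery and the sync check.
type queue struct {
	mu     sync.Mutex
	events chan int
}

// addEvent mimics the blocking handler: it sends on the bounded channel
// while holding the lock.
func (q *queue) addEvent(e int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.events <- e // blocks once the buffer is full and nobody is receiving
}

// hasSynced mimics a sync check that needs the same lock as addEvent.
func (q *queue) hasSynced() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return true
}

// syncedWithin reports whether hasSynced returns within d. The check runs in
// its own goroutine so that this demo can observe the hang instead of hanging.
func syncedWithin(q *queue, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		q.hasSynced()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// demo delivers n events into a buffer of size bufSize with no consumer
// running (an assumption of this sketch), then reports whether hasSynced
// can still complete.
func demo(bufSize, n int) bool {
	q := &queue{events: make(chan int, bufSize)}
	go func() {
		for i := 0; i < n; i++ {
			q.addEvent(i)
		}
	}()
	time.Sleep(50 * time.Millisecond) // give the producer time to fill the buffer
	return syncedWithin(q, 200*time.Millisecond)
}

func main() {
	fmt.Println("events fit in buffer, sync completes:", demo(8, 8))
	fmt.Println("events exceed buffer, sync completes:", demo(8, 9))
}
```

With as many events as buffer slots the sync check completes; with one event more, the producer parks inside `addEvent` holding the mutex and the check never returns, which is the same shape as the hang reported here.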

Below is a summary of the stack trace from a flanneld that was stuck due to this issue.

| Goroutine | Waiting On | Why |
|---|---|---|
| 81 | Sending to a channel in `handleAddLeaseEvent()` | Channel is full or receiver is gone |
| 1 | `.HasSynced()` → needs lock | Main goroutine, can't proceed |
| 88 | `.Resync()` → needs lock | Reflector, also blocked |
| 116 | `.Update()` → needs lock | Reflector, blocked too |

Jun 19 '25 01:06 sudheerv

flannel.stacktrace.hanging.rootcause.sanitized.txt flannel.stacktrace.hanging.sanitized.txt

Jun 19 '25 02:06 sudheerv