
deadlock in startup for large clusters

Open sudheerv opened this issue 6 months ago • 1 comments

The channel event queue is a fixed-size buffer (default=5000), and sends to it block once more events are pending than the buffer can hold. This can happen on a large cluster with 5K nodes. Flannel sets up the informers during startup, and while the informer callbacks run asynchronously, NewSubnetManager uses wait.PollUntilContextTimeout() to check for informer sync completion (with a timeout of 10 min). The problem is that AddEvent blocks when the channel buffer is full while still holding the DeltaFIFO lock, and that same lock is needed by the cache controller's HasSynced(), which is called from the condition callback passed to wait.PollUntilContextTimeout(). Because that callback never returns, the 10 min timeout never gets a chance to fire, and the result is a deadlock in which the main thread is blocked indefinitely.

Below is the summary from stacktrace on a stuck flanneld due to this issue.

| Goroutine | Waiting On | Why |
|-----------|------------|-----|
| 81 | Sending to a channel in handleAddLeaseEvent() | Channel is full or receiver is gone |
| 1 | .HasSynced() → needs lock | Main goroutine, can't proceed |
| 88 | .Resync() → needs lock | Reflector, also blocked |
| 116 | .Update() → needs lock | Reflector, blocked too |
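
The lock interaction described above can be reproduced in miniature with only the standard library. This is a minimal sketch, not flannel or client-go code: `fakeQueue`, `deliver`, `hasSynced`, `syncedWithin`, and `stalls` are hypothetical names, a single mutex stands in for the DeltaFIFO lock, and it assumes nothing drains the channel while startup is waiting for sync.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeQueue is a hypothetical stand-in for the informer queue: one mutex
// guards both event delivery and the sync check.
type fakeQueue struct {
	mu     sync.Mutex
	events chan int
}

// deliver mimics an event handler that is invoked with the queue lock held
// and performs a blocking send on a bounded channel.
func (q *fakeQueue) deliver(ev int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.events <- ev // blocks while holding mu once the buffer is full
}

// hasSynced mimics a sync check that needs the same lock.
func (q *fakeQueue) hasSynced() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return true
}

// syncedWithin reports whether hasSynced returns within d. The check runs in
// its own goroutine so that this demo can give up instead of hanging.
func syncedWithin(q *fakeQueue, d time.Duration) bool {
	done := make(chan bool, 1)
	go func() { done <- q.hasSynced() }()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// stalls pushes `events` events into a channel of capacity `buffer` with no
// consumer, then reports whether the sync check fails to return in time.
func stalls(buffer, events int) bool {
	q := &fakeQueue{events: make(chan int, buffer)}
	go func() {
		for i := 0; i < events; i++ {
			q.deliver(i)
		}
	}()
	time.Sleep(50 * time.Millisecond) // give the producer time to fill the buffer
	return !syncedWithin(q, 200*time.Millisecond)
}

func main() {
	fmt.Println("buffer=4 events=4 stalls:", stalls(4, 4))
	fmt.Println("buffer=4 events=5 stalls:", stalls(4, 5))
}
```

When the events fit in the buffer, the producer releases the lock and the sync check returns. With one event more than the buffer holds, the producer parks on the send with the lock held and the sync check cannot proceed, which is the same shape as goroutines 81 and 1 in the table. The sleeps are only there to make the demo terminate; the real condition callback has no such escape hatch.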

sudheerv · Jun 19 '25 01:06