deadlock in startup for large clusters
The channel event queue is a fixed-size buffer (default 5000), and sends to it block once more events are pending than the buffer can hold. This can happen on a large cluster with 5K nodes. Flannel sets up the informers during startup, and although the informer callbacks run asynchronously, NewSubnetManager uses wait.PollUntilContextTimeout() to check for informer sync completion (with a timeout of 10 minutes). The problem is that AddEvent blocks when the channel buffer is full while still holding the DeltaFIFO lock, and the cache controller's HasSynced(), which is called from the condition function passed to wait.PollUntilContextTimeout(), needs that same lock. The result is a deadlock: the blocked HasSynced() call never returns to the poll loop, so the main goroutine is blocked indefinitely.
Below is a summary of the stack trace from a flanneld process stuck on this issue.
| Goroutine | Waiting On | Why |
|---|---|---|
| 81 | Sending to a channel in handleAddLeaseEvent() | Channel is full or receiver is gone |
| 1 | .HasSynced() → needs lock | Main goroutine, can't proceed |
| 88 | .Resync() → needs lock | Reflector, also blocked |
| 116 | .Update() → needs lock | Reflector, blocked too |