Race Condition in Stream Multiplexing
Long story short
If one is "lucky" enough to hit the default settings.batching.batch_window pause on an established multiplexed stream at just the wrong moment, events may not get dispatched: they are lost and never processed. There is a race condition in the handling of the backlog queue.
Description
kopf.reactor.queueing.watcher(...) and kopf.reactor.queueing.worker(...) do not properly synchronize their modifications of the backlog queue, which leads to lost events.
After moving the operator to a different environment, I began seeing a lot of seemingly missed events. At first I thought the cluster API had issues, because twice the missed events correlated with a 500 Internal Server Error from the kube API shortly before.
However, after seeing many more missed events without any kube API server error before or after, I started debugging kopf.reactor.queueing.watcher(...) and kopf.reactor.queueing.worker(...).
What I found:
https://github.com/nolar/kopf/blob/e90ceaa9faeb000ea1ede52067b4389dd5609c92/kopf/reactor/queueing.py#L199 may successfully put an event without triggering a KeyError (which would spawn a new worker), while the task waiting for a new event at https://github.com/nolar/kopf/blob/e90ceaa9faeb000ea1ede52067b4389dd5609c92/kopf/reactor/queueing.py#L288-L290 has effectively already timed out but has not yet raised, because the event loop has not come back to it yet.
This leads to situations where backlog.qsize() > 0 at https://github.com/nolar/kopf/blob/e90ceaa9faeb000ea1ede52067b4389dd5609c92/kopf/reactor/queueing.py#L291-L292, which is exactly what I am seeing for each of the missed events in the operator. Breaking at that point with a non-zero qsize then discards the event, leaving it unprocessed, because of the queue deletion that happens at https://github.com/nolar/kopf/blob/e90ceaa9faeb000ea1ede52067b4389dd5609c92/kopf/reactor/queueing.py#L324-L330.
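To make the suspected interleaving concrete, here is a minimal, self-contained sketch of the pattern. This is not kopf's actual code: the names (`streams`, `worker`), the timings, and the blocking `time.sleep()` standing in for a laggy synchronous call are all made up for illustration. The stall makes the worker's idle timeout and the arrival of the next event become due in the same event-loop iteration; the worker then times out with a non-empty backlog, breaks, and tears the stream down together with the queued event. The exact outcome depends on event-loop scheduling, so treat it as an illustration rather than a guaranteed reproducer:

```python
import asyncio
import time

streams = {}        # key -> backlog queue, as in the stream multiplexer
processed = []      # events that the worker actually handled

IDLE_TIMEOUT = 0.5  # stand-in for the batching pause (settings.batching.batch_window)

async def worker(key):
    backlog = streams[key]
    try:
        while True:
            try:
                event = await asyncio.wait_for(backlog.get(), timeout=IDLE_TIMEOUT)
            except asyncio.TimeoutError:
                break  # idle timeout: exit, even though a late put may have just landed
            processed.append(event)
    finally:
        # The stream is torn down together with whatever is still sitting in it.
        print(f"worker exits; events still in backlog: {backlog.qsize()}")
        del streams[key]

async def main():
    key = "obj-1"
    streams[key] = asyncio.Queue()
    streams[key].put_nowait("event-1")
    worker_task = asyncio.create_task(worker(key))

    # Simulate a laggy synchronous call: it stalls the whole event loop until
    # *after* both the next event and the worker's idle timeout are due.
    asyncio.get_running_loop().call_later(0.2, time.sleep, 0.6)

    await asyncio.sleep(0.3)                # "event-2" arrives while the loop is stalled
    if key in streams:                      # still True: the worker has not resumed yet
        streams[key].put_nowait("event-2")  # enqueued, but no new worker is spawned
    else:
        print("stream already gone: the real watcher would hit KeyError and spawn a new worker")

    await worker_task
    print(f"processed: {processed}")        # typically only ['event-1']: 'event-2' is lost

asyncio.run(main())
```

On my understanding, this typically prints `processed: ['event-1']` with one event still in the backlog; without the stalling `call_later(...)`, both events get processed.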
Environment
- Kopf version: 1.30.2
- Kubernetes version: 1.17
- Python version: 3.9.1rc1
- OS/platform: Linux
@paxbit Thanks for reporting this issue in such detail. It let me fully understand the problem directly from the description, without a repro. Indeed, this might be the case.
May I ask what the scale of the cluster you are operating is? Roughly, in terms of the number of resources being handled, just the order of magnitude: tens, hundreds, thousands? I am trying to understand under which circumstances this event loss becomes realistic.
Meanwhile, although I cannot reproduce this issue locally, I have prepared a hypothetical fix: #732. Can you take a look, please? Will it fix the issue? Could you try this patch or branch in your environment?
@nolar
> May I ask what the scale of the cluster you are operating is? Roughly, in terms of the number of resources being handled, just the order of magnitude: tens, hundreds, thousands? I am trying to understand under which circumstances this event loss becomes realistic.
The cluster is 24 nodes, each with 96-128 SMT cores and 1.5 TB RAM, and in production the operator will probably have to handle 500-1000 resources. But on the day I wrote this issue, I could reproduce it reliably with only about 5 managed resources. That day, the various latencies added up in such a way that it took only 2-4 runs to hit the situation.
> Meanwhile, although I cannot reproduce this issue locally, I have prepared a hypothetical fix: #732. Can you take a look, please? Will it fix the issue? Could you try this patch or branch in your environment?
Yeah, that's basically how my monkey patch is doing it currently. Seems to work.
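For illustration only, a heavily simplified sketch of such a guard, reusing the names from the snippet above (this is neither the actual monkey patch nor necessarily what #732 does; the real worker also handles batching and depletion): after a TimeoutError, keep consuming if anything is queued, and unregister the stream before exiting so that a concurrent put hits the KeyError path and respawns a worker.

```python
while True:
    try:
        event = await asyncio.wait_for(backlog.get(), timeout=IDLE_TIMEOUT)
    except asyncio.TimeoutError:
        if not backlog.empty():
            continue          # a put sneaked in around the timeout: keep consuming
        del streams[key]      # unregister first, then exit; there is no await between
        break                 # the check and the deletion, so no put can interleave
    processed.append(event)
```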
Good news! In #784, a similar issue with event loss was reported. Unlike here, it happened not at a big scale but at a relatively small one, with multiple synchronous functions inside async coroutines, which indirectly simulated an extremely laggy network. As a result, I was able to catch it in an artificial snippet with 100% reproducibility even with only 1 object involved (see the manual scenario in #732). After thorough consideration, that hypothetical fix can now be considered a real and proper fix (though still untestable).
The fix has been released together with some other improvements as version 1.33rc1. As with any release candidate, be careful with testing: do not test it on real production or staging clusters; start with local, isolated environments if possible. However, the changeset is not big, so the risk seems low this time. The proper release is expected in 1-2 weeks from now.