kafka-flow icon indicating copy to clipboard operation
kafka-flow copied to clipboard

Partition ownership overlap during unideal circumstances

Open tobiajo opened this issue 4 months ago • 1 comments
trafficstars

Description

I have observed, that during unideal circumstances, like Kafka maintenance, network issues or many restarts during high load. There is a possiblity of having multiple activate PartitionFlow for the same partition and therefore multiple snapshot store writers for the same key. In worst case this can lead to newer snapshots being overridden by older snapshots.

On consumer side there is proper partition ownership gauranteed by Kafka, but when a partition is revoked on session timeout, there is a delay until the previous owner gets aware of it and the partition flow is stopped. In this time window there is a risk of introduced data corruption. First when the previous owner polls or commit, it will discover being kicked out.

Scenario

Here is a figure describing what can happen:

sequenceDiagram
    autonumber

    participant inputTopic
    participant nodeA
    participant nodeB
    participant nodeC
    participant snapshotStore

    inputTopic-->>+nodeA: assign partition 3
    nodeA->>snapshotStore: read
    snapshotStore-->>nodeA: state_0
    nodeA->>inputTopic: poll events
    inputTopic-->>nodeA: seqNr 1 to 5
    nodeA->>nodeA: state_5 = state_0 + events seqNr 1 to 5
    Note over inputTopic, nodeA: unideal circumstances start,<br/>nodeA revoked from partition 3

    inputTopic-->>+nodeB: assign partition 3
    nodeB->>snapshotStore: read
    snapshotStore-->>nodeB: state_0
    nodeB->>inputTopic: poll events
    inputTopic-->>nodeB: seqNr 1 to 5
    nodeB->>nodeB: state_5 = state_0 + events seqNr 1 to 5
    nodeB->>inputTopic: poll events
    inputTopic-->>nodeB: seqNr 6 to 10
    nodeB->>nodeB: state_10 = state_5 + events seqNr 6 to 10
    nodeB->>snapshotStore: write state_10
    nodeB->>inputTopic: commit seqNr 10

    nodeA->>snapshotStore: ⚠️ write state_5
    nodeA->>inputTopic: ❌ commit seqNr 5
    inputTopic-->>nodeA: not part of active group
    nodeA->>-nodeA: stop partition flow

    nodeB->>-nodeB: stop partition flow

    inputTopic-->>+nodeC: assign partition 3
    nodeC->>snapshotStore: read
    snapshotStore-->>nodeC: state_5
    nodeC->>inputTopic: poll events
    inputTopic-->>nodeC: seqNr 11 to 15
    nodeC->>-nodeC: 💥 corrupt = state_5 + events seqNr 11 to 15

In a very extreme case, similar to the scenario in the figure, I observed that nodeA stopped the partition flow first 45 seconds after nodeB started it, for the same partition. As nodeA does not become aware being kicked out before trying to commit offset (19) and getting an error response (20).

Solution

I don't have any straightforward solution to the problem, but the main issue that needs to be prevented is stale writes. Here are some ideas:

  1. Transactional snapshot write + offset commit If a snapshot could only be written when the consumer can commit, it would ensure that only the proper partition owner can write.
  2. Transactional snapshot read + snapshot write If the value in snapshot store is asserted to be the previous, stale writes could be prevented.
  3. Offset tracking in snapshot If snapshots contained offset, on recovery the lowest offset accross all keys in the partition could be used.
  4. Distributed lock If partition ownership could be fully exclusive through some lease, multiple writers would be impossible.

Workarounds

In the meanwhile, I would appriciate any ideas for workarounds. The only thing I can I can think of is using static partition assignments, to make it impossible that multiple nodes are active for the same partition.

tobiajo avatar Jul 18 '25 11:07 tobiajo