pinot icon indicating copy to clipboard operation
pinot copied to clipboard

Incorrect deletion of valid upsert segments due to replica-level validDocIds inconsistency during merge-compaction and server restarts

Open tarun11Mavani opened this issue 3 months ago • 5 comments

Description:

We observed data loss in a FULL upsert table caused by an incorrect segment-deletion decision during the UpsertMergeCompactionTask. Two segments (S1, S2) existed across two servers and were merged into a new segment (MS1). Due to async OFFLINE→ONLINE state transitions during a server restart, tie-breaking (shouldReplaceOnComparisonTie) caused validDocIds to diverge between replicas:

S1 (creation time T1) and S2 (creation time T2) was merged and we created a new merged segment MS1 with creation time T2 (max of T1 and T2). S1 and S2 were already compacted in the past runs and were present as uploaded segments earlier.

  • Node1 retained valid records in S2 and MS1. Due to MS1 and S2 having same creationTime, records in S2 were retained while all records from S1 were invalided and marked valid in MS1.
  • Node2 retained valid records in S2 and MS1. Due to MS1 and S2 having same creationTime, records in S2 were retained while all records from S1 were invalided and marked valid in MS1.
  • After server replacement on Node2, MS1 was loaded before S1 and S2. So Node2 retained valid records only in MS1; S1 & S2 appeared fully invalid
  • A subsequent doSnapshot() persisted 0 validDocIds for S2 on Node2. Because the merge-compaction task deletes a segment if any replica reports totalInvalidDoc == totalDocs, S2 was incorrectly deleted from both replicas, even though Node1 still contained valid PKs.

Result: Permanent data loss of valid primary keys on one node.

Fixes

  • Instead of checking if one of the replicas has zero validDocIds, we should check that all replicas have 0 validDocIds here.
  • We upload a new segment with creationTime = max (creationTime of all merging segments). This causes the merged segment and largest segment to have same creation time and all records in largest merging segment is replaced with merging segment due to tiebrealking logic in shouldReplaceOnComparisonTie. We should set the creationTime = max (creationTime of all merging segments) + 1 to avoid this and fully replace all records from merging segments with new merged segment.

tarun11Mavani avatar Dec 09 '25 08:12 tarun11Mavani

Example: There are two replica of segments on Node1 and Node2. Number inside () shows the number of validDocs.

Image

tarun11Mavani avatar Dec 09 '25 11:12 tarun11Mavani

Instead of checking if one of the replicas has zero validDocIds, we should check that all replicas have 0 validDocIds here.

Checking both replicas for validDocIds to be 0 before selecting for deletion in minion is a quick safety fix. We need to handle the scenario when the other replica is unavailable/restarting, it will be better to skip it rather than select for deletion

rohityadav1993 avatar Dec 10 '25 05:12 rohityadav1993

Another thought here to avoid validDocIds snapshot deviating between replicas can be controlled by how segments are loaded, if we can ensure/is possible to load newly created segments first the behaviour will be consistent in both the replicas

rohityadav1993 avatar Dec 10 '25 05:12 rohityadav1993

For this,

We should set the creationTime = max (creationTime of all merging segments) + 1 to avoid this and fully replace all records from merging segments with new merged segment.

Can you verify this following edge case :

  • I have segments S1 and S2, and a merged segment MS1 that needs to be uploaded. Before it's uploaded, some PKs in the segments S1 and S2 are invalidated by a new segment S3.
  • Now if we upload the new merged segment MS1, does it invalidate PKs of S3 in some rare case ? eg. if the time clashes.

anuragrai16 avatar Dec 15 '25 05:12 anuragrai16

For this,

We should set the creationTime = max (creationTime of all merging segments) + 1 to avoid this and fully replace all records from merging segments with new merged segment.

Can you verify this following edge case :

  • I have segments S1 and S2, and a merged segment MS1 that needs to be uploaded. Before it's uploaded, some PKs in the segments S1 and S2 are invalidated by a new segment S3.
  • Now if we upload the new merged segment MS1, does it invalidate PKs of S3 in some rare case ? eg. if the time clashes.

If the new records in S3 have a higher comparison values than S1 or S2, it will continue to stay even after merged segment is uploaded as the comparison column value takes precedence.

If the record in S3 has the same comparison value as older record, record in S1 and S2 will still get replaced with record in S3. Later when merged segment is uploaded with creation time = (creation time of S2 + 1), it will still be lower than the creation time of S3 so the record in S3 will remain valid. You can read about the preference logic here

tarun11Mavani avatar Dec 15 '25 05:12 tarun11Mavani