
Reduce Debezium cost with idempotency key

Open chuck-alt-delete opened this issue 8 months ago • 2 comments

Feature request

Debezium offers before and after images for record updates, which we can use to issue diffs in Materialize. The problem is that certain failure scenarios completely break this mechanism, e.g.:

  1. Forgotten deletes. Debezium will not be able to see deletes while it is down during a snapshot. So Materialize will have this record even though it has been deleted, which is incorrect.
  2. Duplicates. Restarts in Debezium can cause duplicates, which means the after image of a record will not match the before image of the next record for the same key, leading to an errored source in Materialize.
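
To make the duplicate case concrete, here is a minimal sketch of turning before/after images into diffs. The envelope shape (`before`/`after` dicts) and the helper names `envelope_to_diffs` and `apply_events` are assumptions for illustration, not Materialize's actual code. A replayed update retracts the old row a second time, so its multiplicity goes negative.

```python
from collections import Counter

def envelope_to_diffs(envelope):
    """Turn one Debezium-style change event into (row, diff) pairs.

    The before image is retracted (-1) and the after image is inserted (+1).
    """
    diffs = []
    if envelope.get("before") is not None:
        diffs.append((tuple(sorted(envelope["before"].items())), -1))
    if envelope.get("after") is not None:
        diffs.append((tuple(sorted(envelope["after"].items())), +1))
    return diffs

def apply_events(events):
    """Accumulate diffs into row multiplicities, dropping rows that net to zero."""
    state = Counter()
    for event in events:
        for row, diff in envelope_to_diffs(event):
            state[row] += diff
    return {row: count for row, count in state.items() if count != 0}

insert = {"before": None, "after": {"id": 1, "v": "a"}}
update = {"before": {"id": 1, "v": "a"}, "after": {"id": 1, "v": "b"}}

clean = apply_events([insert, update])               # one row, multiplicity 1
duplicated = apply_events([insert, update, update])  # update delivered twice

# A negative multiplicity is the kind of inconsistency that errors the source.
has_negative = any(count < 0 for count in duplicated.values())
```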

As far as I know, 1 is not solved by the current upsert operator anyway, so we can exclude it from this discussion.

For 2, we currently use an upsert operator to deduplicate and create a canonical before/after image for each key that we use to issue good diffs. This comes at a cost: MZ has to remember this information for all keys at all times.
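
As a rough illustration of where that cost comes from, here is a toy upsert-style state. The class `UpsertState` is hypothetical and much simpler than the real operator. It ignores the incoming before image and derives the retraction from its own record of the last value per key, so it has to hold one entry for every live key.

```python
class UpsertState:
    """Toy upsert: remembers the last row per key to synthesize good diffs."""

    def __init__(self):
        self.last = {}  # key -> last known row; grows with the number of live keys

    def apply(self, key, after):
        """Return (row, diff) pairs for a new value; `after=None` means delete."""
        before = self.last.get(key)
        if before == after:
            return []  # duplicate delivery or no-op: nothing to emit
        diffs = []
        if before is not None:
            diffs.append((before, -1))  # retract what we previously emitted
        if after is not None:
            diffs.append((after, +1))
            self.last[key] = after
        else:
            self.last.pop(key, None)  # delete: forget the key
        return diffs

state = UpsertState()
first = state.apply(1, "a")
second = state.apply(1, "b")
replayed = state.apply(1, "b")  # duplicate of the previous event
state.apply(2, "x")
keys_held = len(state.last)
```

The duplicate is absorbed, but only because `last` keeps an entry for every key, however old.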

In practice, duplication happens very rarely and in a specific way. Duplicates don’t happen at random times; they usually occur within a very limited timeframe. If we had a short-lived (say, 1 hour) idempotency key that rejects records that were recently received, we would eliminate virtually all duplicates without paying such a high memory cost.
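
A minimal sketch of what such a short-lived filter could look like. Everything here is an assumption for illustration: the class name `RecentKeyFilter`, the shape of the idempotency key (shown as a `(table, primary key, source position)` tuple), and the requirement that `now` never goes backwards. State is bounded by the number of records seen within the TTL window rather than by the total number of keys.

```python
from collections import OrderedDict

class RecentKeyFilter:
    """Rejects records whose idempotency key was already seen within `ttl_seconds`."""

    def __init__(self, ttl_seconds=3600.0):
        self.ttl = ttl_seconds
        # idempotency key -> first-seen time, oldest first.
        # Ordering holds because `now` is assumed to be non-decreasing.
        self._seen = OrderedDict()

    def _evict(self, now):
        """Drop entries that are at least `ttl` seconds old."""
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self.ttl:
                break
            self._seen.popitem(last=False)

    def accept(self, key, now):
        """Return True if the record should be processed, False if it is a recent duplicate."""
        self._evict(now)
        if key in self._seen:
            return False
        self._seen[key] = now
        return True

dedup = RecentKeyFilter(ttl_seconds=3600)
results = [
    dedup.accept(("orders", 1, 100), now=0.0),     # first delivery
    dedup.accept(("orders", 1, 100), now=10.0),    # replay after a restart
    dedup.accept(("orders", 1, 101), now=20.0),    # next change to the same row
    dedup.accept(("orders", 1, 100), now=3600.0),  # same key after the window expired
]
```

The last call shows the trade-off: a duplicate that arrives after the window has passed is not caught, which is acceptable only if late duplicates really are as rare as described above.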

chuck-alt-delete · Jun 24 '24 15:06