
Always persist a high watermark for source table

Open valiantljk opened this issue 1 year ago • 2 comments

The current high watermark is tied to the partition locator. During incremental compaction, we expect two sources of input data: the compacted table and the new deltas. In rare cases where no new delta exists, only the compacted table goes through delta discovery and the rest of compaction. As a result, the high watermark recorded in the round completion file comes only from the compacted table.

In the next round, when we retrieve the high watermark from the round completion file, we are not able to get the high watermark of the source table (in some cases we call it old_parent_stream_position).

Two options:

  • get the high watermark from the delta property when the round completion file (rcf) doesn't have it
  • persist the high watermark for the source table in the rcf
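
A rough sketch of how the two options could fit together, not actual deltacat code. `RoundCompletionInfo`, `source_high_watermark`, `resolve_source_high_watermark`, and the `"source_high_watermark"` property key are all hypothetical names made up for illustration; the real rcf and delta property layout may look different.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RoundCompletionInfo:
    """Hypothetical stand-in for the contents of a round completion file."""

    # High watermark as recorded today; with no new deltas this may
    # reflect only the compacted table.
    high_watermark: int
    # Option 2: source table high watermark persisted explicitly.
    source_high_watermark: Optional[int] = None


def resolve_source_high_watermark(
    rcf: RoundCompletionInfo,
    delta_properties: Dict[str, int],
) -> Optional[int]:
    """Return the source table high watermark to use in the next round."""
    # Option 2: prefer the value persisted in the rcf when it is present.
    if rcf.source_high_watermark is not None:
        return rcf.source_high_watermark
    # Option 1: otherwise fall back to the delta property (assumed key).
    return delta_properties.get("source_high_watermark")
```

With something like this, persisting the value in the rcf would be the primary path, and the delta property lookup would only cover rcfs written before the field existed.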

valiantljk avatar Jun 20 '23 17:06 valiantljk

Since there's nothing to update aside from metadata in the case of no new deltas, are we also ensuring that we're not running through all data processing steps of hash bucketing, dedupe, and materialize?

pdames avatar Jun 20 '23 17:06 pdames

Currently, it'll still go through the steps. We don't have a direct copy route yet. It seems to be a corner case.

valiantljk avatar Jun 20 '23 23:06 valiantljk