PSC-STM-B6: Add tracking of which records have been transformed

Open tiredpixel opened this issue 1 year ago • 0 comments

It is not ideal to process the same records multiple times, since each repeat pass may keep replacing statements.

When we are consuming from S3, we only transform each file once, and when consuming from a Kinesis stream, we keep track of our stream pointer, so this doesn’t happen much in practice. However, when switching from bulk files over to the Kinesis stream, there is a danger of roughly 48 hours of records being processed more than once.

To fix this, it would make sense to keep track of the records transformed in the previous 48 hours, so these can be safely skipped.

  • When a record has been transformed, store the etag of the processed PSC record for a length of time longer than the maximum stream duration (e.g. store it for 48 hours).
  • When transforming a PSC record, first check whether it has been transformed in the last 48 hours.

This will ensure that the same records don’t get processed multiple times in cases of duplicates or during the changeover.
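
As a rough illustration of the two steps above, here is a minimal Python sketch. It is not the project's implementation: `TransformedRecordTracker`, `transform_once`, and the `record["etag"]` dictionary shape are hypothetical names chosen for the example, and the store is a plain in-memory dict, whereas a real deployment would presumably need a shared store that supports key expiry.

```python
import time
from typing import Callable, Dict

# Assumed retention window, taken from the 48 hours mentioned in this issue.
RETENTION_SECONDS = 48 * 60 * 60


class TransformedRecordTracker:
    """Hypothetical in-memory tracker of recently transformed etags."""

    def __init__(
        self,
        retention_seconds: float = RETENTION_SECONDS,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._retention = retention_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._seen: Dict[str, float] = {}  # etag -> time it was transformed

    def _purge_expired(self) -> None:
        # Drop every etag whose entry is older than the retention window.
        cutoff = self._clock() - self._retention
        expired = [etag for etag, ts in self._seen.items() if ts <= cutoff]
        for etag in expired:
            del self._seen[etag]

    def was_transformed(self, etag: str) -> bool:
        self._purge_expired()
        return etag in self._seen

    def mark_transformed(self, etag: str) -> None:
        self._seen[etag] = self._clock()


def transform_once(
    record: dict,
    tracker: TransformedRecordTracker,
    transform: Callable[[dict], object],
) -> bool:
    """Transform the record unless its etag was seen within the window.

    Returns True if the record was transformed, False if it was skipped.
    Assumes the record is a dict with an "etag" key (hypothetical shape).
    """
    etag = record["etag"]
    if tracker.was_transformed(etag):
        return False
    transform(record)
    # Mark only after a successful transform, so failures are retried.
    tracker.mark_transformed(etag)
    return True
```

The etag is marked only after the transform returns, so a record whose transform raises would be retried rather than silently skipped. Purging on every lookup is linear in the number of stored etags, which is fine for a sketch but is one reason a store with built-in expiry may be preferable.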

Estimate: 6 hours

tiredpixel, Feb 28 '24 16:02