delta-rs

feat: introduce CDC write-side support for the Update operations

Open rtyler opened this issue 1 year ago • 2 comments

This change introduces a CDCTracker which helps collect changes during merges and updates. This is admittedly rather inefficient, but my hope is that it provides a place to start iterating and improving upon the writer code.

There is still additional work to be done to handle table features properly for other code paths (see the middleware discussion we have had in Slack), but this produces CDC files for Update operations.

Fixes #604 Fixes #2095

rtyler avatar May 07 '24 07:05 rtyler
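To make the idea of collecting changes during an update more concrete, here is a minimal sketch in plain Rust. It is not the `CDCTracker` from this PR: `Row`, `ChangeRow`, and `UpdateChangeTracker` are hypothetical stand-ins, and real delta-rs code operates on Arrow record batches rather than in-memory structs. The only detail taken from Delta's change data feed is the pair of `_change_type` values `update_preimage` and `update_postimage`. Like the approach described above, it buffers both images, which is simple but not efficient.

```rust
use std::collections::BTreeMap;

/// A simplified row: a key plus one value column.
/// Hypothetical; the real writer works on Arrow record batches.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: u64,
    pub value: String,
}

/// A row tagged with the `_change_type` it would carry in a CDC file.
#[derive(Debug, Clone, PartialEq)]
pub struct ChangeRow {
    pub row: Row,
    pub change_type: &'static str,
}

/// Hypothetical tracker that buffers rows as seen before and after an update.
#[derive(Default)]
pub struct UpdateChangeTracker {
    pre: BTreeMap<u64, Row>,
    post: BTreeMap<u64, Row>,
}

impl UpdateChangeTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record rows as they looked before the update was applied.
    pub fn pre_image(&mut self, rows: &[Row]) {
        for row in rows {
            self.pre.insert(row.id, row.clone());
        }
    }

    /// Record rows as they look after the update was applied.
    pub fn post_image(&mut self, rows: &[Row]) {
        for row in rows {
            self.post.insert(row.id, row.clone());
        }
    }

    /// Emit a preimage/postimage pair for every row whose content changed.
    /// Rows that are identical before and after produce no change rows.
    pub fn collect(&self) -> Vec<ChangeRow> {
        let mut changes = Vec::new();
        for (id, before) in &self.pre {
            if let Some(after) = self.post.get(id) {
                if before != after {
                    changes.push(ChangeRow {
                        row: before.clone(),
                        change_type: "update_preimage",
                    });
                    changes.push(ChangeRow {
                        row: after.clone(),
                        change_type: "update_postimage",
                    });
                }
            }
        }
        changes
    }
}
```

A caller would feed the rows scanned from the affected files into `pre_image`, the rewritten rows into `post_image`, and then write the result of `collect` out as the CDC data for that commit.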

I think it's better to disable it until all operations (delete and merge) are supported, otherwise we cannot push any Python releases until those are added.

ion-elgreco avatar May 09 '24 09:05 ion-elgreco

> I think it's better to disable it until all operations (delete and merge) are supported, otherwise we cannot push any Python releases until those are added.

How would you disable this? It doesn't make sense to me to include short-term configuration or feature flags. The protocol states that when the change data feed table feature is enabled, writers can optionally produce CDC files. For now, our writers will just optionally create them only on updates :laughing:

rtyler avatar May 12 '24 18:05 rtyler
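The behavior argued for above (no extra flag, CDC files only for updates while the feature is enabled) can be sketched as a single decision function. This is an illustration only: `Operation`, `cdf_enabled`, and `should_write_cdc` are hypothetical names, and the table configuration is modeled as a plain string map. The property name `delta.enableChangeDataFeed` is the Delta table property that turns the change data feed on.

```rust
use std::collections::HashMap;

/// Hypothetical subset of write operations discussed in this thread.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Operation {
    Update,
    Delete,
    Merge,
}

/// True when the table configuration turns the change data feed on.
pub fn cdf_enabled(config: &HashMap<String, String>) -> bool {
    config
        .get("delta.enableChangeDataFeed")
        .map(|value| value.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

/// Assumption for this sketch: only Update produces CDC files for now,
/// so Delete and Merge skip CDC even when the feature is enabled.
pub fn should_write_cdc(config: &HashMap<String, String>, op: Operation) -> bool {
    cdf_enabled(config) && matches!(op, Operation::Update)
}
```

Supporting another operation later would then mean widening the `matches!` arm rather than removing a user-facing flag.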

While doing further acceptance testing, I identified what I believe to be a bug in DataFusion. I will put this into Draft until I can figure out the path forward.

apache/datafusion#10749

rtyler avatar Jun 01 '24 17:06 rtyler

In discussion with @ion-elgreco, we decided that, because of apache/datafusion#10749 (which is really an issue with arrow-rs), we can move forward without struct/list CDC working, under the following conditions:

  • Warn if the schema has a struct or a list in it when attempting to create CDC batches, to make it clear to the user that this is not yet supported.
  • Generated column support doesn't exist yet, but as long as delta.generatedExpression is not set, it's safe to allow that table feature to exist (code must be modified to permit this).

rtyler avatar Jun 03 '24 15:06 rtyler
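The two conditions above amount to a pair of schema checks, sketched below in plain Rust. Everything here is an assumption made for illustration: `DataType`, `Field`, `cdc_warning`, and `uses_generated_columns` are hypothetical, the schema is a simplified stand-in for the real Delta/Arrow schema types, and the metadata key is copied verbatim from the comment above, so it should be checked against the Delta protocol before being relied on.

```rust
use std::collections::HashMap;

/// Metadata key as spelled in the thread above (unverified assumption).
pub const GENERATED_EXPRESSION_KEY: &str = "delta.generatedExpression";

/// Simplified, hypothetical schema types.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Integer,
    String,
    Struct(Vec<Field>),
    List(Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub metadata: HashMap<String, String>,
}

/// Names of top-level columns whose type is a struct or a list.
pub fn nested_columns(fields: &[Field]) -> Vec<String> {
    fields
        .iter()
        .filter(|f| matches!(f.data_type, DataType::Struct(_) | DataType::List(_)))
        .map(|f| f.name.clone())
        .collect()
}

/// Returns a warning message when CDC batches cannot cover the schema yet.
pub fn cdc_warning(fields: &[Field]) -> Option<String> {
    let nested = nested_columns(fields);
    if nested.is_empty() {
        None
    } else {
        Some(format!(
            "CDC is not yet supported for struct/list columns: {}",
            nested.join(", ")
        ))
    }
}

/// True when any top-level column carries a generation expression.
pub fn uses_generated_columns(fields: &[Field]) -> bool {
    fields
        .iter()
        .any(|f| f.metadata.contains_key(GENERATED_EXPRESSION_KEY))
}
```

In this sketch a writer would log the message from `cdc_warning` before building CDC batches, and would accept a table with the generated-columns feature only while `uses_generated_columns` returns `false`.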