feat: introduce CDC write-side support for Update operations
This change introduces a CDCTracker which helps collect changes during merges and updates. This is admittedly rather inefficient, but my hope is that it provides a place to start iterating on and improving the writer code.
There is still additional work to be done to handle table features properly for other code paths (see the middleware discussion we have had in Slack), but this produces CDC files for Update operations.
Fixes #604 Fixes #2095
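For readers unfamiliar with what "collecting changes" means for an update, here is a rough sketch of the idea, not the actual CDCTracker code. The `Row`, `ChangeRow`, and `collect_update_changes` names are hypothetical and only for illustration; the `update_preimage` / `update_postimage` values are the `_change_type` values the Delta protocol defines for updates. The lookup is deliberately naive (quadratic), in the same spirit as the "rather inefficient" caveat above.

```rust
// Hypothetical illustration only; these types are NOT the delta-rs CDCTracker API.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: i64,
    value: String,
}

#[derive(Debug, Clone, PartialEq)]
struct ChangeRow {
    row: Row,
    // Mirrors the `_change_type` column written to CDC files.
    change_type: &'static str,
}

/// Pair rows from before and after an update by `id`, and emit a
/// pre-image and a post-image for every row whose contents changed.
fn collect_update_changes(before: &[Row], after: &[Row]) -> Vec<ChangeRow> {
    let mut changes = Vec::new();
    for old in before {
        // Naive linear lookup; fine for a sketch, too slow for real tables.
        if let Some(new) = after.iter().find(|r| r.id == old.id) {
            if new != old {
                changes.push(ChangeRow {
                    row: old.clone(),
                    change_type: "update_preimage",
                });
                changes.push(ChangeRow {
                    row: new.clone(),
                    change_type: "update_postimage",
                });
            }
        }
    }
    changes
}
```

Unchanged rows produce no change rows, so only the rows actually touched by the update end up in the CDC output.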
I think it's better to disable it until all operations (delete and merge) are supported; otherwise we cannot push any Python releases until those are added.
How would you disable this? It doesn't make sense to me to include short-term configuration or feature flags. The protocol states that when the change data feed table feature is enabled, writers can optionally produce CDC files. Our writers will just optionally create them only on updates for now :laughing:
While doing further acceptance testing, I identified what I believe to be a bug in DataFusion, so I will put this into Draft until I can figure out the path forward:
apache/datafusion#10749
After discussing with @ion-elgreco: apache/datafusion#10749 is really an issue with arrow-rs, so we decided that we can move forward without struct/list CDC working, under the following conditions:
- warn if the schema has a struct or a list in it when attempting to create CDC batches, to try to make it clear to the user that this is not yet supported
- generated column support doesn't exist yet, but as long as `delta.generatedExpression` is not set, it's safe to allow that table feature to exist (code must be modified).
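
A minimal sketch of the first condition, assuming a simplified stand-in for the schema types. `FieldType`, `Field`, `unsupported_cdc_fields`, and `warn_if_unsupported` are hypothetical names, not the arrow or delta-rs types, and the real check may well live elsewhere and use the project's logging instead of `eprintln!`:

```rust
// Hypothetical stand-ins for schema types; NOT the arrow/delta-rs definitions.
#[derive(Debug)]
enum FieldType {
    Int64,
    Utf8,
    Struct(Vec<Field>),
    List(Box<FieldType>),
}

#[derive(Debug)]
struct Field {
    name: String,
    data_type: FieldType,
}

/// Names of top-level fields whose type is a struct or a list,
/// i.e. the columns CDC batches cannot handle yet.
fn unsupported_cdc_fields(schema: &[Field]) -> Vec<&str> {
    schema
        .iter()
        .filter(|f| matches!(f.data_type, FieldType::Struct(_) | FieldType::List(_)))
        .map(|f| f.name.as_str())
        .collect()
}

/// Emit a warning when the schema contains unsupported columns.
/// Returns true if a warning was emitted.
fn warn_if_unsupported(schema: &[Field]) -> bool {
    let names = unsupported_cdc_fields(schema);
    if names.is_empty() {
        return false;
    }
    eprintln!(
        "warning: CDC is not yet supported for struct/list columns: {}",
        names.join(", ")
    );
    true
}
```

This only inspects top-level fields, which is enough here because any nested struct or list is necessarily contained in a top-level struct or list column.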