Description

Finally making an issue about this for discussion. I already have a draft PR for reading CDF in #2048, which still needs work on its reader side, but I mostly want to discuss the writer side, as it will take much more refactoring. I also want to ask how we want to approach this. The reader can be mostly encapsulated as its own thing, whereas the writer will touch all writing operations in Delta Lake. Do we want to roll these into the same PR, or make subsequent PRs? I think subsequent would be better, but that's just my opinion.

For the reader, its first stages are in flight with #2048; I just need to figure out how I want to validate its correctness. @MrPowers maybe you can help me figure that out?

For writers, there is a bit more to do. CDC actions have to be added to the commit log along with the accompanying add/remove actions, generating additional change data files in the _change_data directory of a Delta table. Currently we encapsulate the builders of these operations in such a way that the builder builds and commits all the actions itself, without giving features any ability to influence the actions of the commit before it's written. So, in order to make CDF work, we would be required to update every operation and add its own CDF-aware functionality to it. I'd argue this would only exacerbate the current issue of builders owning the entire life cycle of the operation, and we should not do this.

I would instead suggest that we refactor the builders to only create and return a list of actions to commit and a snapshot to commit to, then let a separate (maybe global) part of the code do the actual commit. This way you can compose operations in a more maintainable way. I spoke with @r3stl355 about this as well, because some of the work he did for replaceWhere would have been another good candidate to benefit from this type of rethinking. So for writers I am proposing we take this approach:

1. Refactor the builders' internals to build sets of actions to commit and not perform the actual commit.
2. Add a central place to actually perform the commit / write / update of the table.
3. Implement the CDF compositions on top of the new operations.
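To make the three steps above a bit more concrete, here is a minimal sketch of the shape I have in mind. Everything in it (`Action`, `CommitPlan`, `plan_delete`, `with_cdf`, `Table::commit`) is a hypothetical, simplified stand-in for illustration and not the existing delta-rs API; the version check in particular is a placeholder for real conflict resolution.

```rust
/// Simplified stand-ins for Delta log actions (hypothetical, not delta-rs types).
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Add { path: String },
    Remove { path: String },
    Cdc { path: String },
}

/// What a builder would return instead of committing by itself:
/// the actions plus the snapshot version they were planned against.
#[derive(Debug, Clone, PartialEq)]
pub struct CommitPlan {
    pub read_version: u64,
    pub actions: Vec<Action>,
}

/// Step 1: an operation only plans its actions. Each pair is
/// (file being rewritten, replacement file holding the remaining rows).
pub fn plan_delete(read_version: u64, rewritten: &[(&str, &str)]) -> CommitPlan {
    let mut actions = Vec::new();
    for (old, new) in rewritten {
        actions.push(Action::Remove { path: old.to_string() });
        actions.push(Action::Add { path: new.to_string() });
    }
    CommitPlan { read_version, actions }
}

/// Step 3: CDF composes on top of a plan. It only appends cdc actions
/// and leaves the add/remove actions of the operation untouched.
pub fn with_cdf(mut plan: CommitPlan, change_files: &[&str]) -> CommitPlan {
    plan.actions.extend(change_files.iter().map(|f| Action::Cdc {
        path: format!("_change_data/{}", f),
    }));
    plan
}

/// Step 2: a single central place that performs the commit.
#[derive(Debug, Default)]
pub struct Table {
    pub version: u64,
    pub log: Vec<Vec<Action>>,
}

impl Table {
    /// Appends the planned actions as one commit and returns the new version.
    /// Rejects plans built against a stale snapshot (placeholder conflict check).
    pub fn commit(&mut self, plan: CommitPlan) -> Result<u64, String> {
        if plan.read_version != self.version {
            return Err(format!(
                "conflict: plan read version {}, table is at version {}",
                plan.read_version, self.version
            ));
        }
        self.log.push(plan.actions);
        self.version += 1;
        Ok(self.version)
    }
}
```

With this shape, a delete with CDF enabled would be roughly `table.commit(with_cdf(plan_delete(...), ...))`, and the same `with_cdf` step could wrap other operations without those operations knowing about CDF.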

This will benefit us in the sense that CDF's implementation will have no effect on what those other operations do; it will only augment them. Our implementations of these features should therefore be more resilient to mistakes as we add more features down the line. Row tracking comes to mind as a potential issue, as it has a specific clause for readers regarding CDF files. Checkpoints, on the other hand, must go the opposite way and exclude CDF from what they write. I linked both under the related issues, but hopefully that makes sense.

Use Case

https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/
https://docs.delta.io/latest/delta-change-data-feed.html
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-cdc-file

Related Issue(s)

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#reader-requirements-for-row-tracking
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1

Jan 21 '24 16:01 hntd187

take

Jan 21 '24 16:01 hntd187

I think that ties in well with what @Blajda proposes in https://github.com/delta-io/delta-rs/issues/2006, where he also points out that we should reuse components across the different operations. Maybe it's good to move to logical plans throughout the API first, before we start working on CDF.

Jan 21 '24 17:01 ion-elgreco


Change Data Feed in Delta
