Support Isolation Levels and Concurrency Safety Validation Checks
Feature Request / Improvement
Support enforcing Isolation Levels from a specified snapshot ID
https://iceberg.apache.org/docs/latest/spark-configuration/#write-options
There has been sustained interest in running multiple PyIceberg applications concurrently, which requires proper support for optimistic concurrency.
I think the best place to start is by implementing the individual validation functions.
Once these are complete, we'll be able to introduce Isolation Levels and correctly implement the validation logic in the `_OverwriteFiles` snapshot producer, similar to the Java implementation.
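For context on how the pieces fit together, here is a minimal sketch of how an isolation level could select which validation checks an overwrite commit runs, loosely modeled on Java's `BaseOverwriteFiles`. The enum and check names below are illustrative assumptions, not PyIceberg's actual API:

```python
# Illustrative sketch only: how an isolation level could gate which
# validation checks an overwrite commit performs. Loosely modeled on
# Java's BaseOverwriteFiles; none of these names are PyIceberg's API.
from enum import Enum


class IsolationLevel(Enum):
    SERIALIZABLE = "serializable"
    SNAPSHOT = "snapshot"


def checks_for(level: IsolationLevel) -> list[str]:
    # Both levels must fail if files this commit rewrites or deletes
    # were concurrently deleted; SERIALIZABLE must additionally fail if
    # new data matching the overwrite filter was added concurrently.
    checks = ["validate_deleted_data_files", "validate_no_new_delete_files"]
    if level is IsolationLevel.SERIALIZABLE:
        checks.append("validate_added_data_files")
    return checks
```

The design point is that SNAPSHOT isolation only has to detect conflicting deletes, while SERIALIZABLE must also reject concurrently added data in the overwritten range.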
Hi, I'm interested in working on this!
Some relevant links to the Java implementation (a rough Python sketch follows below):
- `validateNewDataFiles` flag -> `MergingSnapshotProducer.validateAddedDataFiles`
- `validateNewDeletes` flag -> `MergingSnapshotProducer.validateDeletedDataFiles`
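For anyone picking these up, here is a minimal, self-contained sketch of what the first check could look like in Python. `SnapshotStub`, `ancestors_between`, and `validate_added_data_files` are hypothetical stand-ins for illustration; the real implementation would walk manifests via PyIceberg's Snapshot/manifest APIs and apply a conflict detection filter, as the Java version does:

```python
# Hypothetical sketch of validateAddedDataFiles in Python. SnapshotStub
# and the helpers below are stand-ins, not PyIceberg's real classes;
# partition/expression filtering of conflicting files is omitted.
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


class ValidationException(Exception):
    """Raised when a concurrent snapshot conflicts with this commit."""


@dataclass
class SnapshotStub:
    snapshot_id: int
    parent_snapshot_id: Optional[int]
    operation: str  # e.g. "append", "overwrite", "replace", "delete"
    added_data_files: List[str]


def ancestors_between(
    current: SnapshotStub,
    starting_snapshot_id: int,
    snapshots_by_id: Dict[int, SnapshotStub],
) -> Iterator[SnapshotStub]:
    """Yield snapshots committed after the snapshot this writer read
    from, walking parent pointers back from the current table state.
    (A real implementation should fail if starting_snapshot_id is not
    an ancestor of the current snapshot.)"""
    snapshot: Optional[SnapshotStub] = current
    while snapshot is not None and snapshot.snapshot_id != starting_snapshot_id:
        yield snapshot
        parent_id = snapshot.parent_snapshot_id
        snapshot = snapshots_by_id.get(parent_id) if parent_id is not None else None


def validate_added_data_files(
    current: SnapshotStub,
    starting_snapshot_id: int,
    snapshots_by_id: Dict[int, SnapshotStub],
) -> None:
    """Fail if any concurrently committed snapshot added data files
    that this commit did not account for."""
    conflicting_ops = {"append", "overwrite", "replace"}
    for snapshot in ancestors_between(current, starting_snapshot_id, snapshots_by_id):
        if snapshot.operation in conflicting_ops and snapshot.added_data_files:
            raise ValidationException(
                f"Conflicting files added by snapshot {snapshot.snapshot_id}: "
                f"{snapshot.added_data_files}"
            )
```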
Hey @sungwy, I would like to contribute by working on these.
Is there one I can pick up and start looking into, such as one of the initial validation implementations?
@guptaakashdeep yes, I don't think there's a particular order we should implement these with, so please feel free to assign yourself to the one you find most interesting!
Sung
Thanks @sungwy! Is there an existing class where I should implement these validation functions, or should we add them directly in `snapshot.py`?
I think we could create a new module, `pyiceberg.table.update.validate`, and add these validation checks there. What do you think, @guptaakashdeep?
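Something like this skeleton, where the function names mirror the Java checks (the signatures are just a first guess, not a settled API):

```python
# Hypothetical skeleton for pyiceberg/table/update/validate.py. The
# function names mirror Java's MergingSnapshotProducer checks; the
# signatures are a first guess, not a settled API.
from typing import Optional

from pyiceberg.expressions import BooleanExpression
from pyiceberg.table import Table
from pyiceberg.table.snapshots import Snapshot


def validate_added_data_files(
    table: Table,
    starting_snapshot: Snapshot,
    data_filter: Optional[BooleanExpression],
) -> None:
    """Fail if snapshots committed after starting_snapshot added data
    files matching data_filter (mirrors validateAddedDataFiles)."""
    raise NotImplementedError


def validate_deleted_data_files(
    table: Table,
    starting_snapshot: Snapshot,
    data_filter: Optional[BooleanExpression],
) -> None:
    """Fail if snapshots committed after starting_snapshot deleted data
    files matching data_filter (mirrors validateDeletedDataFiles)."""
    raise NotImplementedError
```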
Sounds good @sungwy!!
@guptaakashdeep @sungwy see https://github.com/apache/iceberg-python/pull/1935, which should provide the building blocks needed to crank out the 4 sub-issues.
Also going to crank out a manifest group implementation today.
Edit: @sungwy it looks like the `ManifestGroup.entries` method is extremely similar to the `DataScan` defined in the table `__init__.py` file... What do you think?
Is there any update on this feature?
Also curious if there is any movement on this PR. I currently have some workarounds for concurrent writes implemented, but they are very inefficient.