iceberg-python

Support IsolationLevels and Concurrency Safety Validation Checks

Open sungwy opened this issue 1 year ago • 11 comments

Feature Request / Improvement

Support enforcing isolation levels, validated from a specified snapshot ID.

https://iceberg.apache.org/docs/latest/spark-configuration/#write-options

There's been a lot of continued interest in using multiple PyIceberg applications concurrently and having proper support for optimistic concurrency.

I think the best place to start is by implementing the individual validation functions.

Once this is complete, we'll be able to introduce the isolation levels and correctly implement the validation logic in the `_OverwriteFiles` snapshot producer, similar to the Java implementation.
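To make the idea concrete, here is a minimal, self-contained sketch of what such a validation check could look like. It is a toy model, not PyIceberg's API: `ToySnapshot`, `snapshots_since`, `validate_overwrite`, and `CommitConflict` are hypothetical names, and conflicts are detected on partition names rather than on data files and filter expressions as the Java implementation does. The isolation semantics follow the description in the Spark write options linked above: `serializable` checks for concurrent inserts or deletes in the destination partitions, while `snapshot` checks only for concurrent deletes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Iterator, Optional, Set


class IsolationLevel(Enum):
    SERIALIZABLE = "serializable"
    SNAPSHOT = "snapshot"


@dataclass(frozen=True)
class ToySnapshot:
    """Hypothetical stand-in for a table snapshot (not a PyIceberg class)."""

    snapshot_id: int
    parent_id: Optional[int]
    added_partitions: FrozenSet[str] = frozenset()
    deleted_partitions: FrozenSet[str] = frozenset()


class CommitConflict(Exception):
    """Raised when a concurrent commit conflicts with the pending write."""


def snapshots_since(
    snapshots: Dict[int, ToySnapshot], current_id: int, from_id: Optional[int]
) -> Iterator[ToySnapshot]:
    """Yield snapshots committed after `from_id`, newest first, by walking parents."""
    next_id: Optional[int] = current_id
    while next_id is not None and next_id != from_id:
        snapshot = snapshots[next_id]
        yield snapshot
        next_id = snapshot.parent_id


def validate_overwrite(
    snapshots: Dict[int, ToySnapshot],
    current_id: int,
    from_id: Optional[int],
    touched_partitions: Set[str],
    level: IsolationLevel,
) -> None:
    """Raise CommitConflict if a commit made after `from_id` conflicts with this write."""
    for snapshot in snapshots_since(snapshots, current_id, from_id):
        # Both isolation levels reject concurrent deletes in the touched partitions.
        if snapshot.deleted_partitions & touched_partitions:
            raise CommitConflict(f"concurrent delete in snapshot {snapshot.snapshot_id}")
        # Only serializable isolation also rejects concurrent inserts.
        if level is IsolationLevel.SERIALIZABLE and snapshot.added_partitions & touched_partitions:
            raise CommitConflict(f"concurrent insert in snapshot {snapshot.snapshot_id}")
```

In this sketch a writer records the snapshot ID it started from, and just before committing it calls `validate_overwrite` with the table's current snapshot ID. If the check raises, the writer would refresh its view of the table and retry, which is the usual optimistic-concurrency loop.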

sungwy avatar Jun 14 '24 18:06 sungwy

Hi, I am interested in working on it!

jqin61 avatar Jun 14 '24 20:06 jqin61

Some relevant links to the Java implementation

sungwy avatar Jun 14 '24 20:06 sungwy

Hey @sungwy, I would like to contribute by working on these.

Is there one of these that I can pick up and start looking into, such as one of the initial validation implementations?

guptaakashdeep avatar Apr 18 '25 08:04 guptaakashdeep

@guptaakashdeep yes, I don't think there's a particular order we should implement these in, so please feel free to assign yourself to the one you find most interesting!

Sung

sungwy avatar Apr 18 '25 12:04 sungwy

Thanks @sungwy! Is there an existing class where I can implement these validation functions, or should we just add them directly to snapshot.py?

guptaakashdeep avatar Apr 18 '25 13:04 guptaakashdeep

I think we could create a new module, `pyiceberg.table.update.validate` (validate.py), and add these validation checks there. What do you think, @guptaakashdeep?

sungwy avatar Apr 18 '25 13:04 sungwy

Sounds good, @sungwy!

guptaakashdeep avatar Apr 18 '25 15:04 guptaakashdeep

@guptaakashdeep @sungwy see https://github.com/apache/iceberg-python/pull/1935, which should provide the building blocks needed to crank out the 4 sub-issues.

jayceslesar avatar Apr 19 '25 01:04 jayceslesar

Also going to crank out a manifest group implementation today.

Edit: @sungwy it looks like the `manifestgroup.entries` method is extremely similar to the `DataScan` defined in the table `__init__.py` file. What do you think?

jayceslesar avatar Apr 19 '25 15:04 jayceslesar

Is there any update on this feature?

cnatsis avatar Aug 18 '25 07:08 cnatsis

Also curious if there is any movement on this PR. I currently have some workarounds for concurrent writes implemented but they are very inefficient.

dyami0123 avatar Nov 07 '25 19:11 dyami0123