iceberg-python
Distributed writes in the same Iceberg transaction
Feature Request / Improvement
I am trying to understand how the new Arrow write API can work with distributed writes, similar to Spark. I have a use case where, from different machines, I would like to write separate Arrow datasets that all get committed in the same Iceberg transaction. I assume this is theoretically possible since it works with Spark, but I was wondering if there are any plans to support this in the Arrow write API. Thanks!
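For context, a minimal sketch of the pattern being asked about, with several assumptions: pyiceberg >= 0.6 (for `Table.add_files`), a catalog named `default`, an existing table `db.events`, and a shared object store the workers can write to. Each worker writes a plain Parquet file independently; one coordinator then registers all of them in a single commit, so they land in one snapshot. This approximates, but is not the same as, a first-class distributed-write API: the files must already match the table's schema and partitioning.

```python
# Hedged sketch: catalog name, table name, and paths are assumptions.
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

def worker_write(worker_id: int, batch: pa.Table) -> str:
    """Runs on each machine: write this worker's shard as a plain Parquet
    file (no Iceberg metadata is touched here) and return its path."""
    path = f"s3://warehouse/db/events/data/worker-{worker_id}.parquet"
    pq.write_table(batch, path)
    return path

# Coordinator: collect the paths from all workers (transport is up to the
# caller), then register every file in one atomic commit.
catalog = load_catalog("default")
table = catalog.load_table("db.events")
shards = [pa.table({"id": pa.array([i], type=pa.int64())}) for i in range(4)]
paths = [worker_write(i, s) for i, s in enumerate(shards)]  # in reality, gathered across machines
table.add_files(file_paths=paths)  # a single snapshot covering all files
```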
Hey @rahij, this is something that we're planning to support. I know that the folks at Daft are already working on this. Out of curiosity, how much data are we talking about, and what's the partitioning scheme? If you can hold the data in memory, a non-distributed write would also be an option.
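For completeness, the non-distributed option mentioned above is a single call. A minimal sketch, assuming the same hypothetical catalog and table and an Arrow schema that matches the table's:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # catalog name is an assumption
table = catalog.load_table("db.events")  # hypothetical table
df = pa.table({"id": pa.array([1, 2, 3], type=pa.int64())})
table.append(df)  # writes the data files and commits one snapshot
```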
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the 'not-stale' label, but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Any update on supporting distributed writes? We are also interested in adding Iceberg write capability to Ray: https://github.com/ray-project/ray/issues/49032
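The `add_files` pattern from the sketch above maps directly onto Ray tasks. A hedged illustration only, not the API proposed in the linked Ray issue; catalog, table, and paths remain assumptions:

```python
import ray
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

@ray.remote
def write_shard(shard_id: int, batch: pa.Table) -> str:
    # Each task writes one Parquet file to a shared store; the path is hypothetical.
    path = f"s3://warehouse/db/events/data/shard-{shard_id}.parquet"
    pq.write_table(batch, path)
    return path

ray.init()
shards = [pa.table({"id": pa.array([i], type=pa.int64())}) for i in range(4)]
paths = ray.get([write_shard.remote(i, s) for i, s in enumerate(shards)])

# The driver registers all shards atomically in one snapshot.
load_catalog("default").load_table("db.events").add_files(file_paths=paths)
```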