
Distributed writes in the same iceberg transaction

Open rahij opened this issue 1 year ago • 2 comments

Feature Request / Improvement

I am trying to understand how the new Arrow write API can work with distributed writes, similar to Spark. I have a use case where, from different machines, I would like to write separate Arrow datasets that all get committed in the same Iceberg transaction. I assume this should be theoretically possible since it works with Spark, but I was wondering whether there are any plans to support this in the Arrow write API. Thanks!
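One common way to approximate this is a two-phase pattern: each worker independently writes its own data files (the heavy, parallelizable part), and a single coordinator then registers all of them in one metadata commit (in PyIceberg, `Table.add_files` can append pre-written files in a single transaction). Below is a minimal pure-Python sketch of just the coordination shape, with no Iceberg calls; `worker_write` and `coordinator_commit` are hypothetical stand-ins for the Parquet write and the catalog commit:

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def worker_write(worker_id: int, out_dir: Path) -> str:
    """Phase 1 (distributed): each worker writes its own data file and
    returns only the file path. In a real setup this would be an
    Arrow/Parquet write running on a separate machine."""
    path = out_dir / f"part-{worker_id}.json"
    path.write_text(json.dumps({"worker": worker_id, "rows": [worker_id] * 3}))
    return str(path)

def coordinator_commit(paths: list[str], manifest: Path) -> None:
    """Phase 2 (single node): the coordinator gathers all file paths and
    commits them atomically in one metadata operation. With PyIceberg this
    step would be roughly table.add_files(paths), which appends existing
    files to the table in a single commit."""
    manifest.write_text(json.dumps(sorted(paths)))

out_dir = Path(tempfile.mkdtemp())
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(lambda i: worker_write(i, out_dir), range(4)))

manifest = out_dir / "manifest.json"
coordinator_commit(paths, manifest)
```

The key point of the pattern: data files can be produced anywhere in parallel, but the metadata commit against the catalog must be serialized through one writer so all files land in the same snapshot.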

rahij avatar Feb 03 '24 10:02 rahij

Hey @rahij This is something that we're planning on supporting. I know that the folks at Daft are already working on this. Out of curiosity, how much data are we talking about, and what's the partitioning scheme? Since you can probably hold it in memory, a non-distributed write would also be an option.

Fokko avatar Feb 05 '24 20:02 Fokko

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the 'not-stale' label; commenting on the issue is preferred when possible.

github-actions[bot] avatar Aug 04 '24 00:08 github-actions[bot]

This issue has been closed because it has not received any activity in the 14 days since being marked as 'stale'.

github-actions[bot] avatar Aug 18 '24 00:08 github-actions[bot]

Any update on supporting distributed writes? We are also interested in adding Iceberg write capability to Ray: https://github.com/ray-project/ray/issues/49032

jimmyxie-figma avatar Dec 11 '24 22:12 jimmyxie-figma