iceberg-python
Distributed writes in the same Iceberg transaction
Feature Request / Improvement
I am trying to understand how the new Arrow write API can work with distributed writes, similar to Spark. I have a use case where, from different machines, I would like to write separate Arrow datasets that all get committed in the same Iceberg transaction. I assume this is theoretically possible since it works with Spark, but I was wondering if there are any plans to support this in the Arrow write API. Thanks!
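For context, a minimal sketch of the pattern being asked about, with several assumptions: pyiceberg >= 0.6 (for `Table.add_files`), a catalog named `default`, an existing table `db.events`, and a shared object store the workers can write to. Each worker writes a plain Parquet file independently; one coordinator then registers all of them in a single commit, so they land in one snapshot. This approximates, but is not the same as, a first-class distributed-write API: the files must already match the table's schema and partitioning.

```python
# Hedged sketch: catalog name, table name, and paths are assumptions.
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

def worker_write(worker_id: int, batch: pa.Table) -> str:
    """Runs on each machine: write this worker's shard as a plain Parquet
    file (no Iceberg metadata is touched here) and return its path."""
    path = f"s3://warehouse/db/events/data/worker-{worker_id}.parquet"
    pq.write_table(batch, path)
    return path

# Coordinator: collect the paths from all workers (transport is up to the
# caller), then register every file in one atomic commit.
catalog = load_catalog("default")
table = catalog.load_table("db.events")
shards = [pa.table({"id": pa.array([i], type=pa.int64())}) for i in range(4)]
paths = [worker_write(i, s) for i, s in enumerate(shards)]  # in reality, gathered across machines
table.add_files(file_paths=paths)  # a single snapshot covering all files
```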
Hey @rahij, this is something that we're planning to support. I know that the folks at Daft are already working on this. Out of curiosity, how much data are we talking about, and what's the partitioning scheme? If you can hold the data in memory, a non-distributed write would also be an option.
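For completeness, the non-distributed option mentioned above is a single call. A minimal sketch, assuming the same hypothetical catalog and table and an Arrow schema that matches the table's:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # catalog name is an assumption
table = catalog.load_table("db.events")  # hypothetical table
df = pa.table({"id": pa.array([1, 2, 3], type=pa.int64())})
table.append(df)  # writes the data files and commits one snapshot
```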
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the 'not-stale' label, but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Any update on supporting distributed writes? We are also interested in adding Iceberg write capability to Ray: https://github.com/ray-project/ray/issues/49032
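The `add_files` pattern from the sketch above maps directly onto Ray tasks. A hedged illustration only, not the API proposed in the linked Ray issue; catalog, table, and paths remain assumptions:

```python
import ray
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

@ray.remote
def write_shard(shard_id: int, batch: pa.Table) -> str:
    # Each task writes one Parquet file to a shared store; the path is hypothetical.
    path = f"s3://warehouse/db/events/data/shard-{shard_id}.parquet"
    pq.write_table(batch, path)
    return path

ray.init()
shards = [pa.table({"id": pa.array([i], type=pa.int64())}) for i in range(4)]
paths = ray.get([write_shard.remote(i, s) for i, s in enumerate(shards)])

# The driver registers all shards atomically in one snapshot.
load_catalog("default").load_table("db.events").add_files(file_paths=paths)
```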