[SEDONA-723] Add write format for (Geo)Arrow

Open paleolimbot opened this issue 9 months ago • 0 comments

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a ticket?

Yes, and the PR name follows the format [SEDONA-723] my subject.

What changes were proposed in this PR?

This PR is intended to add df.write.format("arrows") when complete, but it is currently just an exploration of the idea.

How was this patch tested?

It will be tested in Java (if this change seems worth it!)

Did this PR include necessary documentation updates?

Yes, I am adding a new API (and will update docs if this idea is accepted!)

In SEDONA-660, SEDONA-714, and SEDONA-717, we wired up the ArrowSerializer from SparkConnect to accelerate transfer between the JVM and Python on the driver. For queries whose results are arbitrarily large, or whose size is unknown when the query is issued, this can cause out-of-memory errors, and it would be helpful to have an escape hatch. A write format is also a useful way for Sedona users to build services on top of Sedona (e.g., by returning the URLs to the written Arrow files as described in https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/ ).

Mar 18 '25 19:03 paleolimbot