arrow<->arrow2 interoperability conversion method?

Open dbr opened this issue 2 years ago • 11 comments

I'm using arrow2 in a project along with polars. I now also want to send the same data to datafusion, which uses arrow (ideally without having to send the data through the IPC serialization format).

Before I fumble around with the FFI API, I figured I would check first: is there a method somewhere that handles the conversion of a RecordBatch from arrow to arrow2 and vice versa? It seems like something that might already exist in some kind of integration test or similar, but "arrow<->arrow2" is quite a tricky thing to search for.

dbr avatar Nov 24 '21 01:11 dbr

Nope, we do not have such a method. That is because doing so requires arrow depending on arrow2 or vice-versa.

The Arrow format was designed exactly to support this case, though, and the FFI really is the conversion. Note that a RecordBatch is not part of the Arrow in-memory specification; it is just a struct that exists in some implementations.
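
To illustrate why the FFI "is the conversion": the Arrow C Data Interface defines a fixed C struct (ArrowArray), so two independent crates that each declare it with #[repr(C)] can hand an array to one another with a pointer cast. Below is a minimal std-only sketch of that idea. The module names `producer` and `consumer` and the functions `export`, `import_i32` and `roundtrip` are hypothetical stand-ins, not the actual arrow or arrow2 API; the real crates ship their own FFI types and also exchange an ArrowSchema struct describing the data type, which is omitted here.

```rust
// Stand-in for the exporting crate.
mod producer {
    use std::ffi::c_void;

    // Mirrors `struct ArrowArray` from the Arrow C Data Interface.
    #[repr(C)]
    pub struct FfiArray {
        pub length: i64,
        pub null_count: i64,
        pub offset: i64,
        pub n_buffers: i64,
        pub n_children: i64,
        pub buffers: *mut *const c_void,
        pub children: *mut *mut FfiArray,
        pub dictionary: *mut FfiArray,
        pub release: Option<unsafe extern "C" fn(*mut FfiArray)>,
        pub private_data: *mut c_void,
    }

    // Owns the memory that the exported struct points into.
    struct Private {
        _values: Vec<i32>,
        buffers: [*const c_void; 2],
    }

    // Release callback: frees the private data and marks the struct released.
    unsafe extern "C" fn release(array: *mut FfiArray) {
        unsafe {
            drop(Box::from_raw((*array).private_data as *mut Private));
            (*array).release = None;
        }
    }

    // Exports a non-null Int32 array: buffer 0 is the (absent) validity
    // bitmap, buffer 1 holds the values.
    pub fn export(values: Vec<i32>) -> FfiArray {
        let length = values.len() as i64;
        let data = values.as_ptr() as *const c_void;
        let private = Box::into_raw(Box::new(Private {
            _values: values,
            buffers: [std::ptr::null(), data],
        }));
        FfiArray {
            length,
            null_count: 0,
            offset: 0,
            n_buffers: 2,
            n_children: 0,
            buffers: unsafe {
                std::ptr::addr_of_mut!((*private).buffers) as *mut *const c_void
            },
            children: std::ptr::null_mut(),
            dictionary: std::ptr::null_mut(),
            release: Some(release),
            private_data: private as *mut c_void,
        }
    }
}

// Stand-in for the importing crate; it declares the same layout on its own.
mod consumer {
    use std::ffi::c_void;

    #[repr(C)]
    pub struct FfiArray {
        pub length: i64,
        pub null_count: i64,
        pub offset: i64,
        pub n_buffers: i64,
        pub n_children: i64,
        pub buffers: *mut *const c_void,
        pub children: *mut *mut FfiArray,
        pub dictionary: *mut FfiArray,
        pub release: Option<unsafe extern "C" fn(*mut FfiArray)>,
        pub private_data: *mut c_void,
    }

    // Copies the Int32 values out, then hands the memory back to the producer.
    // A zero-copy consumer would keep the pointer and call `release` on drop.
    pub unsafe fn import_i32(array: *mut FfiArray) -> Vec<i32> {
        unsafe {
            let values = *(*array).buffers.add(1) as *const i32;
            let start = values.add((*array).offset as usize);
            let out = std::slice::from_raw_parts(start, (*array).length as usize).to_vec();
            if let Some(release) = (*array).release {
                release(array);
            }
            out
        }
    }
}

// The "conversion" is a pointer cast between two identically laid out structs.
pub fn roundtrip(values: Vec<i32>) -> Vec<i32> {
    let mut exported = producer::export(values);
    let ptr = &mut exported as *mut producer::FfiArray as *mut consumer::FfiArray;
    unsafe { consumer::import_i32(ptr) }
}
```

Neither module depends on the other's types; only the shared C layout connects them, which is why no crate-level dependency is needed.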

jorgecarleitao avatar Nov 24 '21 05:11 jorgecarleitao

There could be a crate one level higher that implements the conversion.

    arrow-conv
         |
        /\
      /    \
arrow      arrow2 

It could maybe support conversion of the core types, ArrayRef and RecordBatch.

ritchie46 avatar Nov 27 '21 19:11 ritchie46

Off topic, but in case you missed it, there is also a fairly up-to-date arrow2 branch of datafusion that's being worked on.

houqp avatar Nov 27 '21 19:11 houqp

Hi @houqp, may I ask which branch that is? And is it going to live under Apache or in a separate repo?

renato2099 avatar Dec 30 '21 20:12 renato2099

I think it means https://github.com/apache/arrow-datafusion/pull/68

alamb avatar Jan 02 '22 16:01 alamb

> doing so requires arrow depending on arrow2 or vice-versa.

This could easily be feature-gated though, so there is no hard dependency. The FFI works but is very awkward to do manually since it involves working with raw pointers, versus the lovely abstraction of just calling .into().

multimeric avatar Feb 18 '22 15:02 multimeric

> This could easily be feature-gated though, so there is no hard dependency

True, although I think there are some benefits to having it be a separate crate. Mainly, its releases wouldn't be tied to either project's release cycle (i.e., a new release of the conv crate could be made whenever arrow or arrow2 is released).

It could also allow the crate to work with multiple versions of either crate via feature flags (similar to how multiple winit versions are handled here).

The crate could also support the pyarrow interop via PyO3, which would have the same benefits (e.g., currently arrow-rs 9 only supports pyo3 0.15, and we are blocked on updating pyo3 until the next release of arrow).

The biggest drawback of it being a separate crate is that you can't implement some conversion traits as neatly (since they would be foreign types).
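
For illustration, here is a rough sketch of the foreign-type limitation and one common workaround. In a separate crate, `impl From<arrow::...::Field> for arrow2::...::Field` would be rejected by the orphan rule (error E0117), because neither the trait nor either type is local. A trait defined in the conversion crate itself is allowed, at the cost of calling something other than `.into()`. Everything here is hypothetical: `arrow_stub` and `arrow2_stub` are simplified stand-ins for the two crates, and `IntoArrow2` is an invented name.

```rust
// Simplified stand-ins for the two upstream crates. In this single file they
// are only modules, so the orphan rule would not actually trigger here; in a
// real third crate these types would be foreign.
mod arrow_stub {
    #[derive(Debug, Clone, PartialEq)]
    pub struct Field {
        pub name: String,
        pub nullable: bool,
    }
}

mod arrow2_stub {
    #[derive(Debug, Clone, PartialEq)]
    pub struct Field {
        pub name: String,
        pub is_nullable: bool,
    }
}

// A trait local to the (hypothetical) conversion crate. Implementing a local
// trait for a foreign type is permitted by the orphan rule.
pub trait IntoArrow2 {
    type Output;
    fn into_arrow2(self) -> Self::Output;
}

impl IntoArrow2 for arrow_stub::Field {
    type Output = arrow2_stub::Field;

    fn into_arrow2(self) -> arrow2_stub::Field {
        arrow2_stub::Field {
            name: self.name,
            is_nullable: self.nullable,
        }
    }
}
```

Callers would then write `field.into_arrow2()` after importing the trait, which is slightly less neat than `.into()` but keeps the conversion crate independent of both release cycles.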

dbr avatar Mar 07 '22 01:03 dbr

I think a conversion crate would be valuable indeed, though I don't have time to work on such a thing now.

We could potentially host it in the https://github.com/datafusion-contrib organization and there might be others in the community who are interested in helping -- see the arrow2 milestone https://github.com/apache/arrow-datafusion/milestone/3

alamb avatar Mar 07 '22 11:03 alamb

Designing a bit, I think that we would need 6 functions:

pub fn arrow_to_arrow2_error(error: arrow::error::ArrowError) -> arrow2::error::ArrowError;
pub fn arrow_to_arrow2_field(field: arrow::datatypes::Field) -> Result<arrow2::datatypes::Field, arrow2::error::ArrowError>;
pub fn arrow_to_arrow2_array(array: Arc<dyn arrow::array::Array>) -> Result<Arc<dyn arrow2::array::Array>, arrow2::error::ArrowError>;

pub fn arrow2_to_arrow_error(error: arrow2::error::ArrowError) -> arrow::error::ArrowError;
pub fn arrow2_to_arrow_field(field: arrow2::datatypes::Field) -> Result<arrow::datatypes::Field, arrow::error::ArrowError>;
pub fn arrow2_to_arrow_array(array: Arc<dyn arrow2::array::Array>) -> Result<Arc<dyn arrow::array::Array>, arrow::error::ArrowError>;
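
As a rough sketch of how two of these signatures might behave, here is a std-only mock of the error and field conversions in one direction. The modules `arrow_side` and `arrow2_side`, their tiny `DataType` enums, and the `OnlyInArrow` variant are all invented for illustration; the real types are far richer. The point is only that the field conversion is fallible (one side may not be able to represent a type) while the error conversion is total.

```rust
// Hypothetical, heavily simplified stand-ins for the two crates' types.
mod arrow_side {
    #[derive(Debug, Clone, PartialEq)]
    pub enum DataType {
        Int32,
        Utf8,
        OnlyInArrow, // invented: a type the other side cannot represent
    }

    #[derive(Debug, PartialEq)]
    pub struct Field {
        pub name: String,
        pub data_type: DataType,
    }

    #[derive(Debug, PartialEq)]
    pub enum ArrowError {
        NotYetImplemented(String),
    }
}

mod arrow2_side {
    #[derive(Debug, Clone, PartialEq)]
    pub enum DataType {
        Int32,
        Utf8,
    }

    #[derive(Debug, PartialEq)]
    pub struct Field {
        pub name: String,
        pub data_type: DataType,
    }

    #[derive(Debug, PartialEq)]
    pub enum ArrowError {
        NotYetImplemented(String),
    }
}

// Total: every error variant on one side maps to a variant on the other.
pub fn arrow_to_arrow2_error(error: arrow_side::ArrowError) -> arrow2_side::ArrowError {
    match error {
        arrow_side::ArrowError::NotYetImplemented(msg) => {
            arrow2_side::ArrowError::NotYetImplemented(msg)
        }
    }
}

// Fallible: returns an error when the data type has no counterpart.
pub fn arrow_to_arrow2_field(
    field: arrow_side::Field,
) -> Result<arrow2_side::Field, arrow2_side::ArrowError> {
    let data_type = match field.data_type {
        arrow_side::DataType::Int32 => arrow2_side::DataType::Int32,
        arrow_side::DataType::Utf8 => arrow2_side::DataType::Utf8,
        other => {
            return Err(arrow2_side::ArrowError::NotYetImplemented(format!(
                "no counterpart for {:?}",
                other
            )))
        }
    };
    Ok(arrow2_side::Field {
        name: field.name,
        data_type,
    })
}
```

The array functions would presumably go through the FFI structs rather than a match like this, since the buffers can be shared without copying.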

jorgecarleitao avatar Mar 07 '22 18:03 jorgecarleitao

You miss out on a valid naming option: arrow2arrow 😜

ritchie46 avatar Mar 07 '22 18:03 ritchie46

> You miss out on a valid naming option: arrow2arrow 😜

🤯 lol

alamb avatar Mar 07 '22 19:03 alamb