arrow2 Higher level FFI abstraction

Higher level FFI abstraction

Open multimeric opened this issue 2 years ago • 8 comments

A certain other Arrow crate has a lovely method on the Array that returns two pointers to FFI Array and Schema. Currently arrow2 does not have this, so users have to delve into the ugliness of creating a Box and coercing it into a pointer. In practise all of that stuff is just boilerplate, so it would be nicer to be part of arrow2 to avoid users writing and rewriting it whenever they actually want to do Arrow interop.

Feb 18 '22 16:02 multimeric

I agree with the sentiment that the API is not ideal. However, my understanding is that the C data interface flow is expected to be as follows (see Member allocation):

FFI structs are allocated by the consumer
the pointers are passed to the producer, which populates them (including the release callback)
the consumer reads from those pointers
the consumer calls release when they are no longer needed

Array -> pointer assumes that the producer is the one allocating the FFI struct, which imho deviates from the spec.

Feb 18 '22 16:02 jorgecarleitao

If that is the case, could we not provide a method that still does this, but takes two pointers as arguments?

Feb 18 '22 16:02 multimeric

That is a good idea. In this example we tried to illustrate this. Maybe that function should be moved to the crate?

Also, note how the field's name and nullability is not really needed; we only need the data_type.

Feb 18 '22 17:02 jorgecarleitao

Yes, making them part of the public API would be quite helpful. But also there's still the issue of the array allocation for import into Rust, ie this stuff: https://github.com/jorgecarleitao/arrow2/blob/c012c9d3095b03877bd5f610d4db746b728e9cba/examples/ffi.rs#L29-L37

I wonder if we could also create a function that just does this, and returns a tuple of (array_ptr, schema_ptr) which users can then pass into the function provided by other arrow implementations?

Feb 18 '22 23:02 multimeric

Let me try to explain the rational for this API.

There is one, and only one source of unsoundness in FFI: that an implementation fails to meet the C data interface. This invariant is crucial in this region of the code base. Because FFI's interface is done via pointers, and pointers have very little compile time information about their origin, there is a second potential source of unsoundness here - that the user forgets about some invariant of the pointers.

The reason for the current API is that it makes it obvious that the call

https://github.com/jorgecarleitao/arrow2/blob/c012c9d3095b03877bd5f610d4db746b728e9cba/examples/ffi.rs#L44-L47

is safe (we just allocated them via a Box 5 lines above). If we write

let (array_ptr, schema_ptr) = arrow2::ffi::new_pointers();

...

let array_ptr = unsafe { Box::from_raw(array_ptr) };
let schema_ptr = unsafe { Box::from_raw(schema_ptr) };

it is now unclear why this particular call is safe; the user needs to know from the documentation that the pointers from new_pointers are originated from a Box.

If instead we offer the function

pub unsafe fn import_array(array: *mut Ffi_ArrowArray, schema: *mut Ffi_ArrowSchema) -> Result<Box<dyn Array>>

this unsafe now encompasses two different invariants:

the pointers were allocated by Box (typing)
the data in the structs is valid Arrow (specification)

The former is safeguarded in Rust by strong typing, since it is quite easy to get it wrong (specially when implementations do offer APIs allocate their own pointers)
the latter is related to whether the other implementation fulfills the arrow specification, that is outside of the scope of Rust's typing or borrow checker.

The current API is catering for these two independent invariants by giving control over the pointers' lifetime to the user, using Rust typing to guide the user.

I do agree that we can improve the API, I am just pointing out that we should not do so at the expense of a higher human RAM usage about pointers origin. :)

Feb 19 '22 05:02 jorgecarleitao

I don't follow your whole argument, but it seems that you are mostly interested in keeping this API so that programmers can convince themselves that the export is safe. I don't particular subscribe to this philosophy, because we are already blindly trusting a lot of things about any given crate. I would rather the unsafe part be largely wrapped in solid library code rather than forcing the intellectual burden and also some safely risk onto the user.

Feb 19 '22 07:02 multimeric

I would rather the unsafe part be largely wrapped in solid library code

While I agree with this statement, but in this case, you can't warp the whole unsafe part in library code, (the Box::from_raw part). Blindly trusting 3rd-party crates is one thing, having to know the internal of a 3rd-party crate is another thing, especially relying an invariant the crates does not guarantee. I would argue warp half unsafe block into library code will increase the intellectual burden.

Besides safety concerns, passing a pointer increases flexibility. This allows the pointer not comes from a Box<T> but a Box<[T]> or Arc<T>, which are totally valid use cases.

Mar 10 '22 09:03 zirconium-n

I would rather the unsafe part be largely wrapped in solid library code rather than forcing the intellectual burden and also some safely risk onto the user.

I think #629 may help here? It basically describes a high-level, safe wrapper around the FFI between arrow and arrow2

Mar 15 '22 09:03 dbr

arrow2 arrow2 copied to clipboard

Higher level FFI abstraction

arrow2
arrow2 copied to clipboard