arrow2
arrow2 copied to clipboard
Higher level FFI abstraction
A certain other Arrow crate has a lovely method on the Array
that returns two pointers to FFI Array and Schema. Currently arrow2 does not have this, so users have to delve into the ugliness of creating a Box
and coercing it into a pointer. In practise all of that stuff is just boilerplate, so it would be nicer to be part of arrow2
to avoid users writing and rewriting it whenever they actually want to do Arrow interop.
I agree with the sentiment that the API is not ideal. However, my understanding is that the C data interface flow is expected to be as follows (see Member allocation):
- FFI structs are allocated by the consumer
- the pointers are passed to the producer, which populates them (including the release callback)
- the consumer reads from those pointers
- the consumer calls release when they are no longer needed
Array -> pointer
assumes that the producer is the one allocating the FFI struct, which imho deviates from the spec.
If that is the case, could we not provide a method that still does this, but takes two pointers as arguments?
That is a good idea. In this example we tried to illustrate this. Maybe that function should be moved to the crate?
Also, note how the field's name and nullability is not really needed; we only need the data_type.
Yes, making them part of the public API would be quite helpful. But also there's still the issue of the array allocation for import into Rust, ie this stuff: https://github.com/jorgecarleitao/arrow2/blob/c012c9d3095b03877bd5f610d4db746b728e9cba/examples/ffi.rs#L29-L37
I wonder if we could also create a function that just does this, and returns a tuple of (array_ptr, schema_ptr)
which users can then pass into the function provided by other arrow implementations?
Let me try to explain the rational for this API.
There is one, and only one source of unsoundness in FFI: that an implementation fails to meet the C data interface. This invariant is crucial in this region of the code base. Because FFI's interface is done via pointers, and pointers have very little compile time information about their origin, there is a second potential source of unsoundness here - that the user forgets about some invariant of the pointers.
The reason for the current API is that it makes it obvious that the call
https://github.com/jorgecarleitao/arrow2/blob/c012c9d3095b03877bd5f610d4db746b728e9cba/examples/ffi.rs#L44-L47
is safe (we just allocated them via a Box 5 lines above). If we write
let (array_ptr, schema_ptr) = arrow2::ffi::new_pointers();
...
let array_ptr = unsafe { Box::from_raw(array_ptr) };
let schema_ptr = unsafe { Box::from_raw(schema_ptr) };
it is now unclear why this particular call is safe; the user needs to know from the documentation that the pointers from new_pointers
are originated from a Box
.
If instead we offer the function
pub unsafe fn import_array(array: *mut Ffi_ArrowArray, schema: *mut Ffi_ArrowSchema) -> Result<Box<dyn Array>>
this unsafe
now encompasses two different invariants:
- the pointers were allocated by
Box
(typing) - the data in the structs is valid Arrow (specification)
- The former is safeguarded in Rust by strong typing, since it is quite easy to get it wrong (specially when implementations do offer APIs allocate their own pointers)
- the latter is related to whether the other implementation fulfills the arrow specification, that is outside of the scope of Rust's typing or borrow checker.
The current API is catering for these two independent invariants by giving control over the pointers' lifetime to the user, using Rust typing to guide the user.
I do agree that we can improve the API, I am just pointing out that we should not do so at the expense of a higher human RAM usage about pointers origin. :)
I don't follow your whole argument, but it seems that you are mostly interested in keeping this API so that programmers can convince themselves that the export is safe. I don't particular subscribe to this philosophy, because we are already blindly trusting a lot of things about any given crate. I would rather the unsafe part be largely wrapped in solid library code rather than forcing the intellectual burden and also some safely risk onto the user.
I would rather the unsafe part be largely wrapped in solid library code
While I agree with this statement, but in this case, you can't warp the whole unsafe part in library code, (the Box::from_raw
part). Blindly trusting 3rd-party crates is one thing, having to know the internal of a 3rd-party crate is another thing, especially relying an invariant the crates does not guarantee. I would argue warp half unsafe block into library code will increase the intellectual burden.
Besides safety concerns, passing a pointer increases flexibility. This allows the pointer not comes from a Box<T>
but a Box<[T]>
or Arc<T>
, which are totally valid use cases.
I would rather the unsafe part be largely wrapped in solid library code rather than forcing the intellectual burden and also some safely risk onto the user.
I think #629 may help here? It basically describes a high-level, safe wrapper around the FFI between arrow and arrow2