ndarray
Adoption of the Apache Arrow memory alignment and padding?
Hi,
I'm just trying to get a sense of the level of interest from the ndarray developers regarding adopting the Apache Arrow memory layout and padding.
I have been wanting to build integrations between Arrow and ndarray for some time. Today it should be easy enough to build a zero-copy converter to ndarray types. Arrow has a tensor type and this could be converted (with the optional names for dimensions in Arrow dropped).
However, without guarantees over the memory alignment and padding assumptions you could not go back to Arrow with zero-copy. The easiest way to do this would be for ndarray to use the Arrow functions that allocate memory through the Arrow Buffer type.
Arrow is attempting to make integrations between crates easier; I noticed this issue today. This is the kind of issue we could avoid.
In general, I think that Arrow and ndarray fit together quite nicely: Arrow could provide a lot of help processing data, and ndarray provides all the algorithms once data is cleaned and in-memory.
I'm not very familiar with the ndarray codebase, if this sounds like a good idea could you point me to where you allocate memory etc. and any other information that might help?
@paddyhoran if this happens would it make sense to use ndarray-stats for Arrow compute kernels? (I hope this is the good name for min, max, sum, count etc functions).
> @paddyhoran if this happens would it make sense to use ndarray-stats for Arrow compute kernels?
Yes, perhaps. We started to implement some compute kernels in Arrow, using SIMD, etc. However, the Rust community is relatively small to begin with. `ndarray` is the most established numerical library in the Rust ecosystem, so seamless integration between the two would have a lot of mutual benefits, for instance:
- people could contribute to `ndarray-stats` instead of building a stats package that works with Arrow
- `ndarray` leveraging Arrow to use `parquet`, Arrow Flight, etc.
There are probably some things that we need to resolve, for instance:
- handling of nulls: Arrow has a bitmap; does `ndarray` use a sentinel value?
- Arrow boolean arrays are bit packed; `ndarray`'s might not be

I think integration between the two could open up a lot of possibilities on both sides, but I'm not sure how the `ndarray` devs feel?
Looking at this purely from an ecosystem integration perspective, it would be amazing!
It would make it seamless to interoperate Rust code using `ndarray` with systems developed in other language ecosystems using Apache Arrow (an ever-growing list as far as I can see, especially in the ML/Big Data space).
From a technical perspective, I am afraid I don't have enough mastery of the nitty-gritty details of `ndarray`'s internal memory representation (or Arrow's :sweat_smile:) to judge if there might be issues/incompatibilities.
@jturner314 / @bluss are probably the best suited to give a high-level feasibility judgement - I'd be happy to work on this myself if it's indeed viable.
With respect to your questions:
> handling of Nulls, Arrow has a bitmap, does ndarray use a sentinel value?

You would generally use `Option<T>` if you are handling nullable values.

> Arrow Boolean Arrays are bit packed, ndarray's might not

I am not aware of any optimization regarding boolean arrays in `ndarray`.
Hi Everyone,
I'm working on dictionary support (categories/factors) for the Rust Arrow implementation.
The simplest thing would be to support bit-packed bool vectors in ndarray. These could be zipped with the data vectors.
I'm proposing to write iterator support for the Arrow arrays in the near future, which could include the bit arrays. The trick is to get the loops to vectorise.
It would be nice if there were an intrinsic in Rust to assert pointer alignment. There may be such a thing, as it does exist in LLVM.
Arrow's 8-byte alignment is enough for pretty much every SIMD we are likely to encounter, except for PowerPC.
Andy.
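There is no stable "assume aligned" intrinsic in Rust today, but as a sketch of what is possible on stable, `slice::align_to` can split a byte slice into an aligned middle part that the compiler is then free to process on wider lanes (function name and logic here are my own illustration, not from either crate):

```rust
// Sketch: stable Rust has no direct "assume aligned" intrinsic, but
// `slice::align_to` splits a byte slice into an unaligned prefix, an
// aligned middle (reinterpreted as a wider type), and a suffix, which
// gives the optimizer an alignment guarantee on the middle part.
fn sum_bytes(bytes: &[u8]) -> u64 {
    // Safety: reinterpreting u8 data as u64 is valid for any bit pattern.
    let (prefix, middle, suffix) = unsafe { bytes.align_to::<u64>() };
    let mut total: u64 = prefix.iter().map(|&b| b as u64).sum();
    for &word in middle {
        // Sum the eight bytes packed in each aligned u64 word.
        total += word.to_le_bytes().iter().map(|&b| b as u64).sum::<u64>();
    }
    total + suffix.iter().map(|&b| b as u64).sum::<u64>()
}

fn main() {
    let data = vec![1u8; 100];
    println!("sum = {}", sum_bytes(&data));
}
```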
> people could contribute to ndarray-stats instead of building a stats package that works with Arrow

@paddyhoran Exactly, this is why I suggested the cross-dependency. While both Rust Arrow and ndarray(-stats) are awesome, they don't have the "critical mass" yet; together they could be much better 🤞
X86 is generally the best of all CPUs for alignment requirements, as it will access data at any alignment in at most two beats with relatively little overhead. The exception may be "exotic" memory, such as GPU buffers and DMA I/O spaces.
ARM can be a little fussy, but generally 4 bytes is good enough. Again, this depends very much on the flavour of ARM CPU and bus. Also the LLVM implementation is super-conservative and generates terrible code. I'm talking to Philippe the next time I'm in Cambridge about this.
PPC is always the hard one as several instructions are needed to make an unaligned load or store.
Worth noting the benefits of non-temporal stores on X86, which significantly improve store speed, which will otherwise become a bottleneck for all vector operations. Hard to achieve in Rust, as LLVM is resistant to the concept beyond basic intrinsic support. You need to put a fence at the end of the loop to make it thread safe.
But on the whole 128 or 256 bit alignment is likely to be a win.
The current format of `ndarray`'s `ArrayBase` type is provided here. For owned arrays, the `data` field has type `OwnedRepr<A>`, which is just a thin wrapper around `Vec<A>`.
The best resources I've found about the Apache Arrow Tensor format are the following:
- Tensor (Multi-dimensional Array)
- Tensor.fbs
- implementation of conversions between NumPy arrays and Arrow Tensors
From what I can tell, the only restrictions on the memory layout of the data are that:
- The address in bytes of the first element must be a multiple of 64 (i.e. 512-bit alignment).
- All strides must be non-negative.
Given this, creating an `ArrayBase` from an Arrow tensor should be possible as a zero-copy operation in the case where the strides are a multiple of the element size. (This should be the common case, but the format doesn't appear to require the strides to be a multiple of the element size, since they're specified in bytes and not element counts.)
Going the other way (from `ArrayBase` to an Arrow tensor) is not as straightforward. It looks like we need to ensure the alignment of the first element is a multiple of 64 bytes and return an error in the case of negative strides (that's how the Python Arrow implementation handles negative strides).
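To make the stride concern concrete, here is a hypothetical helper (name and signature are mine, not from either crate) that converts Arrow-style byte strides into the element strides `ArrayBase` expects, rejecting the negative and non-multiple cases discussed above:

```rust
// Hypothetical conversion of Arrow tensor byte strides to ndarray-style
// element strides. Returns None when a byte stride is negative or not a
// multiple of the element size -- the cases that block zero-copy.
fn byte_strides_to_elem_strides(byte_strides: &[i64], elem_size: i64) -> Option<Vec<usize>> {
    byte_strides
        .iter()
        .map(|&s| {
            if s >= 0 && s % elem_size == 0 {
                Some((s / elem_size) as usize)
            } else {
                None
            }
        })
        .collect()
}

fn main() {
    // Row-major 3x4 array of f64 (8-byte elements): byte strides [32, 8].
    println!("{:?}", byte_strides_to_elem_strides(&[32, 8], 8));
    // A byte stride that is not a multiple of 8 cannot be expressed.
    println!("{:?}", byte_strides_to_elem_strides(&[12, 8], 8));
}
```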
For owned arrays, we could allocate with 64-byte alignment when `ndarray` is performing the allocation, assuming this change does not have a large performance cost. This won't help when the data is allocated outside of `ndarray` (e.g. a user converting a `Vec` to an `Array`) or after slicing operations (which may make the first element not be 64-byte aligned). I'm surprised not to see any alignment checks in the conversion from a NumPy array to an Arrow tensor. How does the NumPy/Python implementation handle converting NumPy arrays to Arrow tensors in these cases (a NumPy array created from a user-provided buffer which may not be 64-byte aligned, or a NumPy array that has been sliced such that the first element is not 64-byte aligned)?
It would also be helpful to have a better understanding of the ownership semantics of the `arrow` crate's `Buffer` and `MutableBuffer` types. (At first glance based on the source code, it looks like they're intended to be analogous to `Arc<[T]>` and `Vec<T>`, but the `owned` field of `BufferData` on the `master` branch makes me question this.)
Hi @jturner314,
Thanks for the detailed response.
> creating an `ArrayBase` from an Arrow tensor should be possible as a zero-copy operation in the case where the strides are a multiple of the element size

Yes, I think this scenario should be relatively straightforward.

> This should be the common case, but the format doesn't appear to require the strides to be a multiple of the element size since they're specified in bytes and not element counts.

AFAIK the strides would always be a multiple of the element size, and the use of bytes in the strides is an implementation detail (which could be changed).
> Going the other way (from `ArrayBase` to an Arrow tensor) is not as straightforward. It looks like we need to ensure the alignment of the first element is a multiple of 64 bytes and return an error in the case of negative strides (that's how the Python Arrow implementation handles negative strides).

Yes, this is where adoption of the Arrow functions that allocate memory would help. If `ndarray` could be changed to use the functions here, then memory would already be aligned and could be passed to Arrow zero-copy. Of course, you can't ensure aligned memory from user input; alignment would be checked at conversion time. I'm not sure about the `numpy` case and why we don't check the alignment; I'm not too familiar with the Python implementation. Another consideration is that Arrow pads to 64 bytes, which simplifies some SIMD code, i.e. SIMD can operate on this padded region without worry (initially we should not worry about these details in the Arrow -> ndarray case).
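The conversion-time check mentioned here could be as simple as the following sketch (the error type and function are illustrative, not actual ndarray or arrow APIs):

```rust
// Sketch of a conversion-time check: zero-copy export to Arrow only
// when the first element is 64-byte aligned; otherwise the caller must
// copy into fresh Arrow-allocated memory.
// (Error type and function name are illustrative, not real APIs.)
#[derive(Debug, PartialEq)]
enum ExportError {
    Unaligned,
}

fn check_arrow_alignment<T>(first_elem: *const T) -> Result<(), ExportError> {
    if (first_elem as usize) % 64 == 0 {
        Ok(())
    } else {
        Err(ExportError::Unaligned)
    }
}

fn main() {
    let buf = vec![0u8; 128];
    let base = buf.as_ptr();
    // Round up to the next 64-byte boundary inside the buffer.
    let aligned_up = base.wrapping_add(base.align_offset(64));
    println!("{:?}", check_arrow_alignment(aligned_up));
    // One byte past a 64-byte boundary is unaligned.
    println!("{:?}", check_arrow_alignment(aligned_up.wrapping_add(1)));
}
```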
> It would also be helpful to have a better understanding of the ownership semantics of the `arrow` crate's `Buffer` and `MutableBuffer` types. (At first glance based on the source code, it looks like they're intended to be analogous to `Arc<[T]>` and `Vec<T>`, but the `owned` field of `BufferData` on the `master` branch makes me question this.)
My understanding is as follows:

- `MutableBuffer` is purely for creating `Buffer` objects, i.e. in `BufferBuilder`, and is never used as the buffer backing an array. All data in Arrow is immutable.
- All arrays are backed by the `ArrayData` type and are reference counted, i.e. operations like slicing update the `offset` and don't allocate.
- The `owned` field of `BufferData` was introduced relatively recently and was added to aid in situations where data had been allocated outside of Arrow, AFAIK; I need to look into this.
I think I will start small and add an implementation of converting from an Arrow tensor to `ArrayBase`, and we can expand from there and work through the details.
One key question is in regard to how `ndarray` represents null values, i.e. `NaN`, etc. I'm assuming you use a sentinel value?
> If `ndarray` could be changed to use the functions here, then memory would already be aligned and could be passed to Arrow zero-copy.

My concern is that simply allocating with 64-byte alignment does not ensure that the first element of an array will be 64-byte aligned, due to the possibility of slicing. Consider the following Python example:
```python
import numpy as np

def is_first_elem_64_byte_aligned(array):
    return array.__array_interface__['data'][0] % 64 == 0

aligned = np.zeros(10, dtype=np.uint8)
print(is_first_elem_64_byte_aligned(aligned))
unaligned = aligned[1:]
print(is_first_elem_64_byte_aligned(unaligned))
```
The example prints
```
True
False
```
In other words, the allocation is 64-byte aligned, but the first element of the array `unaligned` is not 64-byte aligned, which I understand to be a requirement for (zero-copy) creating an Arrow tensor. That's why I'm curious how NumPy handles this, because the same issue would occur for `ndarray`. (Maybe this is what the `offset` field of `Buffer` is for?)
> One key question is in regard to how `ndarray` represents null values, i.e. `NaN`, etc. I'm assuming you use a sentinel value?
`ndarray` doesn't have explicit support for null/invalid/missing values. (`ndarray` doesn't have an equivalent of NumPy's masked arrays feature, which I think is what you're asking about?) For arrays of floating-point (`f32`/`f64`), you can use `NaN`. For any arbitrary type `T`, you can make the element type of the array be `Option<T>` so that missing values can be represented as `None`, although this probably isn't great for performance. For non-floating-point element types, I would think using a separate mask array would provide better performance.
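As a sketch of the two representations mentioned above (plain `Vec`s are used for brevity; the same element types work in ndarray arrays, and the function names are mine):

```rust
// NaN as the missing-value sentinel: only works for floating point.
fn mean_nan_sentinel(data: &[f64]) -> f64 {
    let valid: Vec<f64> = data.iter().copied().filter(|x| !x.is_nan()).collect();
    valid.iter().sum::<f64>() / valid.len() as f64
}

// Option<T>: works for any element type, at some cost in memory/speed,
// since each element carries a discriminant instead of a shared bitmap.
fn mean_option(data: &[Option<f64>]) -> f64 {
    let valid: Vec<f64> = data.iter().filter_map(|x| *x).collect();
    valid.iter().sum::<f64>() / valid.len() as f64
}

fn main() {
    println!("{}", mean_nan_sentinel(&[1.0, f64::NAN, 3.0]));
    println!("{}", mean_option(&[Some(1.0), None, Some(3.0)]));
}
```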
> In other words, the allocation is 64-byte aligned, but the first element of the array `unaligned` is not 64-byte aligned, which I understand to be a requirement for (zero-copy) creating an Arrow tensor. That's why I'm curious how NumPy handles this, because the same issue would occur for `ndarray`. (Maybe this is what the `offset` field of `Buffer` is for?)
I'll need to look into how `numpy` handles this, but yes, this is what `offset` is for. In this case, the underlying data is aligned but the offset is updated. As all data in Arrow is immutable, we can share it zero-copy in this way.
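A minimal sketch of the offset mechanism described here, with hypothetical types rather than arrow's actual definitions:

```rust
use std::sync::Arc;

// Minimal sketch (hypothetical types) of how an `offset` into an
// immutable, reference-counted allocation allows zero-copy slicing:
// the backing allocation stays 64-byte aligned even when a view
// starting mid-buffer is not.
#[derive(Clone)]
struct SharedBuffer {
    data: Arc<Vec<u8>>, // immutable backing allocation
    offset: usize,      // start of this view, in bytes
}

impl SharedBuffer {
    fn slice(&self, from: usize) -> SharedBuffer {
        // No allocation, no copy: just bump the offset.
        SharedBuffer { data: Arc::clone(&self.data), offset: self.offset + from }
    }
    fn as_slice(&self) -> &[u8] {
        &self.data[self.offset..]
    }
}

fn main() {
    let buf = SharedBuffer { data: Arc::new(vec![0u8; 64]), offset: 0 };
    let view = buf.slice(1);
    println!("view length: {}", view.as_slice().len());
}
```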
As you mentioned on #776, I think there's not much you can do about the alignment of slices, and ensuring that the first element is aligned at allocation covers a lot of common use cases. Technically, in your numpy example above you could have SIMD operate on the underlying data (without the `offset`), as it is still aligned, and return the result with the `offset`. This is getting into the weeds though, as it depends on the size of the slice relative to the size of the original array and is probably not worth worrying about.
I think you are right about investigating how `numpy` handles all these concerns. Thanks for all the detailed responses. I won't take any more of your time; I need to find time to do my own research and open some draft PRs.
Thanks again
Just one note before you draft a PR -- I'd prefer for `ndarray` not to have a required dependency on the `arrow` crate, since it includes lots of functionality unnecessary for `ndarray`. Functions/types for 64-byte aligned allocation should be simple enough to put directly in `ndarray` instead of using `arrow`'s `Buffer`. (Prototypes depending on `arrow` are fine for discussion purposes and performance evaluation, but I don't plan to merge a PR with a required dependency on `arrow`.) Conversions to and from `arrow` types can be provided with an optional dependency on `arrow`.
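For reference, a standalone 64-byte-aligned allocation needs only `std::alloc`; a sketch (not ndarray's actual code, and a real implementation would wrap this in an owning type):

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

// Sketch of a standalone 64-byte-aligned allocation, the kind of helper
// that could live in ndarray without depending on the arrow crate.
fn alloc_aligned_64(size: usize) -> (*mut u8, Layout) {
    let layout = Layout::from_size_align(size, 64).expect("invalid layout");
    // Safety: `layout` has a non-zero size here.
    let ptr = unsafe { alloc(layout) };
    if ptr.is_null() {
        // Abort on allocation failure instead of handing out a null pointer.
        handle_alloc_error(layout);
    }
    (ptr, layout)
}

fn main() {
    let (ptr, layout) = alloc_aligned_64(1024);
    println!("64-byte aligned: {}", ptr as usize % 64 == 0);
    // Must be freed with the same layout it was allocated with.
    unsafe { dealloc(ptr, layout) };
}
```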
Yep, makes total sense. Thanks @jturner314
> Just one note before you draft a PR -- I'd prefer for `ndarray` not to have a required dependency on the `arrow` crate, since it includes lots of functionality unnecessary for `ndarray`. Functions/types for 64-byte aligned allocation should be simple enough to put directly in `ndarray` instead of using `arrow`'s `Buffer`. (Prototypes depending on `arrow` are fine for discussion purposes and performance evaluation, but I don't plan to merge a PR with a required dependency on `arrow`.) Conversions to and from `arrow` types can be provided with an optional dependency on `arrow`.
That makes sense if Arrow's memory allocation algorithm will never change. If it will, `ndarray` will need to monkey-patch it. If Arrow's `memory.rs` interface is public and stable but may change its internals, then it might be better to use it instead of writing your own memory allocation code. The ideal situation would be if the Arrow project created a lightweight `arrow-integration` crate which could be used for that purpose without pulling in the unnecessary functionality @jturner314 had mentioned.
Integrating with Arrow seems very promising. The only "disappointment" is that Arrow focuses on columnar data, and its multidimensional type - Tensor - does not have many features or much focus.
The default owned `Array` would want to be able to abstract away the allocation strategy, but is today in practice tied to `Vec`. In practice it offers zero-copy conversion from `Vec<T>` (important) and also in the other direction (less important). This is not simultaneously compatible with a higher alignment requirement or with using an Arrow allocator.
In the long run I'd like to do maybe A and definitely B.

A) Remove the strong ties between ndarray's `Array` type and the Rust `Vec`
B) Traitify ndarray more, so that the use of the concrete types - `ArrayBase`, `Array`, etc. - becomes much less important

Unfortunately I haven't had much time for ndarray. My time & interest remains with the fundamentals of ndarray - like these questions.
I'll say that if a PR doesn't address (A), then an arrow-allocated array needs to use a different type; for example, it could be a new storage type.
I've been reading the Buffer code -- from docs.rs/arrow -- just to understand a bit, and I can't help but mention the various questions I have. All after a superficial understanding of the code, unfortunately. Still, I hope it helps. I'm using the principle that if I see something, I prefer to say something. cc @paddyhoran
I'm sorry that I am not signing up for the Arrow JIRA at this point, so I'll mention it here
Soundness bugs

- `Buffer::empty` creates a buffer with a null pointer, and in combination with the methods that create slices, for example `Buffer::data`, this is a soundness problem - slices must not be created with null pointers, not even if they have length 0. (This problem is classed as always UB, and should be fixed.)
- From what I can see, there is no check for allocation success, so any buffer can be created with a null pointer, which leads to soundness problems in most methods. Best look into using `std::alloc::handle_alloc_error` or alternatives. (This problem means that the `MutableBuffer` is not a safe abstraction, and it should preferably not be exposed as public API like this.)
- The safe function `Buffer::from_raw_parts`, in combination with most methods, allows dereferencing arbitrary pointers, including invalid pointers, or making reads of valid pointers that alias memory borrowed exclusively somewhere else, and so on. (This problem means that the buffer is not a safe abstraction, and it should preferably not be exposed as public API like this.)
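For the `Buffer::empty` case, the standard fix (sketched here, not arrow's actual code) is to build zero-length slices from a dangling, well-aligned, non-null pointer:

```rust
use std::ptr::NonNull;

// Sketch of the empty-buffer fix described above: a zero-length slice
// must be built from an aligned, non-null pointer, never from null.
fn empty_slice<'a>() -> &'a [u8] {
    // Safety: NonNull::dangling() yields a non-null, properly aligned
    // pointer, which is valid for slices of length 0.
    unsafe { std::slice::from_raw_parts(NonNull::<u8>::dangling().as_ptr(), 0) }
}

fn main() {
    println!("empty: {}", empty_slice().is_empty());
}
```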
> I've been reading the Buffer code -- from docs.rs/arrow -- just to understand a bit, and I can't help but mention the various questions I have. All after a superficial understanding of the code, unfortunately. Still, I hope it helps. I'm using the principle that if I see something, I prefer to say something. cc @paddyhoran
It absolutely does help! Thanks for taking the time to review.
> I'm sorry that I am not signing up for the Arrow JIRA at this point, so I'll mention it here
No worries, I don't blame you.
I think that most of what you are describing is fixed / addressed in this PR. Would you mind taking a quick look to confirm? I can open JIRAs for anything else.
> buffer is not a safe abstraction, and it should preferably not be exposed as public API like this.

I think we need to focus on how we can make it a safe abstraction. The `Buffer` type is central to how Arrow is designed, and there is an implementation in each language supporting Arrow.
> Integrating with Arrow seems very promising. The only "disappointment" is that Arrow focuses on columnar data, and its multidimensional type - Tensor - does not have many features or much focus.

This is true today (functionality may expand in the future); its main feature is that it is trivial to convert to other Arrow types without copying data. The tensor type is mostly used to integrate with other frameworks (initially tensorflow, etc. on the Python side).
The advantage would be that if you could zero-copy convert from `ndarray` to Arrow's tensor type, then you could save to Parquet or use Arrow Flight or anything else in the Arrow ecosystem.
As it stands, arrow's memory allocation functionality cannot be used to allocate a general ndarray owned array because of the hard-coded 64-byte alignment: it will be incorrect for element types that require alignment higher than 64, which are easy to create with the alignment attribute. This is one minor technical niggle. I suppose Arrow defines its own type system, so it doesn't necessarily need the same generality.
An arrow-allocated array in ndarray could have the same kind of restriction of element types, to only those permitted by the arrow model. Certainly conversions to Arrow types would be type system restricted in this way.
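The alignment attribute in question makes it trivial to define an element type whose alignment exceeds Arrow's fixed 64 bytes (hypothetical type, purely for illustration):

```rust
// An element type demanding 128-byte alignment: a general 64-byte-aligned
// Arrow allocation cannot correctly hold values of this type.
#[repr(align(128))]
#[derive(Clone, Copy)]
struct Overaligned([u8; 128]);

fn main() {
    println!("align = {}", std::mem::align_of::<Overaligned>());
    println!("size  = {}", std::mem::size_of::<Overaligned>());
}
```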
(I suppose this is a tiresome approach, since the nitpicks will never end, but this is part of working with `unsafe` Rust. ndarray as a crate certainly does not live up to all the rules that it might need to, and it's ongoing work to make it fall in line more with, for example, the stacked borrows model, best practices for uninit data, and so on; those rulesets are themselves works in progress. I feel like I had to write down #796, which has mostly been in my mind so far I think 🙂 )
Thanks again for your review and input @bluss. I opened ARROW-8480 to track the remaining issue you found.
ARROW-8480 has now been resolved (https://issues.apache.org/jira/browse/ARROW-8480), but it uses an unstable `AllocRef` API that's part of the `allocator_api` feature.
@nevi-me It doesn't really look fixed, unfortunately. `impl<T: AsRef<[u8]>> From<T> for Buffer` still uses an allocation without checking for success, so it's a use of a null pointer that's easy to spot. And in the Buffer/MutableBuffer, null is silently stored on allocation failure, which is a design I don't agree with; it's too easy to have issues left after that. `MutableBuffer::new_null` looks like it has exactly that issue: use of a null pointer. I was reading code from current master.
How about `arrow2`, which uses `std::Vec` as the backend?
https://crates.io/crates/arrow2
Any plans on implementing this for arrow2?