
Adoption of the Apache Arrow memory alignment and padding?

Open paddyhoran opened this issue 5 years ago • 24 comments

Hi,

I'm just trying to get a sense of the level of interest from the ndarray developers regarding adopting the Apache Arrow memory layout and padding.

I have been wanting to build integrations between Arrow and ndarray for some time. Today it should be easy enough to build a zero-copy converter to ndarray types. Arrow has a tensor type and this could be converted (with the optional names for dimensions in Arrow dropped).

However, without guarantees over the memory alignment and padding assumptions you could not go back to Arrow with zero-copy. The easiest way to do this would be for ndarray to use the Arrow functions that allocate memory through the Arrow Buffer type.

Arrow is attempting to make integrations between crates easier; I noticed this issue today, and it's the kind of issue we could avoid.

In general, I think that Arrow and ndarray fit together quite nicely: Arrow could provide a lot of help processing data, and ndarray provides all the algorithms once the data is cleaned and in memory.

I'm not very familiar with the ndarray codebase. If this sounds like a good idea, could you point me to where you allocate memory, etc., and to any other information that might help?

paddyhoran avatar Dec 30 '19 20:12 paddyhoran

@paddyhoran if this happens, would it make sense to use ndarray-stats for Arrow compute kernels? (I hope this is the right name for the min, max, sum, count, etc. functions.)

alippai avatar Jan 04 '20 17:01 alippai

@paddyhoran if this happens would it make sense to use ndarray-stats for Arrow compute kernels?

Yes, perhaps. We have started to implement some compute kernels in Arrow, using SIMD, etc. However, the Rust community is relatively small to begin with. ndarray is the most established numerical library in the Rust ecosystem, so seamless integration between the two would have a lot of mutual benefits, for instance:

  • people could contribute to ndarray-stats instead of building a stats package that works with Arrow
  • ndarray leveraging Arrow to use parquet, Arrow Flight, etc.

There are probably some things that we need to resolve, for instance:

  • handling of Nulls, Arrow has a bitmap, does ndarray use a sentinel value?
  • Arrow Boolean Arrays are bit packed, ndarray's might not

I think integration between the two could open up a lot of possibilities on both sides but I'm not sure how the ndarray devs feel?

paddyhoran avatar Jan 04 '20 18:01 paddyhoran

Looking at this purely from an ecosystem integration perspective, it would be amazing! It would make it seamless to interoperate Rust code using ndarray with systems developed in other language ecosystems using Apache Arrow (an ever-growing list as far as I can see, especially in the ML/Big Data space).

From a technical perspective, I am afraid I don't have enough mastery of the nitty-gritty details of ndarray's internal memory representation (or Arrow's :sweat_smile:) to judge if there might be issues/incompatibilities. @jturner314 / @bluss are probably the best suited to give a high-level feasibility judgement - I'd be happy to work on this myself if it's indeed viable.

With respect to your questions:

handling of Nulls, Arrow has a bitmap, does ndarray use a sentinel value?

You would generally use Option<T> if you are handling nullable values.

Arrow Boolean Arrays are bit packed, ndarray's might not

I am not aware of any optimization regarding boolean arrays in ndarray.

LukeMathWalker avatar Jan 05 '20 11:01 LukeMathWalker

Hi Everyone,

I'm working on dictionary support (categories/factors) for the rust arrow implementation.

The simplest thing would be to support bit packed bool vectors in ndarray. These could be zipped with the data vectors.

I'm proposing to write iterator support for the arrow arrays in the near future which could include the bitarrays. The trick is to get the loops to vectorise.
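
To make the idea concrete, here is a minimal sketch of unpacking an Arrow-style validity bitmap (least-significant bit first within each byte) into a bool iterator that can be zipped with the data; bitmap_iter is a hypothetical helper for illustration, not an existing arrow or ndarray API:

/// Iterate over an Arrow-style validity bitmap (LSB bit ordering), yielding
/// one bool per logical element. Hypothetical helper for illustration only.
fn bitmap_iter(bitmap: &[u8], len: usize) -> impl Iterator<Item = bool> + '_ {
    (0..len).map(move |i| (bitmap[i / 8] >> (i % 8)) & 1 == 1)
}

fn main() {
    let bitmap = [0b0000_0101u8]; // elements 0 and 2 are valid
    let data = [1.0f64, 2.0, 3.0];
    // Zip the unpacked validity bits with the data values.
    for (valid, value) in bitmap_iter(&bitmap, data.len()).zip(data.iter()) {
        println!("{value} (valid: {valid})");
    }
}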

It would be nice if there were an intrinsic in Rust to assert pointer alignment. There may be such a thing, as it does exist in LLVM.

Arrow's 8-byte alignment is enough for pretty much every SIMD we are likely to encounter, except for PowerPC.

Andy.

andy-thomason avatar Jan 06 '20 10:01 andy-thomason

people could contribute to ndarray-stats instead of building a stats package that works with Arrow

@paddyhoran Exactly, this is why I suggested the cross-dependency. While both Rust Arrow and ndarray(-stats) are awesome, they don't have the "critical mass" yet; together they could be much better 🤞

alippai avatar Jan 11 '20 23:01 alippai

X86 is generally the best of all CPUs for alignment requirements, as it will access data at any alignment with relatively little overhead (at most two memory beats). The exception may be "exotic" memory, such as GPU buffers and DMA I/O spaces.

ARM can be a little fussy, but generally 4 bytes is good enough. Again, this depends very much on the flavour of ARM CPU and bus. Also, the LLVM implementation is super-conservative and generates terrible code; I'm going to talk to Philippe about this the next time I'm in Cambridge.

PPC is always the hard one as several instructions are needed to make an unaligned load or store.

Worth noting the benefits of non-temporal stores on X86, which significantly improve store speed; stores would otherwise become a bottleneck for all vector operations. This is hard to achieve in Rust, as LLVM is resistant to the concept beyond basic intrinsic support. You need to put a fence at the end of the loop to make it thread safe.
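
For illustration, a rough sketch of that intrinsic-plus-fence pattern, assuming an AVX-capable x86-64 target and a destination that is 32-byte aligned with a length that is a multiple of 8; fill_nontemporal is a hypothetical helper, not existing library code:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn fill_nontemporal(dst: &mut [f32], value: f32) {
    use std::arch::x86_64::*;
    // Caller must guarantee: dst's pointer is 32-byte aligned and
    // dst.len() is a multiple of 8 (one __m256 holds eight f32 lanes).
    let v = _mm256_set1_ps(value);
    let mut p = dst.as_mut_ptr();
    for _ in 0..dst.len() / 8 {
        _mm256_stream_ps(p, v); // non-temporal store: bypasses the cache
        p = p.add(8);
    }
    _mm_sfence(); // fence so the streamed stores are visible to other threads
}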

But on the whole 128 or 256 bit alignment is likely to be a win.

andy-thomason avatar Jan 20 '20 10:01 andy-thomason

The current format of ndarray's ArrayBase type is provided here. For owned arrays, the data field has type OwnedRepr<A>, which is just a thin wrapper around Vec<A>.

The best resources I've found about the Apache Arrow Tensor format are the following:

From what I can tell, the only restrictions on the memory layout of the data are that:

  • The address in bytes of the first element must be a multiple of 64 (i.e. 512-bit alignment).
  • All strides must be non-negative.

Given this, creating an ArrayBase from an Arrow tensor should be possible as a zero-copy operation in the case where the strides are a multiple of the element size. (This should be the common case, but the format doesn't appear to require the strides to be a multiple of the element size since they're specified in bytes and not element counts.)
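
As a rough sketch of that Arrow-to-ndarray direction (assuming the tensor's byte strides have already been converted to element counts; view_from_raw is an illustrative name, not an existing API in either crate):

use ndarray::{ArrayViewD, IxDyn, ShapeBuilder};

/// Build an ndarray view over memory owned elsewhere (e.g. an Arrow tensor
/// buffer). The caller must guarantee that ptr is valid for the given
/// shape/strides (strides here are element counts, not bytes) and that the
/// underlying buffer outlives the returned view.
unsafe fn view_from_raw<'a, T>(
    ptr: *const T,
    shape: &[usize],
    strides: &[usize],
) -> ArrayViewD<'a, T> {
    ArrayViewD::from_shape_ptr(IxDyn(shape).strides(IxDyn(strides)), ptr)
}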

Going the other way (from ArrayBase to an Arrow tensor) is not as straightforward. It looks like we need to ensure that the first element is 64-byte aligned and return an error in the case of negative strides (that's how the Python Arrow implementation handles negative strides).
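
A sketch of what such a check might look like (is_arrow_exportable is a hypothetical name, not part of either crate):

use ndarray::{ArrayBase, Data, Dimension};

/// Check whether an array could be handed to Arrow without copying: the first
/// element must sit on a 64-byte boundary and no stride may be negative.
fn is_arrow_exportable<A, S, D>(array: &ArrayBase<S, D>) -> bool
where
    S: Data<Elem = A>,
    D: Dimension,
{
    let aligned = (array.as_ptr() as usize) % 64 == 0;
    let no_negative_strides = array.strides().iter().all(|&s| s >= 0);
    aligned && no_negative_strides
}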

For owned arrays, we could allocate with 64-byte alignment when ndarray is performing the allocation, assuming this change does not have a large performance cost. This won't help when the data is allocated outside of ndarray (e.g. a user converting a Vec to an Array) or after slicing operations (which may make the first element not be 64-byte aligned). I'm surprised not to see any alignment checks in the conversion from a NumPy array to an Arrow tensor. How does the NumPy/Python implementation handle converting NumPy arrays to Arrow tensors in these cases (a NumPy array created from a user-provided buffer which may not be 64-byte aligned or a NumPy array that has been sliced such that the first element is not 64-byte aligned)?

It would also be helpful to have a better understanding of the ownership semantics of the arrow crate's Buffer and MutableBuffer types. (At first glance based on the source code, it looks like they're intended to be analogous to Arc<[T]> and Vec<T>, but the owned field of BufferData on the master branch makes me question this.)

jturner314 avatar Jan 20 '20 22:01 jturner314

Hi @jturner314,

Thanks for the detailed response.

creating an ArrayBase from an Arrow tensor should be possible as a zero-copy operation in the case where the strides are a multiple of the element size

Yes, I think this scenario should be relatively straightforward.

This should be the common case, but the format doesn't appear to require the strides to be a multiple of the element size since they're specified in bytes and not element counts.

AFAIK the strides would always be a multiple of the element size and the use of bytes in the strides is an implementation detail (which could be changed).

Going the other way (from ArrayBase to an Arrow tensor) is not as straightforward. It looks like we need to ensure that the first element is 64-byte aligned and return an error in the case of negative strides (that's how the Python Arrow implementation handles negative strides).

Yes, this is where adoption of the Arrow functions that allocate memory would help. If ndarray could be changed to use the functions here, then memory would already be aligned and could be passed to Arrow zero-copy. Of course, you can't ensure aligned memory from user input; alignment would be checked at conversion time. I'm not sure about the numpy case and why we don't check the alignment there; I'm not too familiar with the Python implementation. Another consideration is that Arrow pads to 64 bytes, which simplifies some SIMD code, i.e. SIMD can operate on the padded region without worry (initially we should not worry about these details in the Arrow -> ndarray case).

It would also be helpful to have a better understanding of the ownership semantics of the arrow crate's Buffer and MutableBuffer types. (At first glance based on the source code, it looks like they're intended to be analogous to Arc<[T]> and Vec<T>, but the owned field of BufferData on the master branch makes me question this.)

My understanding is as follows:

  • MutableBuffer is purely for creating Buffer objects, i.e. in BufferBuilder, and is never used as the buffer backing an array. All data in Arrow is immutable.
  • All arrays are backed by the ArrayData type and are reference counted, i.e. operations like slicing update the offset and don't allocate
  • the owned field of BufferData was introduced relatively recently; AFAIK it was added to help in situations where data had been allocated outside of Arrow. I need to look into this.

I think I will start small and add an implementation of converting from Arrow tensor to ArrayBase and we can expand from there and work through the details.

paddyhoran avatar Jan 22 '20 18:01 paddyhoran

One key question is regarding how ndarray represents null values (NaN, etc.). I'm assuming you use a sentinel value?

paddyhoran avatar Jan 22 '20 18:01 paddyhoran

If ndarray could be changed to use the functions here then memory would already be aligned and could be passed to Arrow zero-copy.

My concern is that simply allocating with 64-byte alignment does not ensure that the first element of an array will be 64-byte aligned, due to the possibility of slicing. Consider the following Python example:

import numpy as np


def is_first_elem_64_byte_aligned(array):
    return array.__array_interface__['data'][0] % 64 == 0


aligned = np.zeros(10, dtype=np.uint8)
print(is_first_elem_64_byte_aligned(aligned))

unaligned = aligned[1:]
print(is_first_elem_64_byte_aligned(unaligned))

The example prints

True
False

In other words, the allocation is 64-byte aligned, but the first element of the array unaligned is not 64-byte aligned, which I understand to be a requirement for (zero-copy) creating an Arrow tensor. That's why I'm curious how NumPy handles this, because the same issue would occur for ndarray. (Maybe this is what the offset field of Buffer is for?)
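
The same effect can be reproduced with ndarray today (a sketch; whether the initial Vec-backed allocation happens to land on a 64-byte boundary is up to the allocator, since ndarray does not request any extra alignment):

use ndarray::{s, Array1};

fn is_first_elem_64_byte_aligned<T>(ptr: *const T) -> bool {
    (ptr as usize) % 64 == 0
}

fn main() {
    // For a u8 array the allocator only guarantees 1-byte alignment, so this
    // may or may not print true.
    let aligned = Array1::<u8>::zeros(64);
    println!("{}", is_first_elem_64_byte_aligned(aligned.as_ptr()));

    // Slicing off the first element shifts the data pointer by one byte, so
    // the view's first element cannot be 64-byte aligned even if the
    // original allocation was.
    let unaligned = aligned.slice(s![1..]);
    println!("{}", is_first_elem_64_byte_aligned(unaligned.as_ptr()));
}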

One key question is regarding how ndarray represents null values (NaN, etc.). I'm assuming you use a sentinel value?

ndarray doesn't have explicit support for null/invalid/missing values. (ndarray doesn't have an equivalent of NumPy's masked arrays feature, which I think is what you're asking about?) For arrays of floating-point (f32/f64), you can use NaN. For any arbitrary type T, you can make the element type of the array be Option<T> so that missing values can be represented as None, although this probably isn't great for performance. For non-floating-point element types, I would think using a separate mask array would provide better performance.
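
For illustration, a sketch of both approaches with ndarray as it exists today:

use ndarray::{array, Array1};

fn main() {
    // Option<T> element type: explicit, but not ideal for performance or
    // memory layout.
    let with_options: Array1<Option<f64>> = array![Some(1.0), None, Some(3.0)];
    let sum: f64 = with_options.iter().flatten().sum();
    println!("sum ignoring None: {sum}");

    // Separate mask array, closer in spirit to Arrow's validity bitmap
    // (though stored as one bool per element rather than bit-packed).
    let data = array![1.0, 2.0, 3.0];
    let valid = array![true, false, true];
    let masked_sum: f64 = data
        .iter()
        .zip(valid.iter())
        .filter(|&(_, &v)| v)
        .map(|(&x, _)| x)
        .sum();
    println!("masked sum: {masked_sum}");
}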

jturner314 avatar Jan 24 '20 17:01 jturner314

In other words, the allocation is 64-byte aligned, but the first element of the array unaligned is not 64-byte aligned, which I understand to be a requirement for (zero-copy) creating an Arrow tensor. That's why I'm curious how NumPy handles this, because the same issue would occur for ndarray. (Maybe this is what the offset field of Buffer is for?)

I'll need to look into how numpy handles this, but yes, this is what offset is for. In this case, the underlying data is aligned but the offset is updated. As all data in Arrow is immutable, we can share it zero-copy in this way.

As you mentioned on #776, I think there's not much you can do about the alignment of slices, and ensuring that the first element is aligned at allocation covers a lot of common use cases. Technically, in your numpy example above you could have SIMD operate on the underlying data (without the offset), as it is still aligned, and return the result with the offset. This is getting into the weeds though, as it depends on the size of the slice relative to the size of the original array and is probably not worth worrying about.

I think you are right about investigating how numpy handles all these concerns. Thanks for all the detailed responses. I won't take any more of your time; I need to find time to do my own research and open some draft PRs.

Thanks again

paddyhoran avatar Jan 25 '20 14:01 paddyhoran

Just one note before you draft a PR -- I'd prefer for ndarray not to have a required dependency on the arrow crate since it includes lots of functionality unnecessary for ndarray. Functions/types for 64-byte aligned allocation should be simple enough to put directly in ndarray instead of using arrow's Buffer. (Prototypes depending on arrow are fine for discussion purposes and performance evaluation, but I don't plan to merge a PR with a required dependency on arrow.) Conversions to and from arrow types can be provided with an optional dependency on arrow.
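
A 64-byte-aligned allocation along those lines could be sketched with just the standard allocator (illustrative only, not how ndarray allocates today; a real version would live behind an owning type that deallocates with the same Layout and would be generic over the element type):

use std::alloc::{alloc_zeroed, handle_alloc_error, Layout};

/// Allocate a zeroed f64 buffer whose start is 64-byte aligned.
/// Sketch only; assumes len > 0 (zero-sized allocations must not go through
/// the global allocator).
fn alloc_aligned_f64(len: usize) -> *mut f64 {
    let layout = Layout::from_size_align(len * std::mem::size_of::<f64>(), 64)
        .expect("invalid layout");
    let ptr = unsafe { alloc_zeroed(layout) };
    if ptr.is_null() {
        // Never hand back a null pointer silently.
        handle_alloc_error(layout);
    }
    ptr as *mut f64
}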

jturner314 avatar Jan 26 '20 01:01 jturner314

Yep, makes total sense. Thanks @jturner314

paddyhoran avatar Jan 26 '20 02:01 paddyhoran

Just one note before you draft a PR -- I'd prefer for ndarray not to have a required dependency on the arrow crate since it includes lots of functionality unnecessary for ndarray. Functions/types for 64-byte aligned allocation should be simple enough to put directly in ndarray instead of using arrow's Buffer. (Prototypes depending on arrow are fine for discussion purposes and performance evaluation, but I don't plan to merge a PR with a required dependency on arrow.) Conversions to and from arrow types can be provided with an optional dependency on arrow.

That makes sense if Arrow's memory allocation algorithm will never change. If it does change, ndarray will need to monkey-patch it. If Arrow's memory.rs interface is public and stable but may change its internals, then it might be better to use it instead of writing your own memory allocation code. The ideal situation would be if the Arrow project created a lightweight arrow-integration crate which could be used for this purpose without pulling in the unnecessary functionality @jturner314 mentioned.

kdubovikov avatar Mar 22 '20 04:03 kdubovikov

Integrating with Arrow seems very promising. The only "disappointment" is that Arrow focuses on columnar data, and its multidimensional type - Tensor - does not have many features or much focus.

The default owned Array would ideally abstract away the allocation strategy, but today it is in practice tied to Vec. In practice it offers zero-copy conversion from Vec<T> (important) and also in the other direction (less important). This is not simultaneously compatible with a higher alignment requirement or with using an Arrow allocator.

In the long run I'd like to do maybe A and definitely B.

A) Remove the strong ties between ndarray's Array type and the Rust Vec.
B) Traitify ndarray more, so that the use of the concrete types (ArrayBase, Array, etc.) becomes much less important.

Unfortunately I haven't had much time for ndarray. My time & interest remains with the fundamentals of ndarray - like these questions.

I'll say that if a PR doesn't address (A), then an arrow allocated array needs to use a different type, for example it could be a new storage type.

bluss avatar Apr 11 '20 08:04 bluss

I've been reading the Buffer code -- from docs.rs/arrow -- just to understand a bit, and I can't help but mention the various questions I have. All after a superficial understanding of the code, unfortunately. Still, I hope it helps. I'm using the principle that if I see something, I prefer to say something. cc @paddyhoran

I'm sorry that I am not signing up for the Arrow JIRA at this point, so I'll mention it here

Soundness bugs

  • Buffer::empty creates a buffer with a null pointer, and in combination with the methods that create slices (for example Buffer::data) this is a soundness problem - slices must not be created from null pointers, not even if they have length 0. (This problem is classed as always UB, and should be fixed.)
  • From what I can see, there is no check for allocation success, so any buffer can be created with a null pointer, which leads to soundness problems in most methods. Best to look into using std::alloc::handle_alloc_error or alternatives - see the sketch after this list. (This problem means that MutableBuffer is not a safe abstraction, and it should preferably not be exposed as public API like this.)
  • The safe function Buffer::from_raw_parts, in combination with most methods, allows dereferencing arbitrary pointers, including invalid pointers, or making reads of valid pointers that alias memory borrowed exclusively somewhere else, and so on. (This problem means that the buffer is not a safe abstraction, and it should preferably not be exposed as public API like this.)
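
For reference, a minimal sketch of the safer patterns mentioned above - a dangling (well-aligned, non-null) pointer for empty buffers, and handle_alloc_error on allocation failure. This is illustrative only, not the arrow crate's actual code:

use std::alloc::{alloc, handle_alloc_error, Layout};
use std::ptr::NonNull;
use std::slice;

/// An empty buffer should be represented with a dangling (aligned, non-null)
/// pointer rather than a null pointer, so that zero-length slices stay sound.
fn empty_slice<'a>() -> &'a [u8] {
    unsafe { slice::from_raw_parts(NonNull::<u8>::dangling().as_ptr(), 0) }
}

/// Allocation should abort (or return an error) on failure instead of
/// silently storing a null pointer.
fn alloc_or_abort(len: usize) -> NonNull<u8> {
    assert!(len > 0, "zero-length buffers should use the dangling-pointer path");
    let layout = Layout::from_size_align(len, 64).expect("invalid layout");
    let ptr = unsafe { alloc(layout) };
    NonNull::new(ptr).unwrap_or_else(|| handle_alloc_error(layout))
}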

bluss avatar Apr 12 '20 09:04 bluss

I've been reading the Buffer code -- from docs.rs/arrow -- just to understand a bit, and I can't help but mention the various questions I have. All after a superficial understanding of the code, unfortunately. Still, I hope it helps. I'm using the principle that if I see something, I prefer to say something. cc @paddyhoran

It absolutely does help! Thanks for taking the time to review.

I'm sorry that I am not signing up for the Arrow JIRA at this point, so I'll mention it here

No worries. I don't blame you.

I think that most of what you are describing is fixed / addressed in this PR. Would you mind taking a quick look to confirm? I can open JIRAs for anything else.

buffer is not a safe abstraction, and it should preferably not be exposed as public API like this.

I think we need to focus on how we can make it a safe abstraction. The Buffer type is central to how Arrow is designed and there is an implementation in each language supporting Arrow.

paddyhoran avatar Apr 12 '20 17:04 paddyhoran

Integrating with Arrow seems very promising. The only "disappointment" is that Arrow focuses on columnar data, and its multidimensional type - Tensor - does not have many features or much focus.

This is true today (functionality may expand in the future); its main feature is that it is trivial to convert to other Arrow types without copying data. The tensor type is mostly used to integrate with other frameworks (initially TensorFlow, etc., on the Python side).

The advantage would be that if you could zero-copy convert from ndarray to Arrow's tensor type, then you could save to Parquet, use Arrow Flight, or anything else in the Arrow ecosystem.

paddyhoran avatar Apr 12 '20 17:04 paddyhoran

As it stands, arrow's memory allocation functionality cannot be used to allocate a general ndarray owned array because of the hard-coded 64-byte alignment: it would be incorrect for element types that require alignment higher than 64, which are easy to create with the alignment attribute. This is one minor technical niggle. I suppose Arrow defines its own type system, so it doesn't necessarily need the same generality.

An arrow-allocated array in ndarray could have the same kind of restriction on element types, limiting them to those permitted by the Arrow model. Certainly conversions to Arrow types would be restricted by the type system in this way.

(I suppose this is a tiresome approach, the nitpicks will never end, but this is part of working with unsafe Rust. ndarray as a crate certainly does not live up to all the rules that it might need to, and it's an ongoing work to make it fall in line more with for example the stacked borrows model, best practices for uninit data, and so on; and those rulesets are themselves works in progress. I feel like I had to write down #796, which has mostly been in my mind so far I think 🙂 )

bluss avatar Apr 13 '20 10:04 bluss

Thanks again for your review and input @bluss. I opened ARROW-8480 to track the remaining issue you found.

paddyhoran avatar Apr 16 '20 10:04 paddyhoran

ARROW-8480 has now been resolved (https://issues.apache.org/jira/browse/ARROW-8480), but the fix uses the unstable AllocRef API that's part of the allocator_api feature.

nevi-me avatar Oct 10 '20 13:10 nevi-me

@nevi-me It doesn't really look fixed, unfortunately. impl<T: AsRef<[u8]>> From<T> for Buffer still uses an allocation without checking for success - so it's a use of a null pointer that's easy to spot. And in Buffer/MutableBuffer, null is silently stored on allocation failure, which is a design I don't agree with; it is too easy to have issues left after that. MutableBuffer::new_null looks like it has exactly that issue - use of a null pointer. I was reading the code on current master.

bluss avatar Dec 01 '20 17:12 bluss

How about arrow2, which uses std::Vec as its backend?

https://crates.io/crates/arrow2

TYPEmber avatar Nov 14 '22 03:11 TYPEmber

Any plans on implementing this for arrow2?

daniellga avatar Jul 08 '23 02:07 daniellga