rust-numpy icon indicating copy to clipboard operation
rust-numpy copied to clipboard

Support array of &str / unicode_ / string

Open jorgecarleitao opened this issue 5 years ago • 2 comments

Like described, it would allow storing numpy native operations.

jorgecarleitao avatar Jul 13 '20 12:07 jorgecarleitao

Arrays containing PyString should already be supported as PyArray<PyObject>. Descriptors like <Un will probably need const generics though.

adamreichold avatar Mar 22 '22 17:03 adamreichold

The main problem with the <Un types that I see is that it is difficult to decide at compile time which T should be used in PyArray<T, D>. [Py_UCS4; N] would make sense from a memory layout point of view, but I don't think this too helpful as the maximum size of the elements is usually not know at compile time. But we do assume T: Sized in PyArray<T, D> so I do not really see how PyArray could be made to handle this case directly. :thinking:

adamreichold avatar Mar 24 '22 17:03 adamreichold

This would be really helpful to us if anyone is attempting to do it. I can give it a shot as well.

but I don't think this too helpful as the maximum size of the elements is usually not know at compile time

In our case (writing a parser) we know the value (e.g. it's 2 or 12) at compile time. And we now have const generics.

rachtsingh avatar May 11 '23 17:05 rachtsingh

This would be really helpful to us if anyone is attempting to do it.

From my point of view this is basically blocked on #186 which is blocked on more versatile buffer protocol support in PyO3, c.f. #321

In our case (writing a parser) we know the value (e.g. it's 2 or 12) at compile time.

Are you sure that NumPy will not dynamically use a size between 2 or 12 if the strings in the array are all shorter than 12 code points? Do you set the dtype explicitly when constructing the arrays?

And we now have const generics.

I would still advise on waiting a bit until we bump our MSRV from 1.48 to most likely 1.56 when Debian Bookworm is released as you will need to provide some build.rs-#[cfg(..)]-boilerplate to disable the support on Rust 1.48.

Of course, creating a draft PR which works expect for the MSRV build might still be a good thing to start discussing the work. I think

impl<const N: usize> Element for [Py_UCS4; N] { .. }

and a test of your use case would be the minimum required?

adamreichold avatar May 11 '23 17:05 adamreichold

Hmm, I can't see a relation between supporting e.g. a U2 arrays and record types (which are, in my opinion, not very standard practice in the numpy world), but I could be easily missing the context.

Are you sure that NumPy will not dynamically use a size between 2 or 12 if the strings in the array are all shorter than 12 code points? Do you set the dtype explicitly when constructing the arrays?

NumPy doesn't appear to automatically compact the dtype of arrays, i.e. it supports the case where the dtype is U<M and all strings are of max size N < M:

In [9]: x = np.array(['a', 'b'], dtype='<U12')

In [10]: x
Out[10]: array(['a', 'b'], dtype='<U12')

As to explicitly setting the dtype: I'm not entirely sure what you mean, but we can; our explicit use case would be writing (in Rust using PyO3/maturin) a function that takes a file and returns a numpy array of type <U12; raising an error on encountering a string with >12 code points would be fine here.

I would still advise on waiting a bit until we bump our MSRV from 1.48 to most likely 1.56

Makes sense, thanks for the context. Do you have a vague guess as to when that'll be?

I think ... and a test of your use case would be the minimum required?

Sounds good. This isn't a priority for us right now so I can't promise this will be done with any haste, but thanks for pointing out where to start.

rachtsingh avatar May 11 '23 19:05 rachtsingh

Hmm, I can't see a relation between supporting e.g. a U2 arrays and record types (which are, in my opinion, not very standard practice in the numpy world), but I could be easily missing the context.

The main thing is that we need a more general integration with Python's buffer protocol where we do not assume that the element type T of PyArray<T, D> has a fixed size known at compile time. As written above [Py_UCS; N] can be supported right now, but it is a pretty limited form of supporting unicode arrays.

As to explicitly setting the dtype: I'm not entirely sure what you mean

I meant you did, pass dtype='<U12' to the np.array constructor. Without it, one gets e.g.

>>> np.array(['a', 'b'])
array(['a', 'b'], dtype='<U1')

where NumPy automatically chooses the smallest possible <Un dtype.

Do you have a vague guess as to when that'll be?

Debian Bookworm is expected to be released on 2023-06-10 which is what we are waiting for to bump our MSRV.

adamreichold avatar May 11 '23 19:05 adamreichold

@rachtsingh Maybe #378 is already useful for you? Merging will have to wait for our MSRV bump though as discussed above.

adamreichold avatar May 14 '23 18:05 adamreichold