
NumPy interop to do list - `to_numpy`

Open stinodego opened this issue 1 year ago • 6 comments

We've made some improvements to our native to_numpy functionality recently. Making an issue to track what's still left:

  • [ ] Handle nested types (in Rust):
    • [x] Array: Handle nulls: https://github.com/pola-rs/polars/issues/14268
    • [ ] Struct: (?) Unnest and use DataFrame.to_numpy?
    • [x] List: (?) Explode and use the offsets as input for np.split?
  • [x] Handle non-nested types properly in the Rust bindings
    • [x] Decimal/Time: https://github.com/pola-rs/polars/pull/14296
    • [x] Make Datetime/Duration/Date directly return the correct NumPy type rather than creating a view afterwards in Python: https://github.com/pola-rs/polars/pull/14353
  • [x] Make sure things work correctly for chunked Series: #14340
  • [x] Make sure things work correctly for Datetimes with timezones: #14337
  • [x] Add dedicated error type for zero copy violations: https://github.com/pola-rs/polars/pull/14350
  • [x] Add option to DataFrame.to_numpy to raise on copy (like Series.to_numpy(zero_copy_only=True)) and to return a writeable array.
  • [ ] Add option to Series.to_numpy to allow structured output - relevant for handling Struct types.
  • [x] Get rid of the SeriesView class in Python, handle views differently.
  • [ ] Support output of masked arrays
  • [ ] Remove use_pyarrow parameter and default to native implementation (after native functionality is done)
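
The List idea in the checklist (explode, then split on the offsets) can be sketched in plain NumPy. The flat values and offsets below are made-up stand-ins for what an exploded List column and its Arrow-style offsets would provide; this is an illustration of the idea, not the Polars implementation.

```python
import numpy as np

# Hypothetical exploded values of a List column [[1, 2], [3], [4, 5, 6]]
values = np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)
# Arrow-style offsets: row i spans values[offsets[i]:offsets[i + 1]]
offsets = np.array([0, 2, 3, 6])

# np.split expects only the interior cut points, so drop the first and last offset
pieces = np.split(values, offsets[1:-1])

# Pack the ragged pieces into a 1D object array, one sub-array per row
out = np.empty(len(pieces), dtype=object)
for i, piece in enumerate(pieces):
    out[i] = piece
```

Filling a preallocated object array avoids relying on NumPy's inference for ragged input.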

Now that our to_numpy can handle things properly and zero copy where possible, I'm not sure the NumPy array interface protocol (https://github.com/pola-rs/polars/pull/14214) is still useful.

stinodego, Feb 07 '24 11:02

On this subject, could you take another look at #7283 and decide whether it's potentially useful or a definite no-go?

s-banach, Feb 14 '24 14:02

@stinodego I was looking at the Struct type to-do's and wondering if you guys have seen that Numpy has a similar structure for it: https://numpy.org/doc/stable/user/basics.rec.html

Being able to cast polars struct columns to numpy structured arrays would be helpful in our current project :)

TNieuwdorp, Apr 04 '24 15:04

@stinodego I was looking at the Struct type to-do's and wondering if you guys have seen that Numpy has a similar structure for it: numpy.org/doc/stable/user/basics.rec.html

Being able to cast polars struct columns to numpy structured arrays would be helpful in our current project :)

We are aware! You can already do this from DataFrames by setting structured=True. So if you want to export a struct Series, you can do s.struct.unnest().to_numpy(structured=True)

stinodego, Apr 04 '24 15:04

It would also be nice to be able to get NumPy arrays with dtype object for

pl.DataFrame({"A": [[1,2]], "B":1}, {"A": pl.Array(pl.Int64, 2), "B": pl.Int32}).to_numpy()

instead of this exception ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 2 and the array at index 1 has size 1

in the same way as this works

pl.DataFrame({"A": "as", "B":1}).to_numpy()
Out[21]: array([['as', 1]], dtype=object)
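
The requested fallback can be sketched in plain NumPy, assuming the Array column has already been converted to a 2D array and the Int32 column to a 1D array. The variable names are hypothetical and this is not how Polars does it internally.

```python
import numpy as np

col_a = np.array([[1, 2]], dtype=np.int64)  # one row holding a width-2 array
col_b = np.array([1], dtype=np.int32)       # one row holding a scalar

# The shapes differ, so the columns cannot be stacked directly.
# Instead, build a 2D object array with one cell per (row, column).
out = np.empty((len(col_b), 2), dtype=object)
for row in range(len(col_b)):
    out[row, 0] = col_a[row]  # the cell holds the whole sub-array
    out[row, 1] = col_b[row]  # the cell holds the scalar
```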

dpinol, Apr 22 '24 21:04

It would also be nice to be able to get np arrays with dtype np.object for ...

Yes, that should be part of our design for converting nested data.

stinodego, Apr 23 '24 07:04

Regarding the design for nested types, some of my thoughts:

For converting Series to NumPy...

  • Array types become an ND array with 2 or more dimensions. This is different from how PyArrow handles them (they create a 1D object array), but it allows us to do things zero-copy and it feels correct to respect the array dimensions here. However, it's slightly surprising that converting a Series to NumPy can result in an array with more than 1 dimension (and complicates things, see below).

  • If we accept that Series can have more than 1 dimension, I think Struct types should also become an ND array with 2 (or more) dimensions. A Struct Series would be unnested and then converted via DataFrame.to_numpy. So, for example, it may become a 2D object array if it contains an Int8 and a String field, or it may become a 2D float64 array if it contains an Int32 and a Float64 field.

  • List types do not pose a problem as they become 1D object arrays.
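
The dtype outcomes described for Struct fields can be illustrated with plain NumPy, treating each field as an already-converted 1D array. The field values are made up, and the snippet only mirrors the described result, not the Polars code path.

```python
import numpy as np

# Int32 + Float64 fields: NumPy promotes to a common numeric dtype
ints = np.array([1, 2, 3], dtype=np.int32)
floats = np.array([0.5, 1.5, 2.5], dtype=np.float64)
numeric = np.column_stack([ints, floats])  # 2D float64 array

# Int8 + String fields: no common numeric dtype, so fall back to object
small = np.array([1, 2, 3], dtype=np.int8)
names = np.array(["a", "b", "c"], dtype=object)
mixed = np.empty((3, 2), dtype=object)
mixed[:, 0] = small
mixed[:, 1] = names
```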

For converting DataFrames to NumPy...

  • Array and Struct types are problematic as they can have multiple dimensions and as such are not simply stackable. If the dimensions across columns do not match, we have to convert these to 1D object arrays before stacking them. However, if the dimensions do match, it could be appropriate to create a 3D+ array. For example, if I have a DataFrame with two Array columns with shape (2, 5) with a numeric data type, we could create a 3D array with shape (2, 2, 5). But maybe this is getting too complicated and we should restrict DataFrames to produce 2D ndarrays?

Basically, I'm trying to figure out if it's worth going down the rabbit hole of multidimensional arrays, or whether maybe we should keep it simple and have Series be 1D and DataFrames be 2D. That possibly involves changing the behavior for Array types.
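
The two DataFrame options discussed above can be compared in plain NumPy. The columns are hypothetical, already-converted Array columns, and the helper `to_object_column` is invented for this sketch.

```python
import numpy as np

# Two hypothetical Array columns: 2 rows each, width 5
col_a = np.arange(10, dtype=np.int64).reshape(2, 5)
col_b = np.arange(10, 20, dtype=np.int64).reshape(2, 5)

# Matching shapes: stack along a new column axis -> (rows, columns, width)
stacked = np.stack([col_a, col_b], axis=1)

# A column with a different width cannot join that 3D array
col_c = np.arange(6, dtype=np.int64).reshape(2, 3)

def to_object_column(arr):
    """Convert a 2D array into a 1D object array holding one sub-array per row."""
    out = np.empty(arr.shape[0], dtype=object)
    for i in range(arr.shape[0]):
        out[i] = arr[i]
    return out

# Fallback: every nested column becomes a 1D object column in a 2D result
fallback = np.empty((2, 2), dtype=object)
fallback[:, 0] = to_object_column(col_a)
fallback[:, 1] = to_object_column(col_c)
```

The 3D path keeps a numeric dtype but only works when every column has the same inner shape; the object fallback always works but gives up a uniform numeric layout.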

stinodego, May 20 '24 10:05

Regarding nested types, I have decided that for now it will work as follows:

  • Series Arrays will be multidimensional
  • Series Structs will be 2 dimensional
  • DataFrames will always be 2 dimensional. Nested Array/Struct series are cast to 1D object arrays.

Everything on the TODO list here has been done, with the exception of masked array support. I will create a separate issue for that one.

stinodego, May 22 '24 12:05

@stinodego an approach that makes a lot of sense to me would be to maintain a 1-D array for all Series and use a multi-element dtype. Example:

import numpy as np

a = np.array([65535, 256], dtype=np.uint16)

# construct dtype with two u8 elements
dtype = np.dtype([
    ("first", np.uint8),
    ("second", np.uint8),
])

b = a.view(dtype)
# array([(255, 255), (0, 1)], dtype=[('first', 'u1'), ('second', 'u1')])

In this case, b is 1-D, with each element a tuple of length two.

This could also work for structs with mixed types:

dtype = np.dtype([
    ("name", "U7"),  # up to 7 characters; "U1" would truncate each name to one character
    ("value", "uint8"),
])
a = np.array([("Ritchie", 100), ("Stijn", 100)], dtype=dtype)
a.shape
# (2,)

mcrumiller, May 22 '24 13:05

If that is the behavior you want, you can use DataFrame.to_numpy(structured=True).

This type of array is not fit for representing Array types though. Makes sense for Structs. But I don't want it to be the default, e.g. we still need a solution for when structured=False.

stinodego, May 22 '24 15:05