polars
NumPy interop to do list - `to_numpy`
We've made some improvements to our native `to_numpy` functionality recently. Making an issue to track what's still left:
- [ ] Handle nested types (in Rust):
  - [x] `Array`: Handle nulls: https://github.com/pola-rs/polars/issues/14268
  - [ ] `Struct`: (?) Unnest and use `DataFrame.to_numpy`?
  - [x] `List`: (?) Explode and use the offsets as input for `np.split`?
- [x] Handle non-nested types properly in the Rust bindings
- [x] Decimal/Time: https://github.com/pola-rs/polars/pull/14296
- [x] Make Datetime/Duration/Date directly return the correct NumPy type rather than creating a view afterwards in Python: https://github.com/pola-rs/polars/pull/14353
- [x] Make sure things work correctly for chunked Series: #14340
- [x] Make sure things work correctly for Datetimes with timezones: #14337
- [x] Add dedicated error type for zero copy violations: https://github.com/pola-rs/polars/pull/14350
- [x] Add option to `DataFrame.to_numpy` to raise on copy (like `Series.to_numpy(zero_copy_only=True)`) and to return `writeable` array.
- [ ] Add option to `Series.to_numpy` to allow structured output - relevant for handling Struct types.
- [x] Get rid of the `SeriesView` class in Python, handle views differently.
- [ ] Support output of masked arrays
- [ ] Remove `use_pyarrow` parameter and default to native implementation (after native functionality is done)
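The `List` item in the checklist above (explode, then use the offsets as input for `np.split`) can be sketched in plain NumPy. This is only an illustration of the idea, not the Polars implementation; the `values` and `offsets` arrays are made-up data standing in for an exploded List column and its offsets buffer.

```python
import numpy as np

# Illustrative data: flat (exploded) values plus Arrow-style offsets.
# The offsets describe three rows: [1, 2], [], [3, 4, 5, 6].
values = np.array([1, 2, 3, 4, 5, 6])
offsets = np.array([0, 2, 2, 6])

# np.split expects the interior split points, so drop the first and last offset.
parts = np.split(values, offsets[1:-1])

# Collect the pieces into a 1D object array, one sub-array per row.
result = np.empty(len(parts), dtype=object)
for i, part in enumerate(parts):
    result[i] = part

print(result.shape)  # one entry per List row
```

Filling the object array element by element avoids NumPy trying to broadcast equally sized sub-arrays into a 2D array.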
Now that our `to_numpy` can handle things properly and zero copy where possible, I'm not sure the NumPy array interface protocol (https://github.com/pola-rs/polars/pull/14214) is still useful.
On this subject, could you take another look at #7283 and decide whether it's potentially useful or a definite no-go?
@stinodego I was looking at the Struct type to-do's and wondering if you guys have seen that Numpy has a similar structure for it: https://numpy.org/doc/stable/user/basics.rec.html
Being able to cast polars struct columns to numpy structured arrays would be helpful in our current project :)
> @stinodego I was looking at the Struct type to-do's and wondering if you guys have seen that Numpy has a similar structure for it: numpy.org/doc/stable/user/basics.rec.html
> Being able to cast polars struct columns to numpy structured arrays would be helpful in our current project :)
We are aware! You can already do this from DataFrames by setting `structured=True`. So if you want to export a struct Series, you can do `s.struct.unnest().to_numpy(structured=True)`.
It would also be nice to be able to get np arrays with dtype `np.object` for

    pl.DataFrame({"A": [[1,2]], "B":1}, {"A": pl.Array(pl.Int64, 2), "B": pl.Int32}).to_numpy()

instead of this exception:

    ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 2 and the array at index 1 has size 1

in the same way as this works:

    pl.DataFrame({"A": "as", "B":1}).to_numpy()
    Out[21]: array([['as', 1]], dtype=object)
> It would also be nice to be able to get np arrays with dtype np.object for ...
Yes, that should be part of our design for converting nested data.
Regarding the design for nested types, some of my thoughts:
For converting Series to NumPy...
- Array types become an ND array with 2 or more dimensions. This is different from how PyArrow handles them (they create a 1D object array), but it allows us to do things zero-copy and it feels correct to respect the array dimensions here. However, it's slightly surprising that converting a Series to NumPy can result in an array with more than 1 dimension (and complicates things, see below).
- If we accept that Series can have more than 1 dimension, I think Struct types should also become an ND array with 2 (or more) dimensions. A Struct Series should be unnested and call `DataFrame.to_numpy`. So, for example, it may become a 2D object array if it contains an Int8 and a String field, or it may become a 2D float64 array if it contains an Int32 and a Float64 field.
- List types do not pose a problem as they become 1D object arrays.
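The Struct example above (an Int32 and a Float64 field ending up as a 2D float64 array) can be illustrated on the NumPy side alone. This sketch uses `numpy.lib.recfunctions.structured_to_unstructured` purely to show the dtype promotion; it is not how Polars performs the conversion, and the field names `a` and `b` are made up.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

# A 1D structured array with an int32 and a float64 field (illustrative data).
structured = np.array(
    [(1, 2.5), (3, 4.5)],
    dtype=[("a", np.int32), ("b", np.float64)],
)

# Flattening the fields into columns promotes them to a common dtype.
plain = structured_to_unstructured(structured)

print(structured.shape)  # 1D, one record per row
print(plain.shape)       # 2D, one column per field
print(plain.dtype)       # common dtype of int32 and float64
```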
For converting DataFrames to NumPy...
- Array and Struct types are problematic as they can have multiple dimensions and as such are not simply stackable. If the dimensions across columns do not match, we have to convert these to 1D object arrays before stacking them. However, if the dimensions do match, it could be appropriate to create a 3D+ array. For example, if I have a DataFrame with two Array columns with shape (2, 5) with a numeric data type, we could create a 3D array with shape (2, 2, 5). But maybe this is getting too complicated and we should restrict DataFrames to produce 2D ndarrays?
Basically, I'm trying to figure out whether it's worth going down the rabbit hole of multidimensional arrays, or whether we should keep it simple and have Series be 1D and DataFrames be 2D. The simple option possibly involves changing the behavior for Array types.
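To make the trade-off concrete, here is a small NumPy-only sketch of the stacking problem. The arrays stand in for Array columns that have already been converted to 2D ndarrays; the names `col_a`, `col_b` and `col_c` are hypothetical.

```python
import numpy as np

# Two Array columns with matching shape (2, 5): 2 rows, width 5.
col_a = np.arange(10).reshape(2, 5)
col_b = np.arange(10, 20).reshape(2, 5)

# Matching dimensions can be stacked along a new column axis.
stacked = np.stack([col_a, col_b], axis=1)
print(stacked.shape)  # (rows, columns, width)

# A column with a different width cannot be stacked this way.
col_c = np.arange(6).reshape(2, 3)
try:
    np.stack([col_a, col_c], axis=1)
    stack_failed = False
except ValueError as exc:
    stack_failed = True
    print("cannot stack:", exc)
```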
Regarding nested types, I have decided that for now it will work as follows:
- Series Arrays will be multidimensional
- Series Structs will be 2 dimensional
- DataFrames will always be 2 dimensional. Nested Array/Struct series are cast to 1D object arrays.
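The DataFrame rule in the last bullet can be sketched in plain NumPy. This only illustrates the idea of casting nested columns to 1D object arrays before stacking; it is not the Polars implementation, and `to_object_column` is a hypothetical helper.

```python
import numpy as np


def to_object_column(values):
    """Wrap each row of `values` as one element of a 1D object array."""
    out = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        out[i] = value
    return out


# An Array-like column (2 rows, width 2) next to a plain integer column.
array_col = np.array([[1, 2], [3, 4]])
int_col = np.array([10, 20])

# Each column becomes a 1D object array, so the result is always 2D.
table = np.column_stack([to_object_column(array_col), to_object_column(int_col)])

print(table.shape)
print(table.dtype)
print(table[0, 0])  # the nested row is kept as a single element
```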
Everything on the TODO list here has been done, with the exception of masked array support. I will create a separate issue for that one.
@stinodego an approach that makes a lot of sense to me would be to maintain a 1-D array for all Series and use a multi-element dtype. Example:

    import numpy as np

    a = np.array([65535, 256], dtype=np.uint16)
    # construct dtype with two u8 elements
    dtype = np.dtype([
        ("first", np.uint8),
        ("second", np.uint8),
    ])
    b = a.view(dtype)
    # array([(255, 255), (0, 1)], dtype=[('first', 'u1'), ('second', 'u1')])

In this case, `b` is 1-D, with each element a tuple of length two.
This could also work for structs with mixed types:

    dtype = np.dtype([
        ("name", "U7"),  # 7-character string; "U1" would truncate each name to one character
        ("value", "uint8"),
    ])
    a = np.array([("Ritchie", 100), ("Stijn", 100)], dtype=dtype)
    a.shape
    # (2,)
If that is the behavior you want, you can use `DataFrame.to_numpy(structured=True)`.
This type of array is not fit for representing Array types though. Makes sense for Structs. But I don't want it to be the default, e.g. we still need a solution for when `structured=False`.