Calling .to_numpy() on column type pl.Array transposes output
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import numpy as np
import polars as pl

df_tmp = pl.DataFrame({'A': np.arange(10).reshape(2, 5), 'B': np.arange(10).reshape(2, 5)}).with_columns(pl.all().cast(pl.Array(width=5, inner=pl.Int64)))
arr_tmp = df_tmp.select('A').to_numpy()
Issue description
When extracting a column of type pl.Array into a numpy array, the output is transposed. Intuitively, one might expect the pl.Array in each row of the dataframe to correspond to a row of the output numpy array; instead, each row of the dataframe corresponds to a column of the output numpy array.
I raised this issue because it seems natural to store numpy vectors as elements of an Array column in a polars dataframe (e.g. storing feature vectors in a 'feature_vecs' column for ML purposes). Currently, inserting a 2D numpy matrix into a dataframe as a column and then immediately extracting that column back into a numpy array results in inconsistent dimensions.
This is likely similar to issue #7961, but a new issue was raised to refer specifically to columns of type Array.
Please let me know if this is not actually a bug and I am just missing the rationale for the transposition.
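Until the behavior is settled, a simple workaround is to transpose the result, which is a zero-copy operation in NumPy. A minimal sketch (the Fortran-ordered (5, 2) array below just simulates the frame output shown in this report, since the real one comes from polars):

```python
import numpy as np

# Simulate the (5, 2) Fortran-ordered array that
# df_tmp.select('A').to_numpy() currently returns.
frame_out = np.asfortranarray(np.arange(10).reshape(5, 2))

# Transposing restores the intuitive row-per-dataframe-row
# shape (2, 5) without copying any data: only strides change.
fixed = frame_out.T
assert fixed.shape == (2, 5)
assert fixed.base is frame_out      # a view, not a copy
assert fixed.flags.c_contiguous     # transpose of F-order is C-order
```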
Expected behavior
The output of arr_tmp.shape above would be (2, 5), instead of the current output (5, 2).
Installed versions
0.18.15
What's the intended behavior? Series.to_numpy() versus DataFrame.to_numpy() gives:
>>> s = pl.int_ranges(pl.Series([0, 6]), pl.Series([5, 11]), eager=True).cast(pl.Array(5, pl.Int32))
>>> s
shape: (2,)
Series: 'int_range' [array[i32, 5]]
[
[0, 1, … 4]
[6, 7, … 10]
]
With a series, we get what we expect:
>>> s.to_numpy()
array([[ 0, 1, 2, 3, 4],
[ 6, 7, 8, 9, 10]])
But with a frame, we have:
>>> s.to_frame().to_numpy()
array([[ 0, 6],
[ 1, 7],
[ 2, 8],
[ 3, 9],
[ 4, 10]])
I would almost expect the latter to be a 3-D array, with the biggest dimension containing the series, and each series containing the arrays. But we have:
>>> s.to_frame().with_columns(pl.col("int_range").alias("b")).to_numpy() # add column
array([[ 0, 6, 0, 6],
[ 1, 7, 1, 7],
[ 2, 8, 2, 8],
[ 3, 9, 3, 9],
[ 4, 10, 4, 10]])
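For what it's worth, that hypothetical 3-D layout can be sketched in plain NumPy (the axis order here is my assumption, not anything polars produces):

```python
import numpy as np

# The two identical Array(5) columns from the two-column frame above.
col_a = np.array([[0, 1, 2, 3, 4], [6, 7, 8, 9, 10]])
col_b = col_a.copy()

# Hypothetical 3-D result: the outermost axis holds the series
# (columns), and each series holds its per-row inner arrays.
stacked = np.stack([col_a, col_b])
assert stacked.shape == (2, 2, 5)
assert (stacked[0] == col_a).all()
```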
What's odd is that the memory layout is different: with Series.to_numpy() we get C ordering, and with DataFrame.to_numpy() we get Fortran ordering:
>>> s.to_numpy().flags.f_contiguous
False # False means that it's C ordering
>>> s.to_frame().to_numpy().flags.f_contiguous
True # True means it's Fortran ordering
so the inner array is always contiguous in memory, but the dimensionality flips. Is this intended?
This has been addressed in the following way:
- Converting Array Series to NumPy results in a C-contiguous multidimensional array: https://github.com/pola-rs/polars/pull/16230
- When converting a DataFrame to NumPy, any Array columns are converted to a 1D object array where each entry is another (possibly multidimensional) ndarray: https://github.com/pola-rs/polars/pull/16386
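In NumPy terms, the object-array representation described for the DataFrame conversion looks like this (a sketch of the described shape, not actual polars output):

```python
import numpy as np

# A 1D object array with one entry per dataframe row, where each
# entry is itself an ndarray holding that row's Array values.
obj = np.empty(2, dtype=object)
obj[0] = np.array([0, 1, 2, 3, 4])
obj[1] = np.array([6, 7, 8, 9, 10])

assert obj.shape == (2,)          # one slot per row
assert obj.dtype == object
assert obj[1].shape == (5,)       # each slot is a full inner array
```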
Supporting multidimensional arrays in the DataFrame conversion is a bridge too far for now, but we could possibly do it in the future; see https://github.com/pola-rs/polars/issues/14334#issuecomment-2120141004
If that is an important use case for you, please open a new issue.