
Calling .to_numpy() on column type pl.Array transposes output

Open claysmyth opened this issue 1 year ago • 1 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import numpy as np
import polars as pl

df_tmp = pl.DataFrame({'A': np.arange(10).reshape(2, 5), 'B': np.arange(10).reshape(2, 5)}).with_columns(pl.all().cast(pl.Array(width=5, inner=pl.Int64)))
arr_tmp = df_tmp.select('A').to_numpy()

Issue description

When extracting a column of type pl.Array into a numpy array, the output is transposed. Intuitively, one might expect each pl.Array in each row of the dataframe to become a row of the output numpy array; instead, each row of the dataframe corresponds to a column of the output array.

I raised this issue because it seems natural to insert numpy vectors as elements of an Array-typed column in a polars dataframe (e.g. storing feature vectors in a 'feature_vecs' column for ML purposes). Currently, inserting a 2D numpy matrix into a dataframe as a column and then immediately extracting that column back into a numpy array results in inconsistent dimensions.
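For illustration, a minimal sketch of that round trip under the 0.18.x behavior described above (the column name is just an example; the shapes are the point):

import numpy as np
import polars as pl

feats = np.arange(10).reshape(2, 5)  # two rows, five features each
df = pl.DataFrame({'feature_vecs': feats}).with_columns(
    pl.col('feature_vecs').cast(pl.Array(width=5, inner=pl.Int64))
)

out = df.select('feature_vecs').to_numpy()
print(feats.shape)  # (2, 5) going in
print(out.shape)    # (5, 2) coming back out on 0.18.15, per this report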

This is likely similar to issue #7961, but I raised a new issue to refer specifically to columns of type Array.

Please let me know if this is not actually a bug and I am simply missing the rationale for the transposition.

Expected behavior

The output of arr_tmp.shape above would be (2, 5), instead of the current output of (5, 2).

Installed versions

0.18.15

claysmyth avatar Aug 24 '23 17:08 claysmyth

What's the intended behavior? Series.to_numpy() versus DataFrame.to_numpy() gives:

>>> s = pl.int_ranges(pl.Series([0, 6]), pl.Series([5, 11]), eager=True).cast(pl.Array(5, pl.Int32))
>>> s
shape: (2,)
Series: 'int_range' [array[i32, 5]]
[
        [0, 1, … 4]
        [6, 7, … 10]
]

With a series, we get what we expect:

>>> s.to_numpy()
array([[ 0,  1,  2,  3,  4],
       [ 6,  7,  8,  9, 10]])

But with a frame, we have:

>>> s.to_frame().to_numpy()
array([[ 0,  6],
       [ 1,  7],
       [ 2,  8],
       [ 3,  9],
       [ 4, 10]])

I would almost expect the latter to be a 3D array, with the outermost dimension holding the Series and each Series containing its arrays. But we have:

>>> s.to_frame().with_columns(pl.col("int_range").alias("b")).to_numpy() # add column
array([[ 0,  6,  0,  6],
       [ 1,  7,  1,  7],
       [ 2,  8,  2,  8],
       [ 3,  9,  3,  9],
       [ 4, 10,  4, 10]])

What's odd is that the memory layout differs: series.to_numpy() gives C ordering, while frame.to_numpy() gives Fortran ordering:

>>> s.to_numpy().flags.f_contiguous
False  # False means it's C-ordered
>>> s.to_frame().to_numpy().flags.f_contiguous
True  # True means it's Fortran-ordered

So the inner arrays are always contiguous in memory, but the orientation flips. Is this intended?
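To make that concrete, a small sketch continuing the example above (frame_arr and series_arr are just illustrative names): because the frame result is Fortran-ordered, its transpose is a C-contiguous view with the same shape and contents as the series result.

>>> frame_arr = s.to_frame().to_numpy()
>>> series_arr = s.to_numpy()
>>> frame_arr.T.flags.c_contiguous  # the transpose of a Fortran-ordered 2D array is C-ordered
True
>>> (frame_arr.T == series_arr).all()  # same values, just flipped orientation
True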

mcrumiller avatar Aug 24 '23 20:08 mcrumiller

This has been addressed in the following way:

  • Converting Array Series to NumPy results in a C-contiguous multidimensional array: https://github.com/pola-rs/polars/pull/16230
  • When converting a DataFrame to NumPy, any Array columns are converted to a 1D object array where each entry is another (possibly multidimensional) ndarray: https://github.com/pola-rs/polars/pull/16386

Supporting multidimensional arrays in the DataFrame conversion is a bridge too far for now, but we could possibly do it in the future; see https://github.com/pola-rs/polars/issues/14334#issuecomment-2120141004

If that is an important use case for you, please open a new issue.
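As a rough sketch of the resolved behavior on a recent Polars version (the Series values and the pl.Array(pl.Int64, 3) constructor call are illustrative, and exact output shapes may differ between releases):

import polars as pl

s = pl.Series('a', [[0, 1, 2], [3, 4, 5]], dtype=pl.Array(pl.Int64, 3))

# Series conversion: a C-contiguous multidimensional array (here 2 rows of length 3).
arr = s.to_numpy()
print(arr.shape, arr.flags.c_contiguous)

# DataFrame conversion: the Array column becomes an object array whose entries
# are themselves ndarrays, rather than adding extra dimensions to the result.
obj = s.to_frame().to_numpy()
print(obj.dtype, type(obj.ravel()[0]))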

stinodego avatar May 22 '24 13:05 stinodego