
Error constructing DataFrame from dict scalar

Open liheinz opened this issue 1 year ago • 5 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example


import polars as pl
print(pl.__version__)
pl.DataFrame({"test": 0, "nest": {"test": 0, "test2": 0}})

Log output

Traceback (most recent call last):
  File "/Users/lheinz/work/jupyter/pl.py", line 4, in <module>
    pl.DataFrame({"test": 0, "nest": {"test": 0, "test2": 0}})
  File "/Users/lheinz/work/jupyter/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py", line 373, in __init__
    self._df = dict_to_pydf(
               ^^^^^^^^^^^^^
  File "/Users/lheinz/work/jupyter/.venv/lib/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 142, in dict_to_pydf
    pydf = PyDataFrame(data_series)
           ^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ShapeError: could not create a new DataFrame: series "test" has length 2 while series "nest" has length 1

Issue description

When constructing a DataFrame from nested data, Polars throws an error: ShapeError: could not create a new DataFrame: series "test" has length 2 while series "nest" has length 1

The error also reports the wrong dimensions: "test" has length 1, not 2.

Expected behavior

This should produce a DataFrame with a struct column.

~~Was working in older versions~~

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             macOS-14.4-arm64-arm-64bit
Python:               3.12.1 (main, Jan  7 2024, 23:31:12) [Clang 16.0.3 ]

liheinz avatar Mar 20 '24 15:03 liheinz

@stinodego: linked to the constructor refactor?

alexander-beedie avatar Mar 20 '24 20:03 alexander-beedie

Probably. I'm not sure this ever used to work though. The problem seems to be in arrlen where it infers length 2 for the struct 'scalar'.

stinodego avatar Mar 20 '24 22:03 stinodego
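To illustrate the point about length inference, here is a minimal sketch in plain Python. It is not the Polars source: `naive_len` and `stricter_len` are hypothetical helpers written only for this example. It shows how calling `len()` on a dict value reports 2 (its number of keys), which would be consistent with the scalar "test" being broadcast to length 2 while the dict itself becomes a single struct row.

```python
from typing import Any, Optional


def naive_len(value: Any) -> Optional[int]:
    """Hypothetical length inference: anything that supports len() is a column."""
    if isinstance(value, str):
        return None  # strings are treated as scalars
    try:
        return len(value)
    except TypeError:
        return None  # plain scalars such as 0 have no length


def stricter_len(value: Any) -> Optional[int]:
    """Same idea, but a dict is treated as one struct value, not a column."""
    if isinstance(value, (str, dict)):
        return None
    try:
        return len(value)
    except TypeError:
        return None


data = {"test": 0, "nest": {"test": 0, "test2": 0}}
naive = {name: naive_len(value) for name, value in data.items()}
strict = {name: stricter_len(value) for name, value in data.items()}
print(naive)   # {'test': None, 'nest': 2}
print(strict)  # {'test': None, 'nest': None}
```

With the stricter check, neither value claims a length, so both could be treated as single-row scalars.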

> Probably. I'm not sure this ever used to work though. The problem seems to be in arrlen where it infers length 2 for the struct 'scalar'.

@stinodego: Ahh, I think you're right - I assumed this was a regression, but installing some earlier versions of Polars to check further makes me think it is not ;)

@linusheinz: Did this ever work for you in a different version of Polars? (And, if so, which version?)

alexander-beedie avatar Mar 21 '24 06:03 alexander-beedie

I just checked back to 20.9, and it did not work. It seems to be present in many versions.

liheinz avatar Mar 21 '24 07:03 liheinz

> I just checked back to 20.9, and it did not work. It seems to be present in many versions.

Note that the easiest way to address this is to provide sequence data in the dict, which is the expected form (as each value in the dict represents a column).

pl.DataFrame({"test": [0], "nest": [{"test": 0, "test2": 0}]})
# shape: (1, 2)
# ┌──────┬───────────┐
# │ test ┆ nest      │
# │ ---  ┆ ---       │
# │ i64  ┆ struct[2] │
# ╞══════╪═══════════╡
# │ 0    ┆ {0,0}     │
# └──────┴───────────┘

Passing non-sequence data here is a little hit and miss, as we typically expand it in a fast-path for quick generation of large frames with repetitive data, like so:

pl.DataFrame({"n": range(1000), "y": -1})
# shape: (1_000, 2)
# ┌─────┬─────┐
# │ n   ┆ y   │
# │ --- ┆ --- │
# │ i64 ┆ i32 │
# ╞═════╪═════╡
# │ 0   ┆ -1  │
# │ 1   ┆ -1  │
# │ …   ┆ …   │
# │ 998 ┆ -1  │
# │ 999 ┆ -1  │
# └─────┴─────┘

alexander-beedie avatar Mar 21 '24 20:03 alexander-beedie
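For single-row dicts like the one in the report, the workaround of providing sequence data can be applied mechanically. The sketch below is plain Python and assumes every non-sequence value (dicts included) should become a one-element column. `wrap_scalars` is a hypothetical helper name, not part of Polars.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def wrap_scalars(data: Mapping[str, Any]) -> dict[str, Any]:
    """Wrap every non-sequence value (including dicts) in a one-element list."""
    wrapped = {}
    for name, value in data.items():
        # str and bytes are sequences in Python but act as scalars here.
        is_column = isinstance(value, Sequence) and not isinstance(value, (str, bytes))
        wrapped[name] = value if is_column else [value]
    return wrapped


row = {"test": 0, "nest": {"test": 0, "test2": 0}}
columns = wrap_scalars(row)
print(columns)  # {'test': [0], 'nest': [{'test': 0, 'test2': 0}]}
```

The resulting dict has the same shape as the working call shown above and can be passed to `pl.DataFrame(columns)`. Note that wrapping every scalar gives up the broadcasting fast-path, so this only suits inputs where all columns are meant to have a single row.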