map_rows with dict return value: `BindingsError: "Could not determine output type"`
Checks

- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 20]})

def func(row):
    a, b = row
    return dict(c=a + b, d=a - b)

df.map_rows(func, return_dtype=pl.Struct(dict(c=pl.Int32, d=pl.Int32)))
This results in BindingsError: "Could not determine output type". I get this error whether or not I specify return_dtype.
Log output
No response
Issue description
I wasn't able to figure out how to directly set the schema for a mapped function returning multiple values (e.g., a dict which gets turned into a Polars struct), except by manually casting the resulting columns after map_rows.
Possibly related: https://github.com/pola-rs/polars/issues/10398
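For reference, the manual-cast workaround looks roughly like this minimal sketch (column_0 and column_1 are the default names map_rows gives to tuple elements; the Int32 targets are only illustrative):

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 20]})

def func(row):
    a, b = row
    return (a + b, a - b)

# Let map_rows infer the dtypes, then cast the auto-named output columns.
out = df.map_rows(func).with_columns(
    pl.col("column_0").cast(pl.Int32),
    pl.col("column_1").cast(pl.Int32),
)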
Expected behavior
n/a
Installed versions
--------Version info---------
Polars: 0.19.12
Index type: UInt64
Platform: Linux-3.10.0-1160.99.1.el7.x86_64-x86_64-with-glibc2.17
Python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
----Optional dependencies----
adbc_driver_sqlite: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: 2023.6.0
gevent: <not installed>
matplotlib: 3.8.0
numpy: 1.24.4
openpyxl: <not installed>
pandas: 2.1.1
pyarrow: 13.0.0
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 2.0.21
xlsx2csv: <not installed>
xlsxwriter: <not installed>
In case it's useful information: it does work as expected with map_batches.
def func(row):
    # with map_batches, these are whole Series, not scalar row values
    a, b = row
    return dict(c=a + b, d=a - b)

df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 20]})
df.select(pl.map_batches(["a", "b"], func))
# shape: (1, 1)
# ┌───────────────────────────────┐
# │ a │
# │ --- │
# │ struct[2] │
# ╞═══════════════════════════════╡
# │ {[11, 22, 23],[-9, -18, -17]} │
# └───────────────────────────────┘
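For comparison, the same values can also be computed with native expressions and no Python UDF at all (a sketch; the field and column names here are my own choice):

df.select(
    pl.struct(
        (pl.col("a") + pl.col("b")).alias("c"),
        (pl.col("a") - pl.col("b")).alias("d"),
    ).alias("result")
)

Unlike the map_batches version above, this yields one struct per row rather than a single struct of lists.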
Stumbled across the same issue using complex data types. It seems to work when the number of rows isn't reduced, i.e. when a tuple is returned. For example:
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [-1, 5, 8]})
complex_df = df.map_rows(lambda t: t[0] + 1j * t[1], return_dtype=pl.Object)

fails with the described error. However,

complex_df = df.map_rows(lambda t: (t[0] + 1j * t[1],), return_dtype=pl.Object)

seems to work.
I am not sure whether it is related, or even undesired, but it also seems that type inference trumps an explicitly stated type?

t = df.map_rows(lambda t: (t[0] * 2 + t[1]), return_dtype=pl.Object)

has a resulting column of type i64, not Object as specified.
@nzqo You never want to deal with Objects. If you see you've created an Object, go back a step and try to figure out how to get the data in with a native dtype.
@shenker Generally speaking, map_rows expects the return value of your function to be a tuple, where each element of the tuple ends up as a column.
To (mostly) fix it, all you need to do is add a trailing comma, like this:
def func(row):
    a, b = row
    return dict(c=a + b, d=a - b),  # trailing comma: the dict is now a one-element tuple

Then you can do:

df.map_rows(func)
shape: (3, 1)
┌───────────┐
│ column_0 │
│ --- │
│ struct[2] │
╞═══════════╡
│ {11,-9} │
│ {22,-18} │
│ {23,-17} │
└───────────┘
It will still ignore your return_dtype unless it's one of these non-nested dtypes: https://github.com/pola-rs/polars/blob/62739955e43b36d8c60ac8be8953fa3eff1fd26b/py-polars/src/dataframe.rs#L1365-L1376
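Until nested dtypes are supported there, one workaround sketch (assuming the trailing-comma func above) is to unnest the inferred struct, cast its fields, and re-pack them:

out = (
    df.map_rows(func)
    .unnest("column_0")  # struct fields c and d become top-level columns
    .select(
        pl.struct(
            pl.col("c").cast(pl.Int32),
            pl.col("d").cast(pl.Int32),
        ).alias("result")
    )
)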
There also doesn't appear to be a way to set multiple dtypes if you have multiple elements in your return tuple. For instance, you could do:

def func(row):
    a, b = row
    return (a + b, a - b)

df.map_rows(func)

but return_dtype isn't set up to accept more than a single value, so you have to let it auto-infer the dtypes.
I think this is more of a feature request, to have nested dtypes supported and to have return_dtype accept multiple arguments, than a bug per se. I suppose the error message should be improved, so that's still a bug.
> @nzqo You never want to deal with Objects. If you see you've created an Object, go back a step and try to figure out how to get the data in with a native dtype.
I understand, but as long as Object is a valid dtype as per the documentation, I would expect that to work, putting aside the question of whether it's the best way to do things.
Just with regards to map_rows itself: are there any particular use cases for it? It seems like it's essentially doing func(row) for row in df.iter_rows()?
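i.e., something functionally like this rough sketch (the real map_rows runs the loop natively and infers the output schema, so this is only an approximation):

rows = [func(row) for row in df.iter_rows()]
approx = pl.DataFrame(rows, orient="row")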