
map_rows with dict return value: `BindingsError: "Could not determine output type"`

Open shenker opened this issue 1 year ago • 5 comments

Checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 20]})

def func(row):
    a, b = row
    return dict(c=a + b, d=a - b)

df.map_rows(func, return_dtype=pl.Struct(dict(c=pl.Int32, d=pl.Int32)))

results in BindingsError: "Could not determine output type". I get this error whether or not I specify return_dtype.

Log output

No response

Issue description

I wasn't able to figure out how to directly set the schema for a mapped function returning multiple values (e.g., a dict that gets turned into a Polars struct), except by manually casting the resulting columns after map_rows.

Possibly related: https://github.com/pola-rs/polars/issues/10398

Expected behavior

n/a

Installed versions

--------Version info---------
Polars:              0.19.12
Index type:          UInt64
Platform:            Linux-3.10.0-1160.99.1.el7.x86_64-x86_64-with-glibc2.17
Python:              3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.6.0
gevent:              <not installed>
matplotlib:          3.8.0
numpy:               1.24.4
openpyxl:            <not installed>
pandas:              2.1.1
pyarrow:             13.0.0
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          2.0.21
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

shenker avatar Nov 06 '23 16:11 shenker

In case it's useful information: it does work as expected with .map_batches

import polars as pl

def func(row):
    a, b = row
    return dict(c=a + b, d=a - b)

df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 20]})

df.select(pl.map_batches(["a", "b"], func))

# shape: (1, 1)
# ┌───────────────────────────────┐
# │ a                             │
# │ ---                           │
# │ struct[2]                     │
# ╞═══════════════════════════════╡
# │ {[11, 22, 23],[-9, -18, -17]} │
# └───────────────────────────────┘

cmdlineluser avatar Nov 06 '23 17:11 cmdlineluser

Stumbled across the same issue using complex data types. It seems to work when the number of rows isn't reduced, i.e. when a tuple is returned. For example

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [-1, 5, 8]})
complex_df = df.map_rows(lambda t: t[0] + 1j * t[1], return_dtype=pl.Object)

Fails with the described issue. However

complex_df = df.map_rows(lambda t: (t[0] + 1j * t[1],), return_dtype=pl.Object)

seems to work.

I am not sure whether it is related or even undesired, but it also seems that type inference trumps the explicitly stated type?

t = df.map_rows(lambda t: (t[0] * 2 + t[1]), return_dtype=pl.Object)

has a resulting column of type i64, not Object as specified.

nzqo avatar Jan 24 '24 10:01 nzqo

@nzqo You never want to deal with Objects. If you see you've created an Object go back a step and try to figure out how to get the data in with a native dtype.

@shenker Generally speaking, map_rows expects the return value of your function to be a tuple, where each element of the tuple ends up as a column.

To (mostly) fix it, all you need to do is add a trailing comma so the function returns a one-element tuple:

def func(row):
    a, b = row
    return dict(c=a + b, d=a - b),  # trailing comma: a one-element tuple

Then you can do

df.map_rows(func)
shape: (3, 1)
┌───────────┐
│ column_0  │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {11,-9}   │
│ {22,-18}  │
│ {23,-17}  │
└───────────┘

It will still ignore your return_dtype unless it's one of these non-nested dtypes: https://github.com/pola-rs/polars/blob/62739955e43b36d8c60ac8be8953fa3eff1fd26b/py-polars/src/dataframe.rs#L1365-L1376

There also doesn't appear to be a way to set multiple dtypes when your return tuple has multiple elements. For instance, you could do:

def func(row):
    a, b = row
    return (a + b, a - b)

df.map_rows(func)

but return_dtype isn't set up to accept more than a single value, so you have to let it auto-infer the dtypes.

I think this is more of a feature request (support nested dtypes, and let return_dtype accept multiple values) than a bug per se. That said, the error message should be improved, so that part is still a bug.

deanm0000 avatar Jan 24 '24 18:01 deanm0000

@nzqo You never want to deal with Objects. If you see you've created an Object go back a step and try to figure out how to get the data in with a native dtype.

I understand, but as long as Object is a valid dtype per the documentation, I would expect that to work, putting aside the question of whether it's the best way to do things.

nzqo avatar Jan 24 '24 19:01 nzqo

Just with regard to map_rows itself - are there any particular use cases for it?

It seems like it's essentially doing func(row) for row in df.iter_rows()?

cmdlineluser avatar Jan 24 '24 19:01 cmdlineluser