Binary data type support for `pl.col.apply`

Open beckadd opened this issue 2 years ago • 2 comments

Problem description

I encoded string data into binary data by applying a Python UDF across a column, but when I tried to decode those bytes back into a string, I received an error indicating that the Binary data type is not supported. Could we expose Binary data to Python UDFs?

beckadd avatar Jan 24 '23 03:01 beckadd

Can you provide a simple example showing your UDF and the error?

alexander-beedie avatar Jan 24 '23 05:01 alexander-beedie

Yes - see below:

from cryptography.fernet import Fernet
import polars as pl

# `key` was not defined in the original snippet; any valid Fernet key works here
key = Fernet.generate_key()

messages = pl.DataFrame({"message": ["Hello World"]})

# this works successfully, converting the `message` column from string to binary

encrypted = messages.select(
    pl.col("message").apply(lambda msg: Fernet(key).encrypt(msg.encode()))
)


# This fails with the error below
decrypted = encrypted.select(
    pl.col("message").apply(lambda msg: Fernet(key).decrypt(msg))
)

the error:

thread '<unnamed>' panicked at dtype Binary not supported
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
File /opt/homebrew/anaconda3/lib/python3.9/site-packages/polars/internals/expr/expr.py:3150, in Expr.apply.<locals>.wrap_f(x)
   3149 def wrap_f(x: pli.Series) -> pli.Series:  # pragma: no cover
-> 3150     return x.apply(f, return_dtype=return_dtype, skip_nulls=skip_nulls)

File /opt/homebrew/anaconda3/lib/python3.9/site-packages/polars/internals/series/series.py:3529, in Series.apply(self, func, return_dtype, skip_nulls)
   3527 else:
   3528     pl_return_dtype = py_type_to_dtype(return_dtype)
-> 3529 return wrap_s(self._s.apply_lambda(func, pl_return_dtype, skip_nulls))

PanicException: dtype Binary not supported

PanicException: Unwrapped panic from Python code

Please let me know if you need to know anything else - this was run from a Jupyter notebook.

beckadd avatar Jan 24 '23 13:01 beckadd

This is still an issue. You can work around it by applying the function to each row in a for loop, but I don't think this is optimal regarding performance:

decrypted_lst = []
for row in range(encrypted.height):
    decrypted_lst.append(Fernet(key).decrypt(encrypted["message"][row]))

decrypted = encrypted.drop("message").hstack([pl.Series("message", decrypted_lst)])

wKollendorf avatar Feb 21 '23 14:02 wKollendorf

Ran into this issue as well. Here's a small reproducible test case.

pl.DataFrame(
    {"bin": [b"\x11" * 12, b"\x22" * 12, b"\xaa" * 12]}
).select(
    pl.col("bin").apply(bytes.hex)
)

The `dtype Binary not supported` panic is thrown from https://github.com/pola-rs/polars/blob/e703ac2ef11547fa701ffa2755f8dbbba2e8ba8b/py-polars/src/utils.rs#L52. I tried adding another case to handle `DataType::Binary`, but it doesn't seem to be as simple as that.

josh avatar Feb 21 '23 23:02 josh