polars
Binary data type support for `pl.col.apply`
Problem description
I encoded string data into binary data by applying a Python UDF across a column, but when I tried to decode those bytes back into a string, I got an error saying that the Binary data type is not supported. Could we expose Binary data to Python UDFs?
Can you provide a simple example showing your UDF and the error?
Yes - see below:
from cryptography.fernet import Fernet
import polars as pl

key = Fernet.generate_key()  # any valid Fernet key
messages = pl.DataFrame({"message": ["Hello World"]})
# This works, converting the `message` column from string to binary
encrypted = messages.select(
    pl.col("message").apply(lambda msg: Fernet(key).encrypt(msg.encode()))
)
# This fails with the error below
decrypted = encrypted.select(
    pl.col("message").apply(lambda msg: Fernet(key).decrypt(msg))
)
The error:
thread '<unnamed>' panicked at dtype Binary not supported
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
File /opt/homebrew/anaconda3/lib/python3.9/site-packages/polars/internals/expr/expr.py:3150, in Expr.apply.<locals>.wrap_f(x)
3149 def wrap_f(x: pli.Series) -> pli.Series: # pragma: no cover
-> 3150 return x.apply(f, return_dtype=return_dtype, skip_nulls=skip_nulls)
File /opt/homebrew/anaconda3/lib/python3.9/site-packages/polars/internals/series/series.py:3529, in Series.apply(self, func, return_dtype, skip_nulls)
3527 else:
3528 pl_return_dtype = py_type_to_dtype(return_dtype)
-> 3529 return wrap_s(self._s.apply_lambda(func, pl_return_dtype, skip_nulls))
PanicException: dtype Binary not supported
PanicException: Unwrapped panic from Python code
Please let me know if you need to know anything else - this was run from a Jupyter notebook.
This is still an issue. You can work around it by applying the function to each row in a for loop, but I don't think this is optimal regarding performance:
decrypted_lst = []
for row in range(encrypted.height):
    decrypted_lst.append(Fernet(key).decrypt(encrypted['message'][row]))
decrypted = encrypted.drop('message').hstack([pl.Series('message', decrypted_lst)])
Ran into this issue as well. Here's a small reproducible test case.
pl.DataFrame(
{"bin": [b"\x11" * 12, b"\x22" * 12, b"\xaa" * 12]}
).select(
pl.col("bin").apply(bytes.hex)
)
The `dtype Binary not supported` panic is thrown from https://github.com/pola-rs/polars/blob/e703ac2ef11547fa701ffa2755f8dbbba2e8ba8b/py-polars/src/utils.rs#L52. I tried adding another case to handle `DataType::Binary`, but it doesn't seem to be as simple as that.
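Until `apply` accepts Binary columns, the loop workaround mentioned earlier in this thread can be wrapped in a small helper. The following is only a sketch: `map_binary` and `demo` are hypothetical names, not part of polars, and it assumes `Series.to_list()` returns plain Python `bytes` for a Binary column in the affected version (row indexing on such a column was reported to work above). The mapping runs entirely in Python, so it is no faster than the explicit loop.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def map_binary(
    values: Iterable[Optional[bytes]], func: Callable[[bytes], T]
) -> List[Optional[T]]:
    """Apply `func` to each non-null bytes value in plain Python, keeping nulls."""
    return [None if v is None else func(v) for v in values]


def demo():
    # Hypothetical usage against the test case from this thread.
    import polars as pl  # imported here so the helper itself has no polars dependency

    df = pl.DataFrame({"bin": [b"\x11" * 12, b"\x22" * 12, b"\xaa" * 12]})
    # Pull the column out as a Python list instead of calling `.apply` on it.
    hexed = map_binary(df["bin"].to_list(), bytes.hex)
    # Attach the result as a new column, as in the hstack workaround above.
    return df.hstack([pl.Series("hex", hexed)])
```

The same helper would cover the Fernet case by passing `Fernet(key).decrypt` as `func`, at the cost of materialising the whole column as a Python list.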