Figure out better autocleaning comparison

Open paddymul opened this issue 1 year ago • 0 comments

Checks

[X] I have checked that this enhancement has not already been requested

How would you categorize this request? You can select multiple if not sure.

Auto Cleaning, Performance

Enhancement Description

Polars makes some autocleaning functionality very difficult, particularly comparing original values to modified values across different dtypes. This makes it much harder to color cells and add tooltips to the resulting dataframe based on modifications.

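import polars as pl
df = pl.DataFrame({'a_raw': ["not_parseable", "30"], 'a_cleaned': [None, 30]})
# compare the raw string column against the cleaned integer column
df.select(pl.col("a_raw").eq(pl.col("a_cleaned")))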

These shouldn't compare equal because they're different types, but you can't do this either:

df = pl.DataFrame({'a_raw': pl.Series(["not_parseable", 30], dtype=pl.Object), 'a_cleaned': [None, 30]})
df.select(pl.col("a_raw").eq(pl.col("a_cleaned")))

You can't even do this:

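df = pl.DataFrame({'a_raw': pl.Series(["not_parseable", 30], dtype=pl.Object), 'a_cleaned': [None, 30]})
# map_elements on a struct passes each row as a dict keyed by field name
df.select(pl.struct(["a_raw", "a_cleaned"]).map_elements(lambda x: x["a_raw"] == x["a_cleaned"]))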

This fails because you can't put an Object column into a struct.

Pseudo Code Implementation

This might require writing some custom expressions, particularly a version of cast that returns a struct containing the original value.

Prior Art

N/A

Feb 14 '24 15:02 paddymul