Add a `.parse_str()` method that acts like `read_csv()` on DataFrames
Problem Description
Casting from Utf8 to Boolean currently raises ArrowErrorException: NotYetImplemented("Casting from LargeUtf8 to Boolean not supported").
Seeing as this might already be implemented as part of the csv reader, it would be nice to be able to use it on Utf8-columns in DataFrames as well.
Isn't that a string search?
import polars as pl

pl.DataFrame({
    "a": ["True", "0", "False"]
}).select([
    pl.col("a").str.contains("True|true|1")
])
Yes, that would be one way to do it. Another would be pl.when().then().
My request refers more to the out-of-the-box capabilities of cast when it comes to standard scenarios like this. Datetime strings are a similar case.
Concrete use case: When I have a dict of dtypes (schema), sometimes I just want to do df.with_columns([pl.col(c).cast(t) for c, t in schema.items()]), without having to write extra code for Utf8 > Bool and Utf8 > Datetime columns.
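For illustration, a minimal reproduction of that loop (the schema dict and column name are made up; the Utf8 -> Boolean cast is exactly the part that raises the error quoted at the top):

import polars as pl

# Hypothetical schema for a frame that was ingested as strings
schema = {"flag": pl.Boolean}

df = pl.DataFrame({"flag": ["true", "false", ""]})

# Generic cast loop from above; currently fails on the Utf8 -> Boolean column
df.with_columns([pl.col(c).cast(t) for c, t in schema.items()])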
I don't think a cast should be the way to solve that problem. I think you should be explicit. Such a cast also can't borrow its behavior from any of the languages we target:
>>> bool("False")
True
>>> bool("false")
True
And in Rust such a cast is not even allowed; you are required to do a string comparison. I think the latter is best.
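For instance, an explicit comparison in polars could look like this (a sketch, not necessarily the exact form Ritchie had in mind):

import polars as pl

df = pl.DataFrame({"a": ["True", "0", "False"]})

# Explicit: only a (case-normalized) "true" maps to True, everything else to False
df.select((pl.col("a").str.to_lowercase() == "true").alias("a"))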
I get your point. It would still be extremely useful to have a way to programmatically access the "smart" string parsing functionality of the csv reader - even if we don't strictly refer to it as a "cast" anymore.
def cast_str_to_bool(name: str) -> pl.Expr:
    return pl.col(name).str.contains(r'^"?true"?$/gi')
@DrMaphuse: could you be more specific about why it is extremely useful? Do the alternatives, such as the one posted above by Ritchie, not work for your use case? Or is performance blocking you, and are you hoping the parser of the csv reader can alleviate that? That would help us better understand your use case.
I use Polars for automated data pipelines and often need to convert data ingested in string format into known schemata. If I were to round-trip the data to CSV, it would take 2 lines of code: df.write_csv('file.csv') and df = pl.read_csv('file.csv', schema=known_schema). If I want to achieve the same without the round trip to CSV, I need to write custom functions for bool, datetime and other more sophisticated string conversions. Sure, I can pack them all together into a single function and get a 2-liner again, but it seems less than ideal to me to try and replicate something that already exists somewhere in the csv reader.
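To be concrete, the "pack it all into a single function" variant could even avoid touching disk, something like the sketch below (it assumes write_csv() with no file argument returns the CSV as a string, as in recent polars versions, and mirrors the schema keyword from the snippet above; both may differ between versions):

from io import StringIO

import polars as pl

def cast_via_csv(df: pl.DataFrame, known_schema: dict) -> pl.DataFrame:
    # Serialize to an in-memory CSV and re-read it with the known schema,
    # so the csv reader's string parsing is applied to every column.
    csv_str = df.write_csv()
    return pl.read_csv(StringIO(csv_str), schema=known_schema)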
Ok, I understand that using the csv parser in its current form is not ideal in your use case. Coming back to the question in my previous post, what is wrong with the alternative Ritchie posted for your use case? It is a straightforward, flexible and performant solution in my opinion.
Utility of a csv-equivalent .parse_str() method
I am going to try to add a little more detail to my suggestion here after thinking about it.
What I am suggesting (let's call it the .parse_str() method) would yield several advantages (a rough sketch of possible usage follows the list):
- it would be intuitive as an equivalent to the CSV parser
- it would reduce the necessity for custom functions on the user's end (less code, less work)
- it would provide standardization of identical use cases
- error potential through individual, hand-rolled solutions would be reduced (this one is big - string parsing is a very messy business that benefits massively from standardization)
- duplicated functionality would be avoided (assuming there is a way to pass string columns to the parser efficiently)
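To make the idea concrete, usage could look roughly like this. This is purely hypothetical: no parse_str method exists in polars today, and whether it would take a schema argument is just my guess.

import polars as pl

df = pl.DataFrame({
    "flag": ["true", "false", ""],
    "ts": ["2021-01-01", "", "2021-03-01"],
})

# Hypothetical API: apply the csv reader's string parsing to the Utf8 columns
# according to a known schema, without a round trip through an actual CSV file.
parsed = df.parse_str(schema={"flag": pl.Boolean, "ts": pl.Datetime})

Today, getting the same result requires a hand-rolled expression per target dtype, as the examples below show.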
I think Ritchie's second suggestion showcases what I mean to an extent, because it doesn't work, and it doesn't cover Nulls:
def cast_str_to_bool(name: str):
    return pl.col(name).str.contains(r'^"?true"?$/gi')
str_df = pl.DataFrame(["true", "false", ""] * 1_000_000)
print(str_df.select(cast_str_to_bool('column_0')).slice(0, 3))
shape: (3, 1)
┌──────────┐
│ column_0 │
│ --- │
│ bool │
╞══════════╡
│ false │
├╌╌╌╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌╌╌╌┤
│ false │
└──────────┘
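My guess at why it doesn't work: the trailing /gi is JavaScript-style regex syntax, so the pattern never matches and every row comes out false. With an inline case-insensitive flag the same idea does match (sketch below), although it still maps empty strings to false rather than Null:

import polars as pl

str_df = pl.DataFrame(["true", "false", ""] * 1_000_000)

def cast_str_to_bool(name: str) -> pl.Expr:
    # Inline (?i) flag instead of the JavaScript-style /gi suffix
    return pl.col(name).str.contains(r'(?i)^"?true"?$')

print(str_df.select(cast_str_to_bool("column_0")).slice(0, 3))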
Performance comparison
Just for the heck of it, I took the liberty of abusing the csv parser to parse a boolean string column in-memory, and it seems to be faster than the best hand-rolled solution I could come up with that also covers Nulls (although both of those things can surely be done much better):
from io import StringIO
str_df = pl.DataFrame(["true", "false", ""] * 1_000_000)
%%timeit
pure_str = str_df.get_column("column_0").str.concat("\n")[0]
str_obj = StringIO(pure_str)
parsed_csv = pl.read_csv(str_obj, has_header=False)
142 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
parsed_pl = str_df.select(
    pl.when(pl.col("column_0").is_in(["true", "True", "TRUE"]))
    .then(True)
    .when(pl.col("column_0").is_in(["false", "False", "FALSE"]))
    .then(False)
    .otherwise(None)
    .alias('column_1')
)
183 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# check for equality (parsed_csv is missing the last row)
parsed_csv.frame_equal(parsed_pl[:-1])
True
Side notes
Null at the end of a CSV
As an aside, I've noticed that there is still an issue with Nulls in the csv parser where it drops the last row if it is empty / Null:
pure_str[-10:]
'rue\nfalse\n'
parsed_csv[-1:]
shape: (1, 1)
┌──────────┐
│ column_1 │
│ --- │
│ bool │
╞══════════╡
│ false │
└──────────┘
Discrepancy in column indexing between pl.read_csv() and pl.DataFrame()
I've noticed that the DataFrame constructor creates 0-indexed columns starting with 'column_0' by default, whereas the csv parser starts at 'column_1'. I don't know if this is by design - at first glance, it seems inconsistent.