
Add a `.parse_str()` method that acts like `read_csv()` on DataFrames

Open · DrMaphuse opened this issue Sep 08 '22 · 9 comments

Problem Description

Casting from Utf8 to Boolean currently raises ArrowErrorException: NotYetImplemented("Casting from LargeUtf8 to Boolean not supported").

Seeing as this might already be implemented as part of the csv reader, it would be nice to be able to use it on Utf8 columns in DataFrames as well.
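A minimal repro of the failing cast (the column name here is illustrative):

import polars as pl

df = pl.DataFrame({"a": ["true", "false"]})

# raises ArrowErrorException:
#   NotYetImplemented("Casting from LargeUtf8 to Boolean not supported")
df.select(pl.col("a").cast(pl.Boolean))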

DrMaphuse commented Sep 08 '22

Isn't that a string search?

pl.DataFrame({
    "a": ["True", "0", "False"]
}).select([
    pl.col("a").str.contains("True|true|1")
])

ritchie46 commented Sep 08 '22

Yes, that would be one way to do it. Another would be pl.when().then().

My request refers more to the out-of-the-box capabilities of cast when it comes to standard scenarios like this. Datetime strings are a similar case.

Concrete use case: When I have a dict of dtypes (schema), sometimes I just want to do df.with_columns([pl.col(c).cast(t) for c, t in schema.items()]), without having to write extra code for Utf8 > Bool and Utf8 > Datetime columns.
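For illustration, a minimal sketch of the special-casing this currently forces onto the user's side; the schema, column names and datetime format below are hypothetical:

import polars as pl

schema = {"id": pl.Int64, "active": pl.Boolean, "created": pl.Datetime}  # hypothetical

def cast_with_schema(df: pl.DataFrame, schema: dict) -> pl.DataFrame:
    exprs = []
    for name, dtype in schema.items():
        col = pl.col(name)
        if dtype == pl.Boolean:
            # a plain .cast(pl.Boolean) raises NotYetImplemented on Utf8
            exprs.append(col.str.contains(r"(?i)^true$").alias(name))
        elif dtype == pl.Datetime:
            # datetime strings need an explicit strptime instead of a cast
            exprs.append(col.str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
        else:
            exprs.append(col.cast(dtype))
    return df.with_columns(exprs)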

DrMaphuse commented Sep 08 '22

I don't think a cast should be the way to solve that problem. I think you should be explicit. Such a cast also wouldn't match the behavior of any of the languages we target:

>>> bool("False")
True
>>> bool("false")
True

And in Rust such a cast is not even allowed; you are required to do a string comparison. I think the latter is best.

ritchie46 commented Sep 08 '22

I get your point. It would still be extremely useful to have a way to programmatically access the "smart" string-parsing functionality of the csv reader, even if we don't strictly refer to it as a "cast" anymore.

DrMaphuse commented Sep 08 '22

def cast_str_to_bool(name: str) -> pl.Expr:
    return pl.col(name).str.contains(r'^"?true"?$/gi')

ritchie46 commented Sep 09 '22

@DrMaphuse: could you be more specific on why it is extremely useful? Do you find that the alternatives, such as the one Ritchie posted above, do not work for your use case? Or is performance blocking you, and are you hoping the parser of the csv reader can alleviate that? That would help us better understand your use case.

zundertj commented Sep 11 '22

I use Polars for automated data pipelines and often need to convert data ingested in string format into known schemata. If I were to round-trip the data through CSV, it would take 2 lines of code: df.write_csv('file.csv') and df = pl.read_csv('file.csv', schema=known_schema). If I want to achieve the same without the round trip to CSV, I need to write custom functions for bool, datetime and other more sophisticated string conversions. Sure, I can pack them all together into a single function and get a 2-liner again, but it seems less than ideal to me to try to replicate something that already exists somewhere in the csv reader.

DrMaphuse commented Sep 11 '22

Ok, I understand that using the csv parser in its current form is not ideal in your use case. Coming back to the question in my previous post, what is wrong with the alternative Ritchie posted for your use case? It is a straightforward, flexible and performant solution in my opinion.

zundertj commented Sep 12 '22

Utility of a csv-equivalent .parse_str() method

I am going to try to add a little more detail to my suggestion here after thinking about it.

What I am suggesting (let's call it the .parse_str() method; a hypothetical sketch follows the list below) would yield several advantages:

  • it would be intuitive as an equivalent to the CSV parser
  • it would reduce the necessity for custom functions on the user's end (less code, less work)
  • it would provide standardization of identical use cases
  • error potential through individual, hand-rolled solutions would be reduced (this one is big - string parsing is a very messy business that benefits massively from standardization)
  • duplicated functionality would be avoided (assuming there is a way to pass string columns to the parser efficiently)
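To make the suggestion concrete, here is a purely hypothetical sketch of how such a method could look; the name, signature and inference behaviour are all assumptions, and none of this exists in polars today:

import polars as pl

df = pl.DataFrame({
    "flag": ["true", "False", ""],
    "when": ["2022-09-13 08:00:00", "", "2022-01-01 00:00:00"],
})

# hypothetical: reuse the csv reader's field parsing on in-memory Utf8 columns
parsed = df.with_columns([
    pl.col("flag").str.parse_str(pl.Boolean),   # hypothetical method
    pl.col("when").str.parse_str(pl.Datetime),  # hypothetical method
])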

I think Ritchie's second suggestion showcases what I mean to an extent, because it doesn't work as posted, and it doesn't cover Nulls:

def cast_str_to_bool(name: str):
    # the "/gi" suffix is JavaScript regex-literal syntax; the Rust regex engine
    # treats it as literal characters, so this pattern never matches anything
    return pl.col(name).str.contains(r'^"?true"?$/gi')

str_df = pl.DataFrame(["true", "false", ""] * 1_000_000)

print(str_df.select(cast_str_to_bool('column_0')).slice(0, 3))
shape: (3, 1)
┌──────────┐
│ column_0 │
│ ---      │
│ bool     │
╞══════════╡
│ false    │
├╌╌╌╌╌╌╌╌╌╌┤
│ false    │
├╌╌╌╌╌╌╌╌╌╌┤
│ false    │
└──────────┘
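For contrast, here is a sketch of a version that does run and maps empty or unrecognised strings to Null, assuming the inline (?i) case-insensitive flag of the Rust regex engine that polars uses:

def cast_str_to_bool(name: str) -> pl.Expr:
    s = pl.col(name)
    return (
        pl.when(s.str.contains(r'(?i)^"?true"?$'))
        .then(True)
        .when(s.str.contains(r'(?i)^"?false"?$'))
        .then(False)
        .otherwise(None)  # empty or unrecognised strings become null
        .alias(name)
    )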

Performance comparison

Just for the heck of it, I took the liberty of abusing the csv parser to parse a boolean string column in-memory, and it seems to be faster than the best hand-rolled solution I could come up with that also handles Nulls (although both approaches could surely be done much better):

from io import StringIO
import polars as pl

str_df = pl.DataFrame(["true", "false", ""] * 1_000_000)

%%timeit
# join all rows into one newline-separated string, then re-parse it as headerless CSV
pure_str = str_df.get_column("column_0").str.concat("\n")[0]
str_obj = StringIO(pure_str)
parsed_csv = pl.read_csv(str_obj, has_header=False)

142 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
parsed_pl = str_df.select(
    pl.when(pl.col("column_0").is_in(["true", "True", "TRUE"]))
    .then(True)
    .when(pl.col("column_0").is_in(["false", "False", "FALSE"]))
    .then(False)
    .otherwise(None)
    .alias('column_1')
)

183 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# check for equality (parsed_csv is missing the last row)
parsed_csv.frame_equal(parsed_pl[:-1])

True

Side notes

Null at the end of a CSV

As an aside, I've noticed that there is still an issue with Nulls in the csv parser: it drops the last row if it is empty / Null:

pure_str[-10:]

'rue\nfalse\n'

parsed_csv[-1:]
shape: (1, 1)
┌──────────┐
│ column_1 │
│ ---      │
│ bool     │
╞══════════╡
│ false    │
└──────────┘
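A minimal repro suggests why: joining ["true", "false", ""] with "\n" yields 'true\nfalse\n', so the empty final element collapses into a plain trailing newline, which the reader treats as terminating the last record rather than as a Null row:

from io import StringIO
import polars as pl

# the trailing newline is read as end-of-input, not as an empty final row
print(pl.read_csv(StringIO("true\nfalse\n"), has_header=False).height)  # 2, not 3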

Discrepancy in column indexing between pl.read_csv() and pl.DataFrame()

I've noticed that the DataFrame constructor creates 0-indexed columns starting with 'column_0' by default, whereas the csv parser starts at 'column_1'. I don't know if this is by design; at first glance, it seems inconsistent.
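A quick illustration (behaviour as observed above; defaults may vary across versions):

from io import StringIO
import polars as pl

print(pl.DataFrame(["a", "b"]).columns)                         # ['column_0']
print(pl.read_csv(StringIO("a\nb"), has_header=False).columns)  # ['column_1']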

DrMaphuse commented Sep 13 '22