Add `pl.read_lines` and `pl.scan_lines` input methods

Open orlp opened this issue 1 year ago • 2 comments

I propose we add the following function (and pl.scan_lines completely analogously):

pl.read_lines(
    source: str | Path | IO[str] | IO[bytes] | bytes,
    *,
    n_rows: int | None = None,
    encoding: str = 'utf8',
    eol_char: str = '\n',
)

This creates a DataFrame/LazyFrame with a single column, line, that contains each line of the file in order as a pl.String datatype.
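
The intended semantics can be sketched in plain Python, without Polars. This is only an illustration under stated assumptions: `read_lines_reference` is a hypothetical helper, it accepts only `bytes`, it returns a list instead of a DataFrame, and dropping the empty piece after a trailing terminator is a guess that the proposal does not spell out.

```python
from typing import List, Optional


def read_lines_reference(
    source: bytes,
    *,
    n_rows: Optional[int] = None,
    encoding: str = "utf8",
    eol_char: str = "\n",
) -> List[str]:
    """Hypothetical reference for the proposed semantics (not Polars code)."""
    # Decode the whole input first, then split on the line terminator.
    text = source.decode(encoding)
    lines = text.split(eol_char)
    # Assumption: a trailing terminator does not produce an extra empty line.
    if lines and lines[-1] == "":
        lines.pop()
    # n_rows caps the number of lines returned; None means "all lines".
    return lines if n_rows is None else lines[:n_rows]
```

With this sketch, `read_lines_reference(b"a\nb\nc\n", n_rows=2)` yields the first two lines, which would become the first two rows of the `line` column.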


This can be very handy for parsing ad-hoc file formats, log files, etc. For example, to summarize a log file app.log:

[2017-11-09T02:12:24Z ERROR main] this is an error message
[2017-11-09T02:12:24Z ERROR main] another error occurred
[2017-11-09T02:12:24Z ERROR main] the answer was: 12
[2017-11-10T02:12:24Z DEBUG foo] niii
[2017-11-11T02:12:24Z INFO foo] blabla
[2017-11-11T02:12:24Z WARN foo] scary

We can write the following:

import polars as pl

# pl.scan_lines is the function proposed in this issue.
(
    pl.scan_lines("app.log")
    .select(pl.col.line.str.extract_groups(
        r"\[(?<time>\S+) (?<level>\S+) (?<file>\S+)\]"
    ).struct.field("*"))
    .with_columns(time=pl.col.time.str.to_datetime())
    .group_by(pl.col.time.dt.date(), "file", "level")
    .len()
    .sort("time")
    .collect()  # materialize the LazyFrame
)

and get

┌────────────┬──────┬───────┬─────┐
│ time       ┆ file ┆ level ┆ len │
│ ---        ┆ ---  ┆ ---   ┆ --- │
│ date       ┆ str  ┆ str   ┆ u32 │
╞════════════╪══════╪═══════╪═════╡
│ 2017-11-09 ┆ main ┆ ERROR ┆ 3   │
│ 2017-11-10 ┆ foo  ┆ DEBUG ┆ 1   │
│ 2017-11-11 ┆ foo  ┆ INFO  ┆ 1   │
│ 2017-11-11 ┆ foo  ┆ WARN  ┆ 1   │
└────────────┴──────┴───────┴─────┘
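
As a cross-check of what the query computes, the same summary can be sketched in plain Python. This is not how Polars would execute it: `summarize` is a hypothetical helper, skipping non-matching lines is an assumption, and the rows are sorted by the full key (not only by date) to make the order deterministic. Note that Python's `re` module spells named groups `(?P<name>...)`, whereas the query above uses the `(?<name>...)` form accepted by the Rust regex engine.

```python
import re
from collections import Counter
from datetime import datetime

LOG = """\
[2017-11-09T02:12:24Z ERROR main] this is an error message
[2017-11-09T02:12:24Z ERROR main] another error occurred
[2017-11-09T02:12:24Z ERROR main] the answer was: 12
[2017-11-10T02:12:24Z DEBUG foo] niii
[2017-11-11T02:12:24Z INFO foo] blabla
[2017-11-11T02:12:24Z WARN foo] scary
"""

PATTERN = re.compile(r"\[(?P<time>\S+) (?P<level>\S+) (?P<file>\S+)\]")


def summarize(lines):
    """Count log lines per (date, file, level); hypothetical helper."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # assumption: ignore lines without a header
        day = datetime.strptime(match["time"], "%Y-%m-%dT%H:%M:%SZ").date()
        counts[(day, match["file"], match["level"])] += 1
    # One row per group: (date, file, level, count), sorted by the full key.
    return sorted((*key, n) for key, n in counts.items())


for row in summarize(LOG.splitlines()):
    print(row)
```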

orlp, Oct 10 '24 17:10

Related to https://github.com/pola-rs/polars/issues/18588

etiennebacher, Oct 10 '24 22:10

@etiennebacher More than related, it's a dupe. I did search but didn't find it. I'll close that one in favor of this one though, as this one has a more fleshed-out API and a motivating example.

orlp, Oct 10 '24 23:10

Hi @orlp, I'm interested in working on this. I took a brief look at some input methods, and I think a simple way of implementing this is reusing other input methods or readers, specifically CsvReader, because fundamentally it also reads lines before parsing CSV rows.

I tried the following to mimic the behaviour of read_lines and it seems okay:

>>> df = pl.read_csv("/tmp/log.log", has_header=False, new_columns=["line"], separator="\n", eol_char="\n")
>>> df
shape: (6, 1)
┌─────────────────────────────────┐
│ line                            │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ [2017-11-09T02:12:24Z ERROR ma… │
│ [2017-11-09T02:12:24Z ERROR ma… │
│ [2017-11-09T02:12:24Z ERROR ma… │
│ [2017-11-10T02:12:24Z DEBUG fo… │
│ [2017-11-11T02:12:24Z INFO foo… │
│ [2017-11-11T02:12:24Z WARN foo… │
└─────────────────────────────────┘
>>> df.select(
...     pl.col.line.str.extract_groups(
...             r"\[(?<time>\S+) (?<level>\S+) (?<file>\S+)\]"
...     ).struct.field("*")
... ).with_columns(
...     time=pl.col.time.str.to_datetime()
... ).group_by(
...     pl.col.time.dt.date(), "file", "level"
... ).len().sort("time")
shape: (4, 4)
┌────────────┬──────┬───────┬─────┐
│ time       ┆ file ┆ level ┆ len │
│ ---        ┆ ---  ┆ ---   ┆ --- │
│ date       ┆ str  ┆ str   ┆ u32 │
╞════════════╪══════╪═══════╪═════╡
│ 2017-11-09 ┆ main ┆ ERROR ┆ 3   │
│ 2017-11-10 ┆ foo  ┆ DEBUG ┆ 1   │
│ 2017-11-11 ┆ foo  ┆ INFO  ┆ 1   │
│ 2017-11-11 ┆ foo  ┆ WARN  ┆ 1   │
└────────────┴──────┴───────┴─────┘

The key is to always set separator to eol_char to ensure that we get only one column per line.

What do you think? If this approach, i.e. calling read_csv/scan_csv or factoring out the read line logic in these methods, sounds okay to you, I can start working on this.

changhc, May 28 '25 22:05

@changhc I want the implementation to be in Rust, not Python. If you re-use the CSV reader internally on the Rust side (at least for an initial proof-of-concept), I would be fine with that. There are more options you should set though, like infer_schema=False.

orlp, May 28 '25 22:05

Okay. So do you mean that you want a new method of polars_python::dataframe::PyDataFrame, e.g. polars_python::dataframe::PyDataFrame::read_lines, that crafts data frames possibly using CsvReader and a new function read_lines on the Python side that calls PyDataFrame.read_lines?

Just to understand this more: what is the benefit of doing this in Rust? Is it mainly for code structure and organisation? Performance-wise I think implementing this purely in Python should be comparable because it will still be calling PyDataFrame.read_csv.

changhc, May 28 '25 23:05

No, I want it even deeper. I want a line source in polars-stream/src/nodes/io_sources/, which can call the CSV implementation internally for now. Everything above that should be done properly, not knowing that deep down below it's using the CSV reader implementation. It should be easy to swap it out for a proper implementation later.

This involves quite a bit of setup to mirror everything for CSV, e.g. a LinesReadOptions struct in polars/crates/polars-io/src/lines.rs and a LazyLinesReader in polars/crates/polars-lazy/src/scan/lines.rs.

orlp, May 28 '25 23:05

I see. Sure, I can do this. If I need to implement all layers instead of only a higher-level one, I might as well implement them properly. I'll follow how CsvReader was implemented and reuse some of the logic there, e.g. the byte-reading part.

changhc, May 29 '25 06:05