
Extract columns using capture groups (`extract`/`extract_all`)

Open BenJeau opened this issue 2 years ago • 1 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of polars.

Issue Description

extract_all does not work like pandas' str.extract/str.extractall: when the pattern contains more than one capture group, the groups are not respected. Calling extract multiple times works, but I would rather not run the same regular expression multiple times.

In pandas you can leverage regular expression capture groups to extract fields (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html).
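To make the distinction concrete, here is a minimal sketch using Python's standard `re` module rather than Polars. It assumes (based on the list length of 1 reported below) that extract_all collects every whole match of the pattern and ignores capture groups. The helper names `whole_matches` and `capture_groups` are made up for illustration, and Python's regex engine is not the one Polars uses.

```python
import re

# Same pattern and sample lines as in the issue.
PATTERN = re.compile(
    r"(\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) to "
    r"(\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) bytes (\d{1,5})"
)
LINES = [
    "192.135.453.34 to 234.234.2.3 bytes 123",
    "2.23.4.5 to 123.4.123.4 bytes 12",
]


def whole_matches(line):
    """Every non-overlapping full match (group 0), ignoring capture groups."""
    return [m.group(0) for m in PATTERN.finditer(line)]


def capture_groups(line):
    """Capture groups of the first match, or None if nothing matches."""
    m = PATTERN.search(line)
    return list(m.groups()) if m else None


for line in LINES:
    # One whole match per line, but three captured fields per match.
    print(len(whole_matches(line)), capture_groups(line))
```

Each line yields a single whole match, which is consistent with a list length of 1, while the three fields only appear once the capture groups of that match are read.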

Reproducible Example

use polars::prelude::*;

fn main() -> Result<()> {
    let df = df!("lines" => ["192.135.453.34 to 234.234.2.3 bytes 123", "2.23.4.5 to 123.4.123.4 bytes 12"])?;

    let df = df.lazy().select([
        col("lines").str().extract_all(r"(\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) to (\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) bytes (\d{1,5})")
    ]).collect().unwrap().lazy().select([
        col("lines").arr().lengths(),
        // col("lines").arr().get(1).alias("srcip"),
        // col("lines").arr().get(2).alias("dstip"),
        // col("lines").arr().get(3).alias("bytes"),
    ]);

    println!("{:?}", df.collect().unwrap());

    Ok(())
}

Will print:

shape: (2, 1)
┌───────┐
│ lines │
│ ---   │
│ u32   │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ 1     │
└───────┘

Expected Behavior

Should print:

shape: (2, 1)
┌───────┐
│ lines │
│ ---   │
│ u32   │
╞═══════╡
│ 3     │
├╌╌╌╌╌╌╌┤
│ 3     │
└───────┘

Installed Versions

polars = {version = "0.23.2", features = ["lazy", "strings", "list", "list_eval"]}

BenJeau avatar Sep 06 '22 23:09 BenJeau

Using pandas I would do the following:

import pandas as pd

df = pd.DataFrame(["192.135.453.34 to 234.234.2.3 bytes 123", "2.23.4.5 to 123.4.123.4 bytes 12"])
df[["srcip", "dstip", "bytes"]] = df[0].str.extract(r"(\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) to (\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) bytes (\d{1,5})")
print(df)

Will print:

                                         0           srcip        dstip bytes
0  192.135.453.34 to 234.234.2.3 bytes 123  192.135.453.34  234.234.2.3   123
1         2.23.4.5 to 123.4.123.4 bytes 12        2.23.4.5  123.4.123.4    12

This is also my first time using polars, so I may be doing something wrong, please let me know!

BenJeau avatar Sep 06 '22 23:09 BenJeau

See also #3775. I think the main roadblock at the moment is assigning names to the columns when multiple values are returned. Polars doesn't use pandas-style column/multi-column assignment (in the documentation, at least), preferring df.select() and df.with_columns().

Expr.str.split_exact() outputs a struct instead, and the docs recommend unnesting to get the values, so that's always an option for implementing this.
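The naming problem and the struct-then-unnest idea can be sketched outside Polars with Python's standard `re` module. This is only an illustration of the shape of such an API, not how Polars implements anything. It assumes named capture groups supply the field names; `srcip`, `dstip` and `bytes` come from the commented-out aliases in the issue, and `extract_struct` and `unnest` are hypothetical helpers.

```python
import re

# IPv4-like sub-pattern from the issue, reused for both addresses.
IP = r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

# Named groups carry the output field names (assumed naming scheme).
NAMED = re.compile(
    rf"(?P<srcip>{IP}) to (?P<dstip>{IP}) bytes (?P<bytes>\d{{1,5}})"
)

LINES = [
    "192.135.453.34 to 234.234.2.3 bytes 123",
    "2.23.4.5 to 123.4.123.4 bytes 12",
]


def extract_struct(line):
    """One struct-like dict per row; every field is None when nothing matches."""
    m = NAMED.search(line)
    return m.groupdict() if m else dict.fromkeys(NAMED.groupindex)


def unnest(structs):
    """Turn a list of row dicts into a dict of named columns."""
    return {name: [s[name] for s in structs] for name in NAMED.groupindex}


columns = unnest([extract_struct(line) for line in LINES])
print(columns)
```

The regex runs once per row, and the group names decide the column names, so no multi-column assignment syntax is needed.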

sm-Fifteen avatar Oct 11 '22 14:10 sm-Fifteen

@stinodego @BenJeau I think this issue can be closed, since it has been added here: https://github.com/pola-rs/polars/pull/10179

ion-elgreco avatar Aug 11 '23 17:08 ion-elgreco