polars
Extract columns using capture groups (`extract`/`extract_all`)
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of polars.
Issue Description
`extract_all` does not work like pandas' `str.extract`: capture groups are not respected when there is more than one. Calling `extract` multiple times works, but I would rather not run the same regular expression multiple times.
In pandas you can leverage regular expression capture groups to extract fields (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html).
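To make the distinction concrete, here is a minimal sketch using only Python's standard `re` module (not polars or pandas). It contrasts counting whole matches, which is what the list lengths in the report below reflect, with reading the capture groups of a single match, which is what the report asks for. The helper names `whole_matches` and `first_match_groups` are hypothetical and exist only for this illustration.

```python
import re

# Pattern from the report: three capture groups
# (source IP, destination IP, byte count).
IP = r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
PATTERN = re.compile(rf"({IP}) to ({IP}) bytes (\d{{1,5}})")

lines = [
    "192.135.453.34 to 234.234.2.3 bytes 123",
    "2.23.4.5 to 123.4.123.4 bytes 12",
]


def whole_matches(line):
    """Return every non-overlapping full match, ignoring capture groups."""
    return [m.group(0) for m in PATTERN.finditer(line)]


def first_match_groups(line):
    """Return the capture groups of the first match, or None if no match."""
    m = PATTERN.search(line)
    return m.groups() if m else None


for line in lines:
    # One whole match per line, but three captured fields inside it.
    print(len(whole_matches(line)), first_match_groups(line))
```

Each line contains exactly one full match, so a "list of all matches" has length 1, while the same match carries three captured fields.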
Reproducible Example
use polars::prelude::*;

fn main() -> Result<()> {
    let df = df!("lines" => ["192.135.453.34 to 234.234.2.3 bytes 123", "2.23.4.5 to 123.4.123.4 bytes 12"])?;
    let df = df.lazy().select([
        col("lines").str().extract_all(r"(\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) to (\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) bytes (\d{1,5})")
    ]).collect().unwrap().lazy().select([
        col("lines").arr().lengths(),
        // col("lines").arr().get(1).alias("srcip"),
        // col("lines").arr().get(2).alias("dstip"),
        // col("lines").arr().get(3).alias("bytes"),
    ]);
    println!("{:?}", df.collect().unwrap());
    Ok(())
}
Will print:
shape: (2, 1)
┌───────┐
│ lines │
│ --- │
│ u32 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 1 │
└───────┘
Expected Behavior
Should print:
shape: (2, 1)
┌───────┐
│ lines │
│ --- │
│ u32 │
╞═══════╡
│ 3 │
├╌╌╌╌╌╌╌┤
│ 3 │
└───────┘
Installed Versions
polars = {version = "0.23.2", features = ["lazy", "strings", "list", "list_eval"]}
Using pandas I would do the following:
import pandas as pd
df = pd.DataFrame(["192.135.453.34 to 234.234.2.3 bytes 123", "2.23.4.5 to 123.4.123.4 bytes 12"])
df[["srcip", "dstip", "bytes"]] = df[0].str.extract(r"(\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) to (\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) bytes (\d{1,5})")
print(df)
Will print:
                                         0           srcip        dstip bytes
0  192.135.453.34 to 234.234.2.3 bytes 123  192.135.453.34  234.234.2.3   123
1         2.23.4.5 to 123.4.123.4 bytes 12        2.23.4.5  123.4.123.4    12
This is also my first time using polars, so I may be doing something wrong, please let me know!
See also #3775. I think assigning names to the columns when returning multiple values is the main roadblock at the moment, and Polars doesn't use the Pandas-style column/multi-column assignment (in the documentation, at least), preferring `df.select()` and `df.with_columns`.
`Expr.str.split_exact()` returns its output as a struct instead, and the docs recommend unnesting to get the values, so that's always an option for implementing this.
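The struct-then-unnest idea can be sketched without polars, using only Python's standard `re` module. This illustrates the shape of the data, not how polars implements anything. Named groups supply the field names, each row becomes a dict (standing in for a struct), and a hypothetical `unnest` helper spreads those fields into columns. The group names `srcip`, `dstip`, and `bytes` come from the report; `extract_struct` and `unnest` are made up for this example, and the IP sub-pattern is written in an equivalent shorter form.

```python
import re

# Named capture groups provide the field (column) names.
PATTERN = re.compile(
    r"(?P<srcip>\b\d{1,3}(?:\.\d{1,3}){3}\b) to "
    r"(?P<dstip>\b\d{1,3}(?:\.\d{1,3}){3}\b) bytes "
    r"(?P<bytes>\d{1,5})"
)


def extract_struct(line):
    """Return one dict per row: group name -> captured text (None if no match)."""
    m = PATTERN.search(line)
    if m is None:
        return {name: None for name in PATTERN.groupindex}
    return m.groupdict()


def unnest(records):
    """Turn a list of per-row dicts into a dict of columns."""
    return {name: [r[name] for r in records] for name in PATTERN.groupindex}


lines = [
    "192.135.453.34 to 234.234.2.3 bytes 123",
    "2.23.4.5 to 123.4.123.4 bytes 12",
]
records = [extract_struct(line) for line in lines]
columns = unnest(records)
print(columns)
```

The regular expression runs once per row, and the naming problem mentioned above is handled by the group names rather than by positional assignment.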
@stinodego @BenJeau I think this issue can be closed, since it has been added here: https://github.com/pola-rs/polars/pull/10179