`selectors` should support slicing columns

Open samukweku opened this issue 9 months ago • 4 comments

Description

Hi team. I would like to suggest adding a slice method to the selectors module, so that users can select a contiguous range of columns by name:

import polars as pl
import polars.selectors as cs

data = {'City': ['Houston', 'Austin', 'Hoover'],
 'State': ['Texas', 'Texas', 'Alabama'],
 'Name': ['Aria', 'Penelope', 'Niko'],
 'Mango': [4, 10, 90],
 'Orange': [10, 8, 14],
 'Watermelon': [40, 99, 43],
 'Gin': [16, 200, 34],
 'Vodka': [20, 33, 18]}

df = pl.DataFrame(data)

df

┌─────────┬─────────┬──────────┬───────┬────────┬────────────┬─────┬───────┐
│ City    ┆ State   ┆ Name     ┆ Mango ┆ Orange ┆ Watermelon ┆ Gin ┆ Vodka │
│ ---     ┆ ---     ┆ ---      ┆ ---   ┆ ---    ┆ ---        ┆ --- ┆ ---   │
│ str     ┆ str     ┆ str      ┆ i64   ┆ i64    ┆ i64        ┆ i64 ┆ i64   │
╞═════════╪═════════╪══════════╪═══════╪════════╪════════════╪═════╪═══════╡
│ Houston ┆ Texas   ┆ Aria     ┆ 4     ┆ 10     ┆ 40         ┆ 16  ┆ 20    │
│ Austin  ┆ Texas   ┆ Penelope ┆ 10    ┆ 8      ┆ 99         ┆ 200 ┆ 33    │
│ Hoover  ┆ Alabama ┆ Niko     ┆ 90    ┆ 14     ┆ 43         ┆ 34  ┆ 18    │
└─────────┴─────────┴──────────┴───────┴────────┴────────────┴─────┴───────┘

The slicing syntax could be:

df.select(cs.slice('Mango','Vodka')) # alternative - df.select(cs['Mango':'Vodka'])
shape: (3, 5)
┌───────┬────────┬────────────┬─────┬───────┐
│ Mango ┆ Orange ┆ Watermelon ┆ Gin ┆ Vodka │
│ ---   ┆ ---    ┆ ---        ┆ --- ┆ ---   │
│ i64   ┆ i64    ┆ i64        ┆ i64 ┆ i64   │
╞═══════╪════════╪════════════╪═════╪═══════╡
│ 4     ┆ 10     ┆ 40         ┆ 16  ┆ 20    │
│ 10    ┆ 8      ┆ 99         ┆ 200 ┆ 33    │
│ 90    ┆ 14     ┆ 43         ┆ 34  ┆ 18    │
└───────┴────────┴────────────┴─────┴───────┘

samukweku avatar Apr 30 '24 07:04 samukweku

If you know what fields you want, why do you need a selector? Why not use a simple .select("Mango","Vodka")? Or the existing cs.by_name("Mango","Vodka")?

aut0clave avatar Apr 30 '24 10:04 aut0clave

@aut0clave They want to extract the "range of columns" Mango .. Vodka

I believe first/last are the only selectors that are "positional"

>>> cs.first().meta.serialize()
'{"Nth":0}'

There is no .nth() selector, but it would be easy to add:

>>> import io
>>> df.select( pl.Expr.deserialize( io.StringIO("""{"Nth":3}""") ) )
shape: (3, 1)
┌───────┐
│ Mango │
│ ---   │
│ i64   │
╞═══════╡
│ 4     │
│ 10    │
│ 90    │
└───────┘

nth -> column name mapping is done here:

https://github.com/pola-rs/polars/blob/4b23768a7e0b50e39a0c5df8e33321e9b94b6387/crates/polars-plan/src/logical_plan/expr_expansion.rs#L67

From what I can tell, there is nothing that goes the other way, i.e. column name -> nth - which I think would be needed in order to support this at the selector level?

cmdlineluser avatar Apr 30 '24 11:04 cmdlineluser

@cmdlineluser I'd assume there is a way to get the positions of the column names (maybe grab the positions via list.index in Python and pass them to the Rust side). I don't know much about the internal implementation, but I'm happy to learn. I'd also suggest, if the team feels this is a worthwhile addition, that the slicing be limited to column names only (numeric positions should not be supported).

samukweku avatar Apr 30 '24 12:04 samukweku

> @cmdlineluser i'd assume there was a way to get the positions of the column names (maybe grab the positions via list.index from python and pass it to the rust end).

FYI: until we are actually evaluating a lazy query plan we may not know the position of all of the columns (e.g. when expanding a struct, or evaluating earlier selectors). Consequently we can't precompute the positions and pass them down, because the answer is only known at the lower level (selectors are dynamic, and are evaluated internally at the point they are invoked) ;)
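To illustrate why precomputed positions can go stale, here is a plain-Python sketch. It does not touch polars internals; the dropped `State` column stands in for a hypothetical earlier step in the plan, and the variable names are made up for the example.

```python
# Column order as seen when the query is written.
columns = ["City", "State", "Name", "Mango", "Orange", "Watermelon", "Gin", "Vodka"]

# Positions precomputed up front, e.g. with list.index.
start, stop = columns.index("Mango"), columns.index("Vodka")

# Hypothetical earlier step in the plan that drops a column
# before the selector is evaluated.
columns_at_eval = [c for c in columns if c != "State"]

# The stale positions now point one column too far to the right.
stale = columns_at_eval[start : stop + 1]

# Resolving the labels against the schema at evaluation time
# recovers the intended range.
lo, hi = columns_at_eval.index("Mango"), columns_at_eval.index("Vodka")
fresh = columns_at_eval[lo : hi + 1]

print(stale)  # ['Orange', 'Watermelon', 'Gin', 'Vodka']
print(fresh)  # ['Mango', 'Orange', 'Watermelon', 'Gin', 'Vodka']
```

The label-to-position lookup itself is trivial; the point is that it has to happen where the final schema is known.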

Offering index-based selection doesn't seem like a bad idea (we currently only support selection by name/dtype and the special cases of first/last, as noted by @cmdlineluser), but would need some internal additions to be possible 🤔

alexander-beedie avatar Apr 30 '24 13:04 alexander-beedie

@cmdlineluser so something like cs.by_position, cs.by_range?

samukweku avatar May 03 '24 23:05 samukweku

@alexander-beedie is the person to ask. (they created selectors :-D)

cmdlineluser avatar May 04 '24 00:05 cmdlineluser

> @cmdlineluser so something like cs.by_position, cs.by_range?

Probably cs.by_index, which would take one or more index values, a range, or a slice (as range/slice can be directly expanded into a list of indexes, so internally we just need to handle that). Does need additional low-level support though.
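As a rough sketch of the "expand into a list of indexes" step, assuming the number of columns is known at that point: `expand_indices` is a hypothetical helper written for this example, not the polars implementation.

```python
from typing import Iterable, List, Union

IndexSpec = Union[int, range, slice]


def expand_indices(specs: Iterable[IndexSpec], n_columns: int) -> List[int]:
    """Flatten ints, ranges and slices into a list of column indexes."""
    out: List[int] = []
    for spec in specs:
        if isinstance(spec, slice):
            # slice.indices clamps and normalises start/stop/step for this length.
            out.extend(range(*spec.indices(n_columns)))
        elif isinstance(spec, range):
            # Ranges are taken as given; no normalisation is applied.
            out.extend(spec)
        else:
            # Single index: map a negative value to its position from the end.
            out.append(spec if spec >= 0 else spec + n_columns)
    return out


print(expand_indices([0, range(3, 5), slice(-2, None)], n_columns=8))
# [0, 3, 4, 6, 7]
```

Once everything is a flat list of integers, the lower level only needs to handle one case.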

alexander-beedie avatar May 04 '24 07:05 alexander-beedie

FYI, forgot to update this issue, but we do now have a new cs.by_index selector which can take indices and ranges, which gets you some of the way there: https://github.com/pola-rs/polars/pull/16217
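Until label-based slicing exists, one possible workaround for an eager DataFrame is to turn the two labels into a positional range and hand that to cs.by_index. A minimal sketch, where `label_range` is a hypothetical helper and the column list mirrors the example frame from the issue description:

```python
from typing import List


def label_range(columns: List[str], start: str, stop: str) -> range:
    """Return the inclusive positional range spanning two column labels."""
    lo, hi = columns.index(start), columns.index(stop)
    if lo > hi:
        raise ValueError(f"{start!r} comes after {stop!r} in the column order")
    return range(lo, hi + 1)


columns = ["City", "State", "Name", "Mango", "Orange", "Watermelon", "Gin", "Vodka"]
idx = label_range(columns, "Mango", "Vodka")
print(idx)  # range(3, 8)

# With polars installed, and cs being polars.selectors:
#   df.select(cs.by_index(label_range(df.columns, "Mango", "Vodka")))
```

This only works where the column order is already known, such as `df.columns` on an eager DataFrame; for a lazy query the schema would have to be resolved first, for the reasons given earlier in the thread.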

alexander-beedie avatar Jul 03 '24 13:07 alexander-beedie

Thanks @alexander-beedie. Looks good. Safe to assume that slicing with labels may be implemented at a future date?

samukweku avatar Jul 03 '24 14:07 samukweku

> Thanks @alexander-beedie. Looks good. Safe to assume that slicing with labels may be implemented at a future date?

Probably, but no timeline; the 1.0 (and a few quick point releases to address any related issues) has priority at the moment. And I'm on vacation for the next two weeks ;)

alexander-beedie avatar Jul 03 '24 16:07 alexander-beedie