Add methods `enumerate` and `list.enumerate` that work like python `enumerate`
Description
Postgres has a nice feature that allows you to specify WITH ORDINALITY after a function call to get an index / row number for the function's result set. The best docs I can find on it are here.
By way of example:
import polars as pl
con = "postgresql://..."
query = """
SELECT CAST(observation_date AS DATE), date_num
FROM generate_series(
-- start, stop, step
current_date - interval '1 week',
current_date - interval '1 day',
interval '1 day'
) WITH ORDINALITY AS d(observation_date, date_num)
"""
print(pl.read_database_uri(query, con))
# shape: (7, 2)
# ┌──────────────────┬──────────┐
# │ observation_date ┆ date_num │
# │ --- ┆ --- │
# │ date ┆ i64 │
# ╞══════════════════╪══════════╡
# │ 2024-01-07 ┆ 1 │
# │ 2024-01-08 ┆ 2 │
# │ 2024-01-09 ┆ 3 │
# │ 2024-01-10 ┆ 4 │
# │ 2024-01-11 ┆ 5 │
# │ 2024-01-12 ┆ 6 │
# │ 2024-01-13 ┆ 7 │
# └──────────────────┴──────────┘
explode_query = """
WITH df(num, val) AS (VALUES
(1, array['a', 'c', 'e']),
(2, array['b', 'd', 'f'])
)
SELECT num, s.val, s.array_index
-- `unnest` is like polars `explode`
FROM df, UNNEST(val) WITH ORDINALITY AS s(val, array_index)
"""
print(pl.read_database_uri(explode_query, con))
# shape: (6, 3)
# ┌─────┬─────┬─────────────┐
# │ num ┆ val ┆ array_index │
# │ --- ┆ --- ┆ --- │
# │ i32 ┆ str ┆ i64 │
# ╞═════╪═════╪═════════════╡
# │ 1 ┆ a ┆ 1 │
# │ 1 ┆ c ┆ 2 │
# │ 1 ┆ e ┆ 3 │
# │ 2 ┆ b ┆ 1 │
# │ 2 ┆ d ┆ 2 │
# │ 2 ┆ f ┆ 3 │
# └─────┴─────┴─────────────┘
I think this is a compelling feature when combined with some sort of polars *_range(s) function, and even more so with explode, to get the index of the element along with its value.
The current way to get the same results as the explode_query in polars would be
df = pl.DataFrame({
    "num": [1, 2],
    "val": [["a", "c", "e"], ["b", "d", "f"]]
})
# One way - add the index before and explode together
df.with_columns(pl.int_ranges(0, pl.col("val").list.len()).alias("array_index")).explode("val", "array_index")
# Another way - add the index after the explode
# the array_index is somewhat lost - this depends on the explode returning the elements in order
df.explode("val").with_columns(pl.int_range(pl.count()).over("num").alias("array_index"))
# Both return the below
# shape: (6, 3)
# ┌─────┬─────┬─────────────┐
# │ num ┆ val ┆ array_index │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ i64 │
# ╞═════╪═════╪═════════════╡
# │ 1 ┆ a ┆ 0 │
# │ 1 ┆ c ┆ 1 │
# │ 1 ┆ e ┆ 2 │
# │ 2 ┆ b ┆ 0 │
# │ 2 ┆ d ┆ 1 │
# │ 2 ┆ f ┆ 2 │
# └─────┴─────┴─────────────┘
The only difference vs Postgres is the 0-based index in polars, which I'm definitely not suggesting to change.
Edit: After some discussion below, I think adding enumerate and list.enumerate methods should cover what I'm after. Proposed usage/behaviour in my comment below.
~~The request I'm making is to add a parameter to the functions, which would change the return value to a struct and include an index column. Details and names can be discussed if the request is accepted, but something like~~
I had previously checked to see if Polars had an enumerate() function.
(Although in this case, I guess it would be .list.enumerate())
df.with_columns(
    pl.col("val").list.eval(
        pl.struct(
            index = pl.cum_count(),
            value = pl.element()
        )
    )
)
# shape: (2, 2)
# ┌─────┬─────────────────────────────┐
# │ num ┆ val │
# │ --- ┆ --- │
# │ i64 ┆ list[struct[2]] │
# ╞═════╪═════════════════════════════╡
# │ 1 ┆ [{1,"a"}, {2,"c"}, {3,"e"}] │
# │ 2 ┆ [{1,"b"}, {2,"d"}, {3,"f"}] │
# └─────┴─────────────────────────────┘
Just thought I'd mention it, as it could be useful as a general addition instead of special-casing .explode.
@cmdlineluser - True, this is a lot like python's enumerate. Probably also a more recognisable / discoverable name than "with ordinality", haha.
Thank you very much for the code snippet and the suggestion - definitely on board with more generalised and composable, rather than special cases for only certain functions.
You inspired me to do a basic implementation!
In order to get a list of structs (rather than a struct of lists), list.eval also does the trick.
That way both enumerates return a struct.
Not really sold on which is better as a default (list of structs or struct of lists) though, will give it some more thought.
def enumerate(self, name: str = "index") -> pl.Expr:
    """Args: name (str, optional): Name of the index column. Defaults to "index"."""
    return pl.struct(
        pl.int_range(pl.count()).alias(name),
        self,
    ).alias(self.meta.output_name())
def list_enumerate(self, name: str = "index") -> pl.Expr:
    return pl.struct(
        pl.int_ranges(0, self.list.len()).alias(name),
        self,
    ).alias(self.meta.output_name())
pl.Expr.enumerate = enumerate
# Can't figure out how to monkey patch this onto the list namespace, but not the point
pl.Expr.list_enumerate = list_enumerate
df = pl.DataFrame({"num": [1, 2], "val": [["a", "c", "e"], ["b", "d", "f"]]})
print(
    df.select(
        "num",
        pl.col("val").enumerate().alias("plain_enumerate"),
        pl.col("val").list_enumerate().alias("list_enumerate"),
        pl.col("val").list.eval(pl.element().enumerate()).alias("eval_enumerate"),
    )
    # then to get the data into a completely flat format, do one of these
    # .unnest("list_enumerate").explode("index", "val")
    # the "val" col name is lost in the "eval_enumerate" because of `pl.element()` - will open an issue
    # .explode("eval_enumerate").unnest("eval_enumerate")
)
# shape: (2, 4)
# ┌─────┬─────────────────────┬─────────────────────────────┬─────────────────────────────┐
# │ num ┆ plain_enumerate ┆ list_enumerate ┆ eval_enumerate │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ struct[2] ┆ struct[2] ┆ list[struct[2]] │
# ╞═════╪═════════════════════╪═════════════════════════════╪═════════════════════════════╡
# │ 1 ┆ {0,["a", "c", "e"]} ┆ {[0, 1, 2],["a", "c", "e"]} ┆ [{0,"a"}, {1,"c"}, {2,"e"}] │
# │ 2 ┆ {1,["b", "d", "f"]} ┆ {[0, 1, 2],["b", "d", "f"]} ┆ [{0,"b"}, {1,"d"}, {2,"f"}] │
# └─────┴─────────────────────┴─────────────────────────────┴─────────────────────────────┘
Giving what I wrote earlier some more thought:
- like `df.with_row_index` and python's builtin `enumerate`, it would be worthwhile adding an `offset` parameter to start at a number other than 0
- the plain `enumerate` may not really seem super useful on its own, but does offer good utility when applied inside `list.eval`