Add methods `enumerate` and `list.enumerate` that work like python `enumerate`

Open henryharbeck opened this issue 2 years ago • 3 comments

Description

Postgres has a nice feature that lets you specify WITH ORDINALITY after a function call to get an index / row number for the function's result set. The best docs I can find on it are here.

By way of example:

import polars as pl

con = "postgresql://..."

query = """
SELECT CAST(observation_date AS DATE), date_num
FROM generate_series(
	-- start, stop, step
	current_date - interval '1 week',
	current_date - interval '1 day',
	interval '1 day'
) WITH ORDINALITY AS d(observation_date, date_num)
"""
print(pl.read_database_uri(query, con))

# shape: (7, 2)
# ┌──────────────────┬──────────┐
# │ observation_date ┆ date_num │
# │ ---              ┆ ---      │
# │ date             ┆ i64      │
# ╞══════════════════╪══════════╡
# │ 2024-01-07       ┆ 1        │
# │ 2024-01-08       ┆ 2        │
# │ 2024-01-09       ┆ 3        │
# │ 2024-01-10       ┆ 4        │
# │ 2024-01-11       ┆ 5        │
# │ 2024-01-12       ┆ 6        │
# │ 2024-01-13       ┆ 7        │
# └──────────────────┴──────────┘

explode_query = """
WITH df(num, val) AS (VALUES
	(1, array['a', 'c', 'e']),
	(2, array['b', 'd', 'f'])
)

SELECT num, s.val, s.array_index
-- `unnest` is like polars `explode`
FROM df, UNNEST(val) WITH ORDINALITY AS s(val, array_index)
"""
print(pl.read_database_uri(explode_query, con))

# shape: (6, 3)
# ┌─────┬─────┬─────────────┐
# │ num ┆ val ┆ array_index │
# │ --- ┆ --- ┆ ---         │
# │ i32 ┆ str ┆ i64         │
# ╞═════╪═════╪═════════════╡
# │ 1   ┆ a   ┆ 1           │
# │ 1   ┆ c   ┆ 2           │
# │ 1   ┆ e   ┆ 3           │
# │ 2   ┆ b   ┆ 1           │
# │ 2   ┆ d   ┆ 2           │
# │ 2   ┆ f   ┆ 3           │
# └─────┴─────┴─────────────┘

I think this is a compelling feature when combined with some sort of polars *_range(s) function, and even more so with explode, to get the index of the element along with its value.

The current ways to get the same results as the explode_query in polars would be

df = pl.DataFrame({
    "num": [1, 2],
    "val": [["a", "c", "e"], ["b", "d", "f"]]
})

# One way - add the index before and explode together
df.with_columns(pl.int_ranges(0, pl.col("val").list.len()).alias("array_index")).explode("val", "array_index")

# Another way - add the index after the explode
# the array_index is somewhat lost - this depends on the explode returning the elements in order
df.explode("val").with_columns(pl.int_range(pl.count()).over("num").alias("array_index"))

# Both return the below
# shape: (6, 3)
# ┌─────┬─────┬─────────────┐
# │ num ┆ val ┆ array_index │
# │ --- ┆ --- ┆ ---         │
# │ i64 ┆ str ┆ i64         │
# ╞═════╪═════╪═════════════╡
# │ 1   ┆ a   ┆ 0           │
# │ 1   ┆ c   ┆ 1           │
# │ 1   ┆ e   ┆ 2           │
# │ 2   ┆ b   ┆ 0           │
# │ 2   ┆ d   ┆ 1           │
# │ 2   ┆ f   ┆ 2           │
# └─────┴─────┴─────────────┘

The only difference from Postgres is that the index is 0-based in polars, which I'm definitely not suggesting we change.

Edit: After some discussion below, I think adding enumerate and list.enumerate methods should cover what I'm after. Proposed usage/behaviour in my comment below.

~~The request I'm making is to add a parameter to the functions, which would change the return value to a struct and include an index column. Details and names can be discussed if the request is accepted, but something like~~

henryharbeck • Jan 14 '24 11:01

I had previously checked to see if Polars had an enumerate() function.

(Although in this case, I guess it would be .list.enumerate())

df.with_columns(
   pl.col("val").list.eval(
      pl.struct(
         index = pl.cum_count(),
         value = pl.element()
      )
   )
)

# shape: (2, 2)
# ┌─────┬─────────────────────────────┐
# │ num ┆ val                         │
# │ --- ┆ ---                         │
# │ i64 ┆ list[struct[2]]             │
# ╞═════╪═════════════════════════════╡
# │ 1   ┆ [{1,"a"}, {2,"c"}, {3,"e"}] │
# │ 2   ┆ [{1,"b"}, {2,"d"}, {3,"f"}] │
# └─────┴─────────────────────────────┘

Just thought I'd mention it as it could be useful as a general addition instead of special casing .explode

cmdlineluser • Jan 14 '24 13:01

@cmdlineluser - True, this is a lot like python's enumerate. Probably also a more recognisable / discoverable name than "with ordinality", haha.

Thank you very much for the code snippet and the suggestion - definitely on board with more generalised and composable, rather than special cases for only certain functions.

You inspired me to do a basic implementation!

In order to get a list of structs (rather than a struct of lists), list.eval also does the trick. That way both enumerates return a struct. Not really sold on which is better as a default (list of structs or struct of lists) though, will give it some more thought.

def enumerate(self, name: str = "index") -> pl.Expr:
    """Args: name (str, optional): Name of the index column. Defaults to "index"."""
    return pl.struct(
        pl.int_range(pl.count()).alias(name),
        self,
    ).alias(self.meta.output_name())


def list_enumerate(self, name: str = "index") -> pl.Expr:
    return pl.struct(
        pl.int_ranges(0, self.list.len()).alias(name),
        self,
    ).alias(self.meta.output_name())


pl.Expr.enumerate = enumerate
# Can't figure out how to monkey patch this onto the list namespace, but not the point
pl.Expr.list_enumerate = list_enumerate


df = pl.DataFrame({"num": [1, 2], "val": [["a", "c", "e"], ["b", "d", "f"]]})

print(
    df.select(
        "num",
        pl.col("val").enumerate().alias("plain_enumerate"),
        pl.col("val").list_enumerate().alias("list_enumerate"),
        pl.col("val").list.eval(pl.element().enumerate()).alias("eval_enumerate"),
    )
    # then to get the data into a completely flat format, do one of these
    # .unnest("list_enumerate").explode("index", "val")
    # the "val" col name is lost in the "eval_enumerate" because of `pl.element()` - will open an issue
    # .explode("eval_enumerate").unnest("eval_enumerate")
)

# shape: (2, 4)
# ┌─────┬─────────────────────┬─────────────────────────────┬─────────────────────────────┐
# │ num ┆ plain_enumerate     ┆ list_enumerate              ┆ eval_enumerate              │
# │ --- ┆ ---                 ┆ ---                         ┆ ---                         │
# │ i64 ┆ struct[2]           ┆ struct[2]                   ┆ list[struct[2]]             │
# ╞═════╪═════════════════════╪═════════════════════════════╪═════════════════════════════╡
# │ 1   ┆ {0,["a", "c", "e"]} ┆ {[0, 1, 2],["a", "c", "e"]} ┆ [{0,"a"}, {1,"c"}, {2,"e"}] │
# │ 2   ┆ {1,["b", "d", "f"]} ┆ {[0, 1, 2],["b", "d", "f"]} ┆ [{0,"b"}, {1,"d"}, {2,"f"}] │
# └─────┴─────────────────────┴─────────────────────────────┴─────────────────────────────┘

henryharbeck • Jan 14 '24 15:01

Giving what I wrote earlier some more thought:

  • like df.with_row_index and python's builtin enumerate, it would be worthwhile adding an offset parameter to start at a number other than 0
  • the plain enumerate may not really seem super useful on its own, but does offer good utility when applied inside list.eval

henryharbeck • Jan 14 '24 23:01