
Add pl.Expr.min_by/max_by

Open orlp opened this issue 1 year ago • 7 comments

It would be nice to be able to select a minimum/maximum value, using a different expression to determine the order. This is technically already possible today with either arg_max / arg_min + .get (which doesn't work with the streaming engine), or top_k_by / bottom_k_by with k = 1 (which is a bit inefficient).

An example would be house_prices.select(pl.col.bedrooms.max_by(pl.col.price)) which would give the number of bedrooms of the most expensive house.
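For concreteness, here is a minimal sketch of the two workarounds mentioned above, assuming a recent Polars version; the toy house_prices frame and its values are made up for illustration (the proposed max_by itself does not exist yet):

import polars as pl

# Toy data, purely for illustration.
house_prices = pl.DataFrame({
    "bedrooms": [4, 2, 5, 3],
    "price": [500_000, 250_000, 900_000, 400_000],
})

# Workaround 1: arg_max + get (doesn't work with the streaming engine).
house_prices.select(pl.col("bedrooms").get(pl.col("price").arg_max()))

# Workaround 2: top_k_by with k = 1 (works, but a bit inefficient).
house_prices.select(pl.col("bedrooms").top_k_by(pl.col("price"), k=1))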

orlp avatar Aug 24 '24 16:08 orlp

I get the idea, which is very nice, but I find the API rather unintuitive and hard to read / understand.

What should it mean: "select the bedrooms max_by the price"? It could mean a lot of things, and it is not clear that it means: "get me the bedrooms for the row with the max price".

Also confusing / unclear: min/max are usually "well defined" and deterministic. What happens if I have multiple rows with the same max price but different bedrooms? Will I get a random one?

Julian-J-S avatar Aug 24 '24 21:08 Julian-J-S

You can already do that with sort_by:

house_prices.select(pl.col("bedrooms").sort_by("price", descending=True).first())

oreilles avatar Aug 26 '24 07:08 oreilles

I think the issue is that sort/sort_by/top_k are "slow":

import polars as pl

n = 25_000_000

df = pl.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "bedrooms": [4, 2, 5, 3, 5],
    "price": [1, 3, 1, 3, 2]
}).sample(n=n, with_replacement=True, shuffle=True)

# arg_max + get
# Elapsed time: 0.07851 seconds
df.select(pl.all().get(pl.col.price.arg_max()))

# shape: (1, 3)
# ┌──────┬──────────┬───────┐
# │ name ┆ bedrooms ┆ price │
# │ ---  ┆ ---      ┆ ---   │
# │ str  ┆ i64      ┆ i64   │
# ╞══════╪══════════╪═══════╡
# │ D    ┆ 3        ┆ 3     │
# └──────┴──────────┴───────┘

# top_k_by with k = 1
# Elapsed time: 1.24436 seconds
df.select(pl.all().top_k_by(pl.col.price, 1))

# sort_by + first
# Elapsed time: 1.02321 seconds
df.select(pl.all().sort_by(pl.col.price, descending=True).first())

cmdlineluser avatar Aug 26 '24 08:08 cmdlineluser

df.select(pl.all().get(pl.col.price.arg_max()))

Oh, that's even better.

oreilles avatar Aug 26 '24 09:08 oreilles

df.select(pl.all().get(pl.col.price.arg_max()))

Oh, that's even better.

This method is efficient for the in-memory engine but will be less efficient for the streaming engine.

orlp avatar Aug 26 '24 12:08 orlp

df.select(pl.all().get(pl.col.price.arg_max()))

Oh, that's even better.

This method is efficient for the in-memory engine but will be less efficient for the streaming engine.

Sorry for my ignorance, but where can I see if an operation is suitable for the new streaming engine?

barak1412 avatar Aug 27 '24 18:08 barak1412

Sorry for my ignorance, but where can I see if an operation is suitable for the new streaming engine?

Nowhere, currently. But you can infer it: if an operation needs access to all the data, it tends not to work well with streaming.

orlp avatar Aug 27 '24 19:08 orlp

Just wanting to add that the min_by / max_by naming convention has been adopted by BigQuery, DuckDB, Snowflake, PySpark, and Databricks. I would be in support of the proposed naming as it follows the principle of least astonishment.
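For comparison, a hedged sketch of the same semantics using DuckDB's max_by aggregate from Python (the inline toy data is made up for illustration):

import duckdb

# bedrooms of the row with the maximum price, via max_by(arg, val)
duckdb.sql("""
    SELECT max_by(bedrooms, price) AS bedrooms
    FROM (VALUES (4, 500000), (2, 250000), (5, 900000)) AS t(bedrooms, price)
""").show()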

@Julian-J-S

min/max are usually "well defined" and deterministic. What happens if I have multiple rows with the same max price but different bedrooms? Will I get a random one?

It looks to be non-deterministic (or to use the first value found) in all of the linked pages. I suspect it will be the same here.

As an aside, arg_min / arg_max are also currently non-deterministic or use the first value found. E.g.,

(
    pl.DataFrame({"a": [20, 10, 30, 10, 30]})
    .select(
        min=pl.col("a").arg_min(),
        max=pl.col("a").arg_max(),
    )
)
# shape: (1, 2)
# ┌─────┬─────┐
# │ min ┆ max │
# │ --- ┆ --- │
# │ u32 ┆ u32 │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# └─────┴─────┘

henryharbeck avatar Jan 26 '25 08:01 henryharbeck