polars
Dict/Hashmap lookup expression
Describe your feature request
Let's say I have a dataset like this:
import polars as pl
grades = pl.DataFrame(
    {
        "student": ["bas", "laura", "tim", "jenny", "bas", "laura", "tim", "jenny"],
        "class": ["MAT-150", "MAT-150", "MAT-210", "MAT-600", "COM-200", "COM-205", "COM-430", "COM-200"],
        "test_score": [10, 5, 6, 8, 7, 6, 10, 5],
        "test_max": [10, 10, 12, 10, 12, 10, 15, 12],
    }
)
And that I want to map each class to its respective subject matter, so I can compare grades per subject instead of per class:
class_subject = {
    "MAT-150": "Mathematics",
    "MAT-210": "Mathematics",
    "MAT-430": "Mathematics",
    "COM-200": "Programming",
    "COM-205": "Programming",
    "COM-600": "Programming",
}
With Pandas, I can use Series.map, which looks up each value of the initial column as a key in a Python dictionary and returns a new series containing the corresponding values.
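As a minimal sketch of that Pandas behaviour (the shortened `classes` series and dictionary here are illustrative, not the full dataset above):

```python
import pandas as pd

classes = pd.Series(["MAT-150", "MAT-600", "COM-200"])
class_subject = {"MAT-150": "Mathematics", "COM-200": "Programming"}

# Each element is looked up as a key in the dict; elements without a
# matching key ("MAT-600" here) come back as missing values (NaN).
subjects = classes.map(class_subject)

# Second pass to drop the entries that had no match.
known_subjects = subjects.dropna()
```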
Using Polars, that's doable, but a fair amount more involved, because I need to cast both columns to Categorical and perform a join within the same context manager:
with pl.StringCache():
    class_subject_df = pl.from_records(list(class_subject.items()), columns=['class_code', 'class_subject'], orient='row')
    class_subject_df = class_subject_df.with_column(pl.col('class_code').cast(pl.Categorical))
    grades = grades.with_column(pl.col("class").cast(pl.Categorical))
    grades = grades.join(class_subject_df, left_on='class', right_on='class_code')
A new expression method, maybe something like Expr.lookup(map: dict[str | int, ...]), would make this sort of operation doable in a single step. An extra argument, like lookup(map, on_missing: Literal['omit','null','error']), could also be useful to specify the behavior when the hashmap does not contain a given key. Pandas instead relies on the user passing a defaultdict, or running a second pass to filter out the NaNs that were inserted for missing entries.
If this is restricted to dicts and not lambda functions, it should be possible to copy the dict into a Rust HashMap and perform the operation without needing Python-owned resources.
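The three proposed on_missing behaviours can be sketched in plain Python. The `lookup` helper below is hypothetical: it follows the signature suggested above but operates on a list, not on a Polars column, and is only meant to pin down the intended semantics.

```python
from typing import Literal

# Sentinel used to tell "key absent" apart from "key mapped to None".
_MISSING = object()


def lookup(values, mapping, on_missing: Literal["omit", "null", "error"] = "null"):
    """Hypothetical helper: map each value through `mapping`.

    "null"  -> absent keys become None
    "omit"  -> absent keys are dropped from the result
    "error" -> the first absent key raises KeyError
    """
    if on_missing not in ("omit", "null", "error"):
        raise ValueError(f"unknown on_missing mode: {on_missing!r}")
    result = []
    for value in values:
        mapped = mapping.get(value, _MISSING)
        if mapped is _MISSING:
            if on_missing == "error":
                raise KeyError(value)
            if on_missing == "omit":
                continue
            mapped = None
        result.append(mapped)
    return result


class_subject = {"MAT-150": "Mathematics", "COM-200": "Programming"}
classes = ["MAT-150", "MAT-600", "COM-200"]

null_result = lookup(classes, class_subject, on_missing="null")
omit_result = lookup(classes, class_subject, on_missing="omit")
```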
We have this functionality. This is a join. I don't see much benefit of adding more code we must maintain, binary bloat etc, for something that maps column a to column b.
Join is unwieldy for this operation, since it can't be expressed in-line in a select/with_column.
It's possible to perform this as an expression, but since ~~Expr.map() materializes all the selected columns and~~ Series.apply() invokes a user-defined Python function, this is liable to perform poorly on larger datasets:
grades.with_column(pl.col("class").map(lambda series: series.apply(lambda x: class_subject.get(x))))
EDIT: It turns out that you can use Expr.apply() directly, the doc just makes it seem like apply() should be reserved for groupby contexts.
csv_data = grades.with_column(pl.col("class").apply(class_subject.get))
Yeap, apply is elementwise (in the select context).
Yeap, apply is elementwise (in the select context). What???
This section from the User Guide might help:
apply works on the smallest logical elements for that operation. That is:
select context -> single elements
groupby context -> single groups
@sm-Fifteen This is now supported:
In [6]: grades.with_columns(pl.col("class").map_dict(class_subject, default="No Known Class").alias("class_code"))
Out[6]:
shape: (8, 5)
┌─────────┬─────────┬────────────┬──────────┬────────────────┐
│ student ┆ class   ┆ test_score ┆ test_max ┆ class_code     │
│ ---     ┆ ---     ┆ ---        ┆ ---      ┆ ---            │
│ str     ┆ str     ┆ i64        ┆ i64      ┆ str            │
╞═════════╪═════════╪════════════╪══════════╪════════════════╡
│ bas     ┆ MAT-150 ┆ 10         ┆ 10       ┆ Mathematics    │
│ laura   ┆ MAT-150 ┆ 5          ┆ 10       ┆ Mathematics    │
│ tim     ┆ MAT-210 ┆ 6          ┆ 12       ┆ Mathematics    │
│ jenny   ┆ MAT-600 ┆ 8          ┆ 10       ┆ No Known Class │
│ bas     ┆ COM-200 ┆ 7          ┆ 12       ┆ Programming    │
│ laura   ┆ COM-205 ┆ 6          ┆ 10       ┆ Programming    │
│ tim     ┆ COM-430 ┆ 10         ┆ 15       ┆ No Known Class │
│ jenny   ┆ COM-200 ┆ 5          ┆ 12       ┆ Programming    │
└─────────┴─────────┴────────────┴──────────┴────────────────┘
Closed by https://github.com/pola-rs/polars/pull/5899.
@ghuls : Oh, wow, thanks, that's great! I'd actually given up on this, but for my use cases, it's a huge improvement in ergonomics and readability.
Since I found this feature via this thread, I'd like to mention that from 0.19.16 onward this method is called "replace".