
Dict/Hashmap lookup expression

Open sm-Fifteen opened this issue 2 years ago • 5 comments

Describe your feature request

Let's say I have a dataset like this:

import polars as pl

grades = pl.DataFrame(
    {
        "student": ["bas", "laura", "tim", "jenny", "bas", "laura", "tim", "jenny"],
        "class": ["MAT-150", "MAT-150", "MAT-210", "MAT-600", "COM-200", "COM-205", "COM-430", "COM-200"],
        "test_score": [10, 5, 6, 8, 7, 6, 10, 5],
        "test_max": [10, 10, 12, 10, 12, 10, 15, 12],
    }
)

And that I want to map each class to its subject matter, so I can compare grades per subject instead of per class:

class_subject = {
    "MAT-150": "Mathematics",
    "MAT-210": "Mathematics",
    "MAT-430": "Mathematics",

    "COM-200": "Programming",
    "COM-205": "Programming",
    "COM-600": "Programming",
}

With Pandas, I can use Series.map to create a new series: each value in the original column is looked up as a key in a Python dictionary, and the new series contains the corresponding dictionary value.

Using Polars, that's doable, but a fair amount more involved, because I need to cast both columns as Categorical and perform a join within the same context manager:

with pl.StringCache():
    class_subject_df = pl.from_records(list(class_subject.items()), columns=['class_code', 'class_subject'], orient='row')
    class_subject_df = class_subject_df.with_column(pl.col('class_code').cast(pl.Categorical))

    grades = grades.with_column(pl.col("class").cast(pl.Categorical))
    grades = grades.join(class_subject_df, left_on='class', right_on='class_code')

A new expression method, maybe something like Expr.lookup(map: dict[str | int, ...]), would make this sort of operation doable in a single step. An extra argument, like lookup(map, on_missing: Literal['omit','null','error']), could also be useful to specify the behavior when the hashmap does not contain a given key. Pandas instead relies on the use of defaultdict and the user running a second pass to filter out the NaNs that were inserted for missing entries.

If this is restricted to dicts and not lambda functions, it should be possible to copy the dict into a Rust HashMap and perform the operation without needing Python-owned resources.
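To make the three proposed on_missing modes concrete, here is a minimal pure-Python sketch. The lookup function below is hypothetical and only illustrates the semantics suggested above; it is not an existing Polars API, and a real implementation would presumably work on columns rather than Python lists.

```python
from typing import Hashable, Iterable, List, Mapping, Optional


def lookup(
    values: Iterable[Hashable],
    mapping: Mapping[Hashable, str],
    on_missing: str = "null",
) -> List[Optional[str]]:
    """Hypothetical helper: map each value through `mapping`.

    on_missing controls what happens when a value is not a key:
      "null"  -> emit None for that value
      "omit"  -> drop that value from the output
      "error" -> raise KeyError
    """
    if on_missing not in ("omit", "null", "error"):
        raise ValueError(f"unknown on_missing mode: {on_missing!r}")

    result: List[Optional[str]] = []
    for value in values:
        if value in mapping:
            result.append(mapping[value])
        elif on_missing == "null":
            result.append(None)
        elif on_missing == "error":
            raise KeyError(value)
        # on_missing == "omit": skip the value entirely
    return result


class_subject = {"MAT-150": "Mathematics", "COM-200": "Programming"}
classes = ["MAT-150", "MAT-600", "COM-200"]

as_null = lookup(classes, class_subject, on_missing="null")
omitted = lookup(classes, class_subject, on_missing="omit")
```

Note that "omit" changes the length of the output, so in a DataFrame setting it would behave like a filter rather than a plain column transformation.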

sm-Fifteen Jun 23 '22 16:06

We have this functionality. This is a join. I don't see much benefit of adding more code we must maintain, binary bloat etc, for something that maps column a to column b.

ritchie46 Jun 23 '22 18:06

> We have this functionality. This is a join. I don't see much benefit of adding more code we must maintain, binary bloat etc, for something that maps column a to column b.

Join is unwieldy for this operation, since it can't be expressed in-line in a select/with_column.

It's possible to perform this as an expression, but since ~~Expr.map() materializes all the selected columns and~~ Series.apply() invokes a user-defined Python function, this is liable to perform poorly on larger datasets:

grades.with_column(pl.col("class").map(lambda series: series.apply(lambda x: class_subject.get(x))))

EDIT: It turns out that you can use Expr.apply() directly, the doc just makes it seem like apply() should be reserved for groupby contexts.

csv_data = grades.with_column(pl.col("class").apply(class_subject.get))

sm-Fifteen Jun 23 '22 19:06

> EDIT: It turns out that you can use Expr.apply() directly, the doc just makes it seem like apply() should be reserved for groupby contexts.

Yeap, apply is elementwise (in the select context).

ritchie46 Jun 23 '22 20:06

> Yeap, apply is elementwise (in the select context).

What???

Arengard Jun 28 '22 13:06

This section from the User Guide might help:

> apply works on the smallest logical elements for that operation. That is:
>
> select context -> single elements
> groupby context -> single groups

cbilot Jun 28 '22 13:06

@sm-Fifteen This is now supported:

In [6]: grades.with_columns(pl.col("class").map_dict(class_subject, default="No Known Class").alias("class_code"))
Out[6]: 
shape: (8, 5)
┌─────────┬─────────┬────────────┬──────────┬────────────────┐
│ student ┆ class   ┆ test_score ┆ test_max ┆ class_code     │
│ ---     ┆ ---     ┆ ---        ┆ ---      ┆ ---            │
│ str     ┆ str     ┆ i64        ┆ i64      ┆ str            │
╞═════════╪═════════╪════════════╪══════════╪════════════════╡
│ bas     ┆ MAT-150 ┆ 10         ┆ 10       ┆ Mathematics    │
│ laura   ┆ MAT-150 ┆ 5          ┆ 10       ┆ Mathematics    │
│ tim     ┆ MAT-210 ┆ 6          ┆ 12       ┆ Mathematics    │
│ jenny   ┆ MAT-600 ┆ 8          ┆ 10       ┆ No Known Class │
│ bas     ┆ COM-200 ┆ 7          ┆ 12       ┆ Programming    │
│ laura   ┆ COM-205 ┆ 6          ┆ 10       ┆ Programming    │
│ tim     ┆ COM-430 ┆ 10         ┆ 15       ┆ No Known Class │
│ jenny   ┆ COM-200 ┆ 5          ┆ 12       ┆ Programming    │
└─────────┴─────────┴────────────┴──────────┴────────────────┘

Closed by https://github.com/pola-rs/polars/pull/5899.

ghuls Feb 10 '23 14:02

@ghuls: Oh, wow, thanks, that's great! I'd actually given up on this, but for my use cases, it's a huge improvement in ergonomics and readability.

sm-Fifteen Feb 10 '23 14:02

Since I found this feature via this thread, I'd like to mention that from 0.19.16 on, this method is called "replace".

sezanzeb Dec 07 '23 17:12