
thread 'main' has overflowed its stack when using many when().then()

Open AnatolyBuga opened this issue 2 years ago • 4 comments

Polars version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of polars.

Issue Description

Hello Polars Team,

Inspired by this advice, I have logic which builds a rather long when().then() Expr from a hard-coded HashMap. Apologies for the lengthy example, but it was necessary to reproduce. Interestingly, the code works in --release mode.

Not sure if this is a bug or expected behavior. Is my HashMap just too long to chain in when().then()? If so, is there any workaround or better way to achieve what I need?

Many thanks!

Reproducible Example

use polars::prelude::*;
use std::collections::HashMap;

pub fn issue(){
    let df = df![
            "a" => ["a", "b"],
            "b" => ["c", "d"],
        ].unwrap(); 

    let x: Option<f64> = None;
    let not_yet_implemented = Series::new("null", &[x]).lit().list();
    let never_reached = Series::new("null", &[x]).lit().list();

    // Also Panics on
    // let never_reached = NULL.lit()
    // let not_yet_implemented = NULL.lit()

    let lf = df.lazy().with_column(
        when(col("a").eq("a"))
        .then(rf_rw_map(
            col("b"), weights(), not_yet_implemented)
        )
        .otherwise(never_reached)
    );
    dbg!(lf.collect());
}

pub(crate) fn weights() -> HashMap<String, Expr> {
    // NOTE: in reality it's more complicated than [;2], but this will do as an example
    weights_raw()
        .into_iter()
        .map(|(k, v)| (k.to_string(), Series::new("", &[v / 100.; 2]).lit().list()))
        .collect()
}
fn rf_rw_map(c: Expr, map: HashMap<String, Expr>, other: Expr) -> Expr {
    // Start buf with a dummy always-false branch so the remaining arms can be chained uniformly
    let mut it = map.into_iter();
    let (k, v) = it.next().unwrap(); // The map will have at least one value

    let mut buf = when(lit::<bool>(false))
        .then(lit::<f64>(0.).list())
        .when(c.clone().apply(
            move |s| Ok(s.utf8()?.contains(k.as_str())?.into_series()),
            GetOutput::from_type(DataType::Boolean),
        ))
        .then(v);

    for (k, v) in it {
        buf = buf
            .when(c.clone().apply(
                move |s| Ok(s.utf8()?.contains(k.as_str())?.into_series()),
                GetOutput::from_type(DataType::Boolean),
            ))
            .then(v);
    }
    buf.otherwise(other)
}

pub fn weights_raw() -> HashMap<&'static str, f64> {
    HashMap::from([
("1"   , 1.2f64),
("2"	, 1.2),
("3"	, 2.0),
("4"	, 2.4),
("5"	, 3.2),
("6" 	, 4.0),
("7"	, 4.8),
("8"	, 6.0),
("9"	    , 7.2),
("10"	, 9.6),
("11"	, 11.2),
("12"	, 12.8),
("13"   , 16.0),
("14"	, 20.0),
("15", 24.8),
("16"	, 30.4),
("17",	36.8),
("18",	36.8),
("19",	36.8),
("20",100.0),
("21",	100.0),
("22"	,100.0),
("23"	,1.2),
("24"	,1.2),
("25"	,2.4),
("26"	,3.2),
("27"	,4.8),
("28",6.4),
("29"	,9.6),
("30"	, 13.6),
("31"	, 17.6),
("32"	, 26.4),
("33", 37.6),
("34"	, 49.6),
("35"	, 60.0),
("36", 72.0),
("37", 84.0),
("38"	, 90.4),
("39",	100.0),
("40"	,100.0),
("41",	100.0),
("42",100.0),
("43",	100.0),
("44"	,100.0),
("45"	, 1.2),
("46"	, 1.2),
("47"	, 1.2),
("48"	, 1.2),
("49"	, 4.0),
("50", 4.0),
    ])
}

Expected Behavior

I wouldn't expect a panic.

Installed Versions

polars = {version="0.23.2", features=["lazy", "list", "strings", "ndarray", "is_in", "dtype-categorical", "csv-file", "partition_by", "concat_str"]}

AnatolyBuga avatar Sep 11 '22 15:09 AnatolyBuga

@ritchie46 this is a pretty common operation, so much so that I've built my own tooling to generate these when/then branches for exactly this purpose. Would it be difficult to add a map function similar to pandas' pandas.Series.map, without resorting to a long chain of expressions?

@AnatolyBuga I'm not super familiar with Rust, but in Python at least I've sometimes gotten around the long chain by using pl.col("name").apply(lambda x: weights[x]), where weights is a dict. It's most likely not as performant, but it's syntactically much simpler.
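A minimal sketch of that workaround (the data and weights below are made up for illustration):

import polars as pl

# Made-up stand-ins for the real data and mapping.
weights = {"1": 0.012, "2": 0.012, "3": 0.020}
df = pl.DataFrame({"name": ["1", "3", "2"]})

# apply runs a Python function per element, so it is slower than native
# expressions, but it avoids building a huge nested when/then expression tree.
out = df.with_column(
    pl.col("name").apply(lambda x: weights[x]).alias("weight")
)
print(out)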

mcrumiller avatar Sep 12 '22 13:09 mcrumiller

Let me see if I can help. For large code translations, I strongly suggest an approach based on a left join rather than generating long chained when/then/when/then ... expressions.

Python Prototype: remap_column

Below is a (working) prototype written as a Python DataFrame method. I realize that you are using Rust, but the translation to Rust should be straightforward (for the Polars steps). The prototype translates code values for a column based on a translation DataFrame.

The code should raise exceptions for input errors, but I've left those as ... for now.

Performance Note: the prototype does not use Python dictionaries (which are slow) or apply. As such, this should be quite performant.

Best of all, it can handle very large translation tables. (We'll see that below).

import polars as pl
from polars.internals.dataframe.frame import DF

def remap_column(
    self: DF,
    translation_df: DF,
) -> DF:
    '''
    Translate values in a column based on a translation table.

    Parameters
    ----------
    translation_df
        A two-column DataFrame, one column of which matches the name of the
        column that will be translated.  (The other column must not match an
        existing column name in the original DataFrame.)

    Note: it is the responsibility of the caller to ensure that dtypes are
    appropriate.
    '''
    if len(translation_df.columns) != 2:
        ...

    col_nm = set(self.columns) & set(translation_df.columns)

    if len(col_nm) != 1:
        ...

    col_nm = col_nm.pop()
    col_nm_prior = col_nm + "_prior"

    col_nm_new_values = (set(translation_df.columns) - set(self.columns)).pop()

    original_col_order = self.columns

    result = (
        self.rename({col_nm: col_nm_prior})
        .join(
            other=translation_df.rename(
                {col_nm: col_nm_prior, col_nm_new_values: col_nm}
            ),
            how="left",
            left_on=col_nm_prior,
            right_on=col_nm_prior,
        )
        .with_column(pl.col(col_nm).fill_null(pl.col(col_nm_prior)))
        .select(original_col_order)
    )

    return self._from_pydf(result._df)


pl.DataFrame.remap_column = remap_column

DataFrame Example

Let's take a look at how this works with a DataFrame. Let's say we have this DataFrame, composed of the lowercase and uppercase letters.

import string
df = pl.DataFrame(
    {
        "col1": string.ascii_letters,
        "col2": string.ascii_letters,
    }
)
df
shape: (52, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ a    ┆ a    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b    ┆ b    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c    ┆ c    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d    ┆ d    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...  ┆ ...  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ W    ┆ W    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ X    ┆ X    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Y    ┆ Y    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Z    ┆ Z    │
└──────┴──────┘

And let's create a translation table intended to translate a lowercase code in col1 to uppercase.

translation_df = pl.DataFrame(
    {
        "col1": string.ascii_lowercase,
        "new_val": string.ascii_uppercase,
    }
)
translation_df
shape: (26, 2)
┌──────┬─────────┐
│ col1 ┆ new_val │
│ ---  ┆ ---     │
│ str  ┆ str     │
╞══════╪═════════╡
│ a    ┆ A       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ b    ┆ B       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ c    ┆ C       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ d    ┆ D       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ ...  ┆ ...     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ w    ┆ W       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ x    ┆ X       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ y    ┆ Y       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ z    ┆ Z       │
└──────┴─────────┘

The remapping/translation step would be quite simple:

(
    df
    .remap_column(translation_df)
)
shape: (52, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ A    ┆ a    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B    ┆ b    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ C    ┆ c    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ D    ┆ d    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...  ┆ ...  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ W    ┆ W    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ X    ┆ X    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Y    ┆ Y    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Z    ┆ Z    │
└──────┴──────┘

Series Example

To use this with a Series, we simply use polars.select and to_series. For example, with this series...

s = pl.Series(
    name='col1',
    values=string.ascii_letters,
)
s
shape: (52,)
Series: 'col1' [str]
[
        "a"
        "b"
        "c"
        "d"
        "e"
        "f"
        "g"
        "h"
        "i"
        "j"
        "k"
        "l"
        ...
        "O"
        "P"
        "Q"
        "R"
        "S"
        "T"
        "U"
        "V"
        "W"
        "X"
        "Y"
        "Z"
]

We can use the method as follows:

(
    pl.select(s)
    .remap_column(translation_df)
    .to_series()
)
shape: (52,)
Series: 'col1' [str]
[
        "A"
        "B"
        "C"
        "D"
        "E"
        "F"
        "G"
        "H"
        "I"
        "J"
        "K"
        "L"
        ...
        "O"
        "P"
        "Q"
        "R"
        "S"
        "T"
        "U"
        "V"
        "W"
        "X"
        "Y"
        "Z"
]

Performance

So, how well does this perform? Let's use a Series of 100 million integers:

nbr_rows = 100_000_000
s = pl.Series(
    name='col1',
    values=pl.arange(0, nbr_rows, eager=True)
).shuffle(0)
s
shape: (100000000,)
Series: 'col1' [i64]
[
        33448568
        12202271
        72671304
        11875034
        5118738
        91208323
        64245363
        48891782
        38730800
        30936779
        92306699
        59827933
        ...
        23209559
        22510749
        92157210
        97638183
        81885088
        39514414
        45637194
        58814748
        94294980
        89707900
        97988024
        43914027
]

Now let's create a translation DataFrame that is also 100 million records. (The translation will simply add one to each number.)

translation_df = pl.DataFrame(
    {
        "col1": pl.arange(0, nbr_rows, eager=True),
        "new_value": pl.arange(1, nbr_rows + 1, eager=True),
    }
)
translation_df
shape: (100000000, 2)
┌──────────┬───────────┐
│ col1     ┆ new_value │
│ ---      ┆ ---       │
│ i64      ┆ i64       │
╞══════════╪═══════════╡
│ 0        ┆ 1         │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1        ┆ 2         │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2        ┆ 3         │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3        ┆ 4         │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ...      ┆ ...       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 99999996 ┆ 99999997  │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 99999997 ┆ 99999998  │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 99999998 ┆ 99999999  │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 99999999 ┆ 100000000 │
└──────────┴───────────┘

Now, let's time this:

import time
start = time.perf_counter()
(
    pl.select(s)
    .remap_column(translation_df)
    .to_series()
)
print(time.perf_counter() - start)
shape: (100000000,)
Series: 'col1' [i64]
[
        33448569
        12202272
        72671305
        11875035
        5118739
        91208324
        64245364
        48891783
        38730801
        30936780
        92306700
        59827934
        ...
        23209560
        22510750
        92157211
        97638184
        81885089
        39514415
        45637195
        58814749
        94294981
        89707901
        97988025
        43914028
]
>>> print(time.perf_counter() - start)
6.34915928099872

About 6 seconds of wall-clock time (on my 32-core Threadripper Pro). Obviously, the timing will differ across computing platforms, but the above algorithm should perform quite well in general.

Would an approach like this help?

cbilot avatar Sep 12 '22 17:09 cbilot

I should note: the above code replaces the column with new values, but the same approach using a left join should work for mapping values into new columns.
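For instance (a sketch reusing the df and translation_df from the DataFrame example above):

# A plain left join keeps col1 intact and adds new_val as a new column;
# rows without a match in the translation table get null.
df.join(translation_df, on="col1", how="left")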

cbilot avatar Sep 12 '22 17:09 cbilot

> I should note: the above code replaces the column with new values, but the same approach using a left join should work for mapping values into new columns.

Thank you @cbilot, I actually thought about using a join instead of a nested when().then(), however there were two issues which held me back:

  1. Notice when(col("a").eq("a")) in my example. In reality I have multiple levels and more columns involved, for example when(col("a").eq("B").and(col("c").eq("C"))). While this problem can be addressed with a left join on multiple columns, a left join naturally only checks for equality, i.e. if my statement contained .gt() or .neq() I wouldn't be able to get an equivalent result. And unfortunately I do have some .neq() occurrences, though maybe I could work around those.
  2. Notice .contains(). I use regex often.

Hence two follow-up questions:

  • Can a left join be performed by regex (in the right DF)?
  • Can I combine when().then() with a join operation in this context? I.e. could I assign weights partly based on non-trivial when().then() and partly (where I have long maps) via a join? Thinking about it, it might be possible: I could first do the non-trivial when().then(), then join, and then fill_null (i.e. wherever when().then() didn't apply, I'd use the weight obtained from the join), as in the sketch below. I'll give that a shot, but to be honest the first option is preferred.
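A rough sketch of that combination (in Python, mirroring the prototype above; the data, condition, and weights are made up):

import polars as pl

df = pl.DataFrame({"a": ["a", "b"], "b": ["1", "2"]})

# Made-up translation table standing in for the long weights map.
translation_df = pl.DataFrame({"b": ["1", "2"], "joined_weight": [0.012, 0.024]})

result = (
    df
    # Step 1: the non-trivial conditions stay in a (short) when/then;
    # rows it does not cover are left null.
    .with_column(
        pl.when(pl.col("a") != "a")
        .then(99.9)
        .otherwise(pl.lit(None))
        .alias("weight")
    )
    # Step 2: a left join brings in the weights from the long map.
    .join(translation_df, on="b", how="left")
    # Step 3: wherever when/then left a null, take the joined weight.
    .with_column(pl.col("weight").fill_null(pl.col("joined_weight")))
    .drop("joined_weight")
)
print(result)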

AnatolyBuga avatar Sep 12 '22 19:09 AnatolyBuga

I don't think this is something we can fix. Very long when/then expressions lead to a lot of work stealing and might overflow the stack. I will close this, as I don't think it should be worked on from our side.

ritchie46 avatar Oct 21 '22 18:10 ritchie46