thread 'main' has overflowed its stack when using many when().then()
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of polars.
Issue Description
Hello Polars Team,
Inspired by this advice, I have logic which builds a rather long when().then() Expr from a hard-coded HashMap. Apologies for the lengthy example, but it was necessary to reproduce. Interestingly, the code works in --release mode.
Not sure if this is a bug or expected behaviour. Is my HashMap just too long to chain in when().then()? If so, is there any workaround/better way to achieve what I need?
Many thanks!
Reproducible Example
use polars::prelude::*;
use std::collections::HashMap;

pub fn issue() {
    let df = df![
        "a" => ["a", "b"],
        "b" => ["c", "d"],
    ]
    .unwrap();

    let x: Option<f64> = None;
    let not_yet_implemented = Series::new("null", &[x]).lit().list();
    let never_reached = Series::new("null", &[x]).lit().list();
    // Also panics on
    // let never_reached = NULL.lit()
    // let not_yet_implemented = NULL.lit()

    let lf = df.lazy().with_column(
        when(col("a").eq("a"))
            .then(rf_rw_map(col("b"), weights(), not_yet_implemented))
            .otherwise(never_reached),
    );
    dbg!(lf.collect());
}
pub(crate) fn weights() -> HashMap<String, Expr> {
    // NOTE: in reality it's more complicated than [;2], but this will do as an example
    weights_raw()
        .into_iter()
        .map(|(k, v)| (k.to_string(), Series::new("", &[v / 100.; 2]).lit().list()))
        .collect()
}
fn rf_rw_map(c: Expr, map: HashMap<String, Expr>, other: Expr) -> Expr {
    // buf is a placeholder
    let mut it = map.into_iter();
    let (k, v) = it.next().unwrap(); // The map will have at least one value
    let mut buf = when(lit::<bool>(false))
        .then(lit::<f64>(0.).list())
        .when(c.clone().apply(
            move |s| Ok(s.utf8()?.contains(k.as_str())?.into_series()),
            GetOutput::from_type(DataType::Boolean),
        ))
        .then(v);
    for (k, v) in it {
        buf = buf
            .when(c.clone().apply(
                move |s| Ok(s.utf8()?.contains(k.as_str())?.into_series()),
                GetOutput::from_type(DataType::Boolean),
            ))
            .then(v);
    }
    buf.otherwise(other)
}
/// 22.34
pub fn weights_raw() -> HashMap<&'static str, f64> {
    HashMap::from([
        ("1", 1.2f64),
        ("2", 1.2),
        ("3", 2.0),
        ("4", 2.4),
        ("5", 3.2),
        ("6", 4.0),
        ("7", 4.8),
        ("8", 6.0),
        ("9", 7.2),
        ("10", 9.6),
        ("11", 11.2),
        ("12", 12.8),
        ("13", 16.0),
        ("14", 20.0),
        ("15", 24.8),
        ("16", 30.4),
        ("17", 36.8),
        ("18", 36.8),
        ("19", 36.8),
        ("20", 100.0),
        ("21", 100.0),
        ("22", 100.0),
        ("23", 1.2),
        ("24", 1.2),
        ("25", 2.4),
        ("26", 3.2),
        ("27", 4.8),
        ("28", 6.4),
        ("29", 9.6),
        ("30", 13.6),
        ("31", 17.6),
        ("32", 26.4),
        ("33", 37.6),
        ("34", 49.6),
        ("35", 60.0),
        ("36", 72.0),
        ("37", 84.0),
        ("38", 90.4),
        ("39", 100.0),
        ("40", 100.0),
        ("41", 100.0),
        ("42", 100.0),
        ("43", 100.0),
        ("44", 100.0),
        ("45", 1.2),
        ("46", 1.2),
        ("47", 1.2),
        ("48", 1.2),
        ("49", 4.0),
        ("50", 4.0),
    ])
}
Expected Behavior
I wouldn't expect a panic.
Installed Versions
@ritchie46 this is a pretty common operation, so much so that I've built tools to generate these when/then branches myself specifically for this operation. Would it be difficult to build a map function similar to pandas' pandas.Series.map, without resorting to a long chain of expressions?
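(For reference, a small illustration of the pandas behaviour being referred to; the data below is made up:)

import pandas as pd

# pandas.Series.map with a dict looks each value up in the dict;
# keys missing from the dict become NaN
s = pd.Series(["1", "2", "99"])
print(s.map({"1": 1.2, "2": 1.2, "3": 2.0}))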
@AnatolyBug I'm not super familiar with Rust, but in Python at least I've sometimes gotten around the long chain by using pl.col("name").apply(lambda x: weights[x]), where weights is a dict. It's most likely not as performant, but it's syntactically much simpler.
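A minimal sketch of that dict-based workaround (the column name and weights below are made up for illustration):

import polars as pl

weights = {"a": 1.2, "b": 2.4}
df = pl.DataFrame({"name": ["a", "b", "a"]})

out = df.with_column(
    # dict.get() returns None (null) for keys missing from the dict instead of raising
    pl.col("name").apply(lambda x: weights.get(x)).alias("weight")
)
print(out)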
Let me see if I can help. With code translations that are large, I strongly suggest using an approach that uses a left join rather than generating long when/then/when/then ... chained expressions.
Python Prototype: remap_column
Below is a (working) prototype written as a Python DataFrame method. I realize that you are using Rust, but the translation to Rust should be straightforward (for the Polars steps). The prototype translates code values for a column based on a translation DataFrame.
The code should raise exceptions for input errors, but I've left those as ... for now.
Performance Note: the prototype does not use Python dictionaries (which are slow) or apply. As such, this should be quite performant.
Best of all, it can handle very large translation tables. (We'll see that below).
import polars as pl
from polars.internals.dataframe.frame import DF

def remap_column(
    self: DF,
    translation_df: DF,
) -> DF:
    '''
    Translate values in a column based on a translation table.

    Parameters
    ----------
    translation_df
        A two-column DataFrame, one column of which matches the name of the
        column that will be translated. (The other column must not match an
        existing column name in the original DataFrame.)

        Note: it is the responsibility of the caller to ensure that dtypes are
        appropriate.
    '''
    if len(translation_df.columns) != 2:
        ...
    col_nm = set(self.columns) & set(translation_df.columns)
    if len(col_nm) != 1:
        ...
    col_nm = col_nm.pop()
    col_nm_prior = col_nm + "_prior"
    col_nm_new_values = (set(translation_df.columns) - set(self.columns)).pop()
    original_col_order = self.columns
    result = (
        self.rename({col_nm: col_nm_prior})
        .join(
            other=translation_df.rename(
                {col_nm: col_nm_prior, col_nm_new_values: col_nm}
            ),
            how="left",
            left_on=col_nm_prior,
            right_on=col_nm_prior,
        )
        .with_column(pl.col(col_nm).fill_null(pl.col(col_nm_prior)))
        .select(original_col_order)
    )
    return self._from_pydf(result._df)

pl.DataFrame.remap_column = remap_column
DataFrame Example
Let's take a look at how this works with a DataFrame. Let's say we have this DataFrame, composed of the lowercase and uppercase letters.
import string
df = pl.DataFrame(
{
"col1": string.ascii_letters,
"col2": string.ascii_letters,
}
)
df
shape: (52, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════╡
│ a ┆ a │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ b │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ c │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d ┆ d │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ W ┆ W │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ X ┆ X │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Y ┆ Y │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Z ┆ Z │
└──────┴──────┘
And let's create a translation table intended to translate a lowercase code in col1 to uppercase.
translation_df = pl.DataFrame(
{
"col1": string.ascii_lowercase,
"new_val": string.ascii_uppercase,
}
)
translation_df
shape: (26, 2)
┌──────┬─────────┐
│ col1 ┆ new_val │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪═════════╡
│ a ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ b ┆ B │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ c ┆ C │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ d ┆ D │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ w ┆ W │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ x ┆ X │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ y ┆ Y │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ z ┆ Z │
└──────┴─────────┘
The remapping/translation step would be quite simple:
(
df
.remap_column(translation_df)
)
shape: (52, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════╡
│ A ┆ a │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B ┆ b │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ C ┆ c │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ D ┆ d │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ W ┆ W │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ X ┆ X │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Y ┆ Y │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ Z ┆ Z │
└──────┴──────┘
Series Example
To use this with a Series, we simply use polars.select and to_series. For example, with this Series...
s = pl.Series(
name='col1',
values=string.ascii_letters,
)
s
shape: (52,)
Series: 'col1' [str]
[
"a"
"b"
"c"
"d"
"e"
"f"
"g"
"h"
"i"
"j"
"k"
"l"
...
"O"
"P"
"Q"
"R"
"S"
"T"
"U"
"V"
"W"
"X"
"Y"
"Z"
]
We can use the method as follows:
(
pl.select(s)
.remap_column(translation_df)
.to_series()
)
shape: (52,)
Series: 'col1' [str]
[
"A"
"B"
"C"
"D"
"E"
"F"
"G"
"H"
"I"
"J"
"K"
"L"
...
"O"
"P"
"Q"
"R"
"S"
"T"
"U"
"V"
"W"
"X"
"Y"
"Z"
]
Performance
So, how well does this perform? Let's use a Series of 100 million integers:
nbr_rows = 100_000_000
s = pl.Series(
name='col1',
values=pl.arange(0, nbr_rows, eager=True)
).shuffle(0)
s
shape: (100000000,)
Series: 'col1' [i64]
[
33448568
12202271
72671304
11875034
5118738
91208323
64245363
48891782
38730800
30936779
92306699
59827933
...
23209559
22510749
92157210
97638183
81885088
39514414
45637194
58814748
94294980
89707900
97988024
43914027
]
Now let's create a translation DataFrame that is also 100 million records. (The translation will simply add one to each number.)
translation_df = pl.DataFrame(
{
"col1": pl.arange(0, nbr_rows, eager=True),
"new_value": pl.arange(1, nbr_rows + 1, eager=True),
}
)
translation_df
shape: (100000000, 2)
┌──────────┬───────────┐
│ col1 ┆ new_value │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════════╪═══════════╡
│ 0 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 99999996 ┆ 99999997 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 99999997 ┆ 99999998 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 99999998 ┆ 99999999 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 99999999 ┆ 100000000 │
└──────────┴───────────┘
Now, let's time this:
import time
start = time.perf_counter()
(
pl.select(s)
.remap_column(translation_df)
.to_series()
)
print(time.perf_counter() - start)
shape: (100000000,)
Series: 'col1' [i64]
[
33448569
12202272
72671305
11875035
5118739
91208324
64245364
48891783
38730801
30936780
92306700
59827934
...
23209560
22510750
92157211
97638184
81885089
39514415
45637195
58814749
94294981
89707901
97988025
43914028
]
>>> print(time.perf_counter() - start)
6.34915928099872
Six seconds of wall-clock time (on my 32-core Threadripper Pro). Obviously, the timing will differ across computing platforms, but the above algorithm should perform quite well in general.
Would an approach like this help?
I should note: the above code replaces the column with new values, but the same approach using a left join should work for mapping values for new columns.
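For instance, a minimal sketch of that variant (the column names here are illustrative): keep the original column and attach the mapped values as a new column, letting unmatched rows stay null:

import polars as pl

df = pl.DataFrame({"col1": ["a", "b", "z"]})
translation_df = pl.DataFrame({"col1": ["a", "b"], "new_val": ["A", "B"]})

# A left join keeps every row of df; rows without a match get null in "new_val"
out = df.join(translation_df, on="col1", how="left")
print(out)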
Thank you @cbilot, I actually thought about using join instead of a nested when().then(), however there were two issues which held me back:
- Notice when(col("a").eq("a")) in my example. In reality I have multiple levels and more columns involved, for example when(col("a").eq("B").and(col("c").eq("C"))). While this can be addressed with a left join on multiple columns, a left join naturally only checks for equality, i.e. if my statement contained .gt() or .neq() I wouldn't be able to get an equivalent result. And I unfortunately do have some .neq() occurrences, but maybe I could work around that.
- Notice .contains(). I use regex often.
Hence two follow-up questions:
- Can a left join be performed by regex (in the right DF)?
- Can I combine when().then() with a join operation in this context? I.e. could I partly assign weights based on the non-trivial when().then() and partly (where I have long maps) use a join? Thinking about it, it might be possible: I could first do the non-trivial when().then(), then join, and then fill_null (i.e. where when().then() was not used, I'd take the weight obtained from the join), roughly as in the sketch below. I will give that a shot, but the first option is preferred, to be honest.
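A rough Python sketch of that combined idea (the column names and weights here are made up; a Rust version would follow the same steps):

import polars as pl

df = pl.DataFrame({"a": ["a", "b", "b"], "b": ["1", "2", "3"]})
weights_df = pl.DataFrame({"b": ["1", "2", "3"], "mapped": [1.2, 2.4, 3.2]})

out = (
    df
    # non-trivial conditions handled explicitly; all other rows stay null
    .with_column(
        pl.when(pl.col("a") == "a").then(0.0).otherwise(pl.lit(None)).alias("weight")
    )
    # bring in the long map via a left join
    .join(weights_df, on="b", how="left")
    # wherever when/then did not assign a weight, fall back to the joined value
    .with_column(pl.col("weight").fill_null(pl.col("mapped")))
    .drop("mapped")
)
print(out)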
I don't think this is something we can fix. Very long when/then expression chains lead to a lot of work stealing and might overflow the stack. I will close this, as I don't think this is something that should be worked on from our side.