Allow chained renaming operations
Describe your feature request
There currently seem to be two constraints on renaming operations (prefix, suffix, map_alias):
- These must be the last operation done on an expression
- There cannot be more than one of these operations
I would ideally like to remove both of these constraints, allowing us to do operations like this:
# This expression would yield two columns: `a_pct_change_mean` and `b_pct_change_mean`
pl.col(["a", "b"]).pct_change().suffix("_pct_change").mean().suffix("_mean")
I imagine a buffer of naming operations could be kept, and .alias would effectively reset that buffer.
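To make the buffer idea concrete, here is a minimal pure-Python sketch. It is not polars code and says nothing about how polars stores expression names internally; NameBuffer and resolve are hypothetical names used only for illustration. It assumes naming operations are recorded in order and applied to the root column name at the end, with alias discarding everything recorded before it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NameBuffer:
    """Hypothetical buffer of pending naming operations (not a polars API)."""

    ops: Tuple[Tuple[str, str], ...] = ()

    def prefix(self, text: str) -> "NameBuffer":
        # Record the operation; nothing is applied until resolve().
        return NameBuffer(self.ops + (("prefix", text),))

    def suffix(self, text: str) -> "NameBuffer":
        return NameBuffer(self.ops + (("suffix", text),))

    def alias(self, name: str) -> "NameBuffer":
        # alias resets the buffer: earlier prefixes/suffixes are dropped.
        return NameBuffer((("alias", name),))

    def resolve(self, root: str) -> str:
        """Apply the buffered operations, in order, to a root column name."""
        name = root
        for kind, text in self.ops:
            if kind == "prefix":
                name = text + name
            elif kind == "suffix":
                name = name + text
            else:  # "alias" replaces the name outright
                name = text
        return name


buf = NameBuffer().suffix("_pct_change").suffix("_mean")
print([buf.resolve(c) for c in ["a", "b"]])
# ['a_pct_change_mean', 'b_pct_change_mean']
```

Because the buffer is just an ordered list, intermediate operations such as mean() or over() would not need to care where in the chain the naming calls appear.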
Motivating Example
I'm working to replicate the stockstats pandas library in polars. I have dozens of functions that generate expressions to calculate stock market indicators. Here's an example of two:
def sma(input: pl.Expr, window: int, min_periods=1) -> pl.Expr:
    """Simple moving average"""
    result = input.rolling_mean(window, min_periods=min_periods)
    return result.suffix(f"_{window}_sma")

def boll_ub(input: pl.Expr = pl.col("close"), period=20, n_std_devs: float = 2.0) -> pl.Expr:
    """
    Bollinger Upper Band
    @see https://www.investopedia.com/terms/b/bollingerbands.asp
    """
    return sma(input, period).suffix(f"_{period}_boll_ub") + moving_std(input, period, n_std_devs)
Trying to run a select operation using boll_ub leads to an error saying that .suffix must be the last operation.
Another use case is to do multiple nested simple moving averages:
# Ideally this would name the column `close_10_sma_20_sma`
expr = sma(sma(pl.col("close"), 10), 20)
# That default name is good, but could be better. Ideally end users could still use alias or map_alias to change the name
expr = expr.alias("close_double_sma")
Recall that there are dozens of operations like this, calling multiple other stock indicators internally and doing combinations of aliases and suffixes. Ideally, I could simply do a select operation like this and each column would automatically name itself:
close = pl.col("close")
stock_df = df.select([
    close, "open", "high", "low", "volume", "change", "change_over_time", "change_percent", "vwap",
    sma(pl.col(["open", "close"]), 5), sma(close, 7), sma(sma(close, 10), 10), macd(), boll(), boll_ub(),
    boll_lb(), rsi(), dx(), trix(close, 12), vwma(), chop(), cr()
])
Alternatives
Manually name each indicator
I could stop using .suffix and .map_alias operations, and instead require the column naming to be done manually, but this severely reduces readability and maintainability, especially given I expect there to be many more indicators than this.
stock_df = df.select([
    close, "open", "high", "low", "volume", "change", "change_over_time", "change_percent", "vwap",
    sma(pl.col(["open", "close"]), 5).suffix("_5_sma"), sma(close, 7).alias("close_7_sma"),
    sma(sma(close, 10), 20).alias("close_10_sma_20_sma"),
    macd().alias("close_macd"), boll().alias("close_boll_20"), boll_ub().alias("close_boll_ub_20"),
    boll_lb().alias("close_boll_lb_20"), rsi().alias("close_rsi"),
    dx().alias("close_dx"), trix(close, 12), vwma().alias("close_vwma"), chop().alias("chop"), cr().alias("cr")
])
# Note that most operations default to operating on close, which is why macd() is aliased to "close_macd"
Make the suffixes optional
The path I'm leaning towards right now is to add an add_suffix=True parameter to each of the indicators. For instance:
def sma(input: pl.Expr, window: int, min_periods=1, add_suffix=True) -> pl.Expr:
    """Simple moving average"""
    result = input.rolling_mean(window, min_periods=min_periods)
    return result.suffix(f"_{window}_sma") if add_suffix else result
This has a few problems:
- It doesn't help with nested calls such as sma(sma(close, 10), 20)
- This pollutes the codebase with tons of add_suffix=False
- It's not immediately clear when you can or should use add_suffix
Expose another API
I strongly considered creating an API like this:
def parse_stockstats_expressions(expressions: List[str]) -> List[pl.Expr]:
    pass # TODO

stock_stats = df.select(parse_stockstats_expressions(["close", "open", "close_5_sma", "close_5_sma_trix", ...]))
Given polars' current constraints, this is perhaps the cleanest API. However, I don't like the magical nature of this, and parsing is error-prone. I much prefer the more explicit function call notation from above, especially as it makes it trivial to jump to the source and documentation for each indicator.
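For a sense of what the parsing step might involve, here is a rough pure-Python sketch. It is not an implementation of parse_stockstats_expressions: it only splits a name into a column and a list of indicator steps, and it does not build any pl.Expr. The function parse_indicator_name, its parameters, and the naming grammar (a known column, then zero or more steps, each an optional integer window followed by a single-token indicator name) are all assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

Step = Tuple[str, Optional[int]]  # (indicator name, optional window)


def parse_indicator_name(
    name: str, columns: Set[str], indicators: Set[str]
) -> Tuple[str, List[Step]]:
    """Split e.g. 'close_5_sma_trix' into ('close', [('sma', 5), ('trix', None)])."""
    tokens = name.split("_")

    # Column names such as "change_over_time" contain underscores themselves,
    # so take the longest leading run of tokens that matches a known column.
    for end in range(len(tokens), 0, -1):
        column = "_".join(tokens[:end])
        if column in columns:
            break
    else:
        raise ValueError(f"no known column at the start of {name!r}")

    steps: List[Step] = []
    rest = tokens[end:]
    i = 0
    while i < len(rest):
        window = None
        if rest[i].isdigit():  # optional window before the indicator name
            window = int(rest[i])
            i += 1
        if i >= len(rest) or rest[i] not in indicators:
            raise ValueError(f"expected an indicator name in {name!r}")
        steps.append((rest[i], window))
        i += 1
    return column, steps


cols = {"close", "open", "change_over_time"}
inds = {"sma", "trix"}
print(parse_indicator_name("close_5_sma_trix", cols, inds))
# ('close', [('sma', 5), ('trix', None)])
```

Even this small version shows the fragility: it needs the full set of column and indicator names up front, and an indicator whose own name contains an underscore (boll_ub, for example) would not parse under this grammar without extra rules.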
Conclusion
Thanks for the help everyone! I'm absolutely loving polars, and would be willing to contribute if this is a path we choose to go down.
An additional area where things break: I can't do sma(close, 10).over("ticker") without getting this error:
`keep_name`, `suffix`, `prefix` should be last expression
Also, this fails without returning anything or giving an error:
pl.DataFrame({'a': [0], 'b': [1]}).select(pl.all().prefix('_mean').suffix('avg_'))
Polars: 0.16.14
Index type: UInt32
Platform: Windows-10-10.0.22621-SP0
Python: 3.7.10 (default, Feb 26 2021, 13:06:18) [MSC v.1916 64 bit (AMD64)]
---Optional dependencies---
numpy: 1.20.3
pandas: 1.2.5
pyarrow: 4.0.1
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
matplotlib: 3.1.1
xlsx2csv: <not installed>
xlsxwriter: <not installed>