Allow chained renaming operations

Open vultix opened this issue 2 years ago • 1 comments

Describe your feature request

There currently seem to be two constraints on renaming operations (prefix, suffix, map_alias):

  • These must be the last operation done on an expression
  • There cannot be more than one of these operations

I would ideally like to remove both of these constraints, allowing us to do operations like this:

# This expression would yield two columns: `a_pct_change_mean` and `b_pct_change_mean`
pl.col(["a", "b"]).pct_change().suffix("_pct_change").mean().suffix("_mean")

I imagine a buffer of naming operations could be kept, and .alias would effectively reset that buffer.
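To make the buffer idea concrete, here is a minimal pure-Python sketch. It does not touch polars internals: `SketchExpr`, its `steps` and `renames` fields, and `output_name` are hypothetical names invented for illustration, and the computation steps are only recorded, not executed. It just shows how naming operations could accumulate in call order while `.alias` resets them.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Rename = Callable[[str], str]


@dataclass(frozen=True)
class SketchExpr:
    """Hypothetical stand-in for an expression; NOT part of the polars API."""

    root: str                         # name of the input column
    steps: Tuple[str, ...] = ()       # computation steps, recorded only for illustration
    renames: Tuple[Rename, ...] = ()  # buffered naming operations, in call order

    def _compute(self, step: str) -> "SketchExpr":
        # A computation leaves the naming buffer untouched.
        return SketchExpr(self.root, self.steps + (step,), self.renames)

    def _rename(self, fn: Rename) -> "SketchExpr":
        # A naming operation is appended to the buffer rather than having to come last.
        return SketchExpr(self.root, self.steps, self.renames + (fn,))

    def pct_change(self) -> "SketchExpr":
        return self._compute("pct_change")

    def mean(self) -> "SketchExpr":
        return self._compute("mean")

    def suffix(self, s: str) -> "SketchExpr":
        return self._rename(lambda name: name + s)

    def prefix(self, p: str) -> "SketchExpr":
        return self._rename(lambda name: p + name)

    def map_alias(self, fn: Rename) -> "SketchExpr":
        return self._rename(fn)

    def alias(self, new_name: str) -> "SketchExpr":
        # alias resets the buffer: earlier renames are dropped, later ones still apply.
        return SketchExpr(self.root, self.steps, (lambda _old: new_name,))

    def output_name(self) -> str:
        # Replay the buffered naming operations on the root column name.
        name = self.root
        for fn in self.renames:
            name = fn(name)
        return name


names = [
    SketchExpr(col).pct_change().suffix("_pct_change").mean().suffix("_mean").output_name()
    for col in ("a", "b")
]
print(names)  # ['a_pct_change_mean', 'b_pct_change_mean']
```

Under these assumed semantics, `SketchExpr("close").suffix("_10_sma").alias("x").suffix("_20_sma")` would resolve to `x_20_sma`, because the alias discards the earlier suffix but not the later one.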

Motivating Example

I'm working to replicate the stockstats pandas library in polars. I have dozens of functions that generate expressions to calculate stock market indicators. Here's an example of two:

def sma(input: pl.Expr, window: int, min_periods=1) -> pl.Expr:
    """Simple moving average"""
    result = input.rolling_mean(window, min_periods=min_periods)
    return result.suffix(f"_{window}_sma")

def boll_ub(input: pl.Expr = pl.col("close"), period=20, n_std_devs: float = 2.0) -> pl.Expr:
    """
    Bollinger Upper Band
    @see https://www.investopedia.com/terms/b/bollingerbands.asp
    """
    return sma(input, period).suffix(f"_{period}_boll_ub") + moving_std(input, period, n_std_devs)

Trying to run a select operation using boll_ub leads to an error saying that .suffix must be the last operation.

Another use case is to do multiple nested simple moving averages:

# Ideally this would name the column `close_10_sma_20_sma`
expr = sma(sma(pl.col("close"), 10), 20)

# That default name is good, but could be better.  Ideally end users could still use alias or map_alias to change the name
expr = expr.alias("close_double_sma")

Recall that there are dozens of operations like this, calling multiple other stock indicators internally and doing combinations of aliases and suffixes. Ideally, I could simply do a select operation like this and each column would automatically name itself:

close = pl.col("close")

stock_df = df.select([
    close, "open", "high", "low", "volume", "change", "change_over_time", "change_percent", "vwap",
    sma(pl.col(["open", "close"]), 5), sma(close, 7), sma(sma(close, 10), 10), macd(), boll(), boll_ub(),
    boll_lb(), rsi(), dx(), trix(close, 12), vwma(), chop(), cr()
])

Alternatives

Manually name each indicator: I could stop using .suffix and .map_alias operations, and instead require the column naming to be done manually, but this severely reduces readability and maintainability, especially given I expect there to be many more indicators than this.

stock_df = df.select([
    close, "open", "high", "low", "volume", "change", "change_over_time", "change_percent", "vwap",
    sma(pl.col(["open", "close"]), 5).suffix("_5_sma"), sma(close, 7).alias("close_7_sma"),
    sma(sma(close, 10), 20).alias("close_10_sma_20_sma"),
    macd().alias("close_macd"), boll().alias("close_boll_20"), boll_ub().alias("close_boll_ub_20"),
    boll_lb().alias("close_boll_lb_20"), rsi().alias("close_rsi"),
    dx().alias("close_dx"), trix(close, 12), vwma().alias("close_vwma"), chop().alias("chop"), cr().alias("cr")
])

# Note that most operations default to operating on close, which is why macd() is aliased to "close_macd"

Make the suffixes optional: The path I'm leaning towards right now is to add an add_suffix=True parameter to each of the indicators. For instance:

def sma(input: pl.Expr, window: int, min_periods=1, add_suffix=True) -> pl.Expr:
    """Simple moving average"""
    result = input.rolling_mean(window, min_periods=min_periods)
    return result.suffix(f"_{window}_sma") if add_suffix else result

This has a few problems:

  • It doesn't help with nested calls sma(sma(close, 10), 20)
  • This pollutes the codebase with tons of add_suffix=False
  • It's not immediately clear when you can or should use add_suffix

Expose another API: I strongly considered creating an API like this:

from typing import List

def parse_stockstats_expressions(expressions: List[str]) -> List[pl.Expr]:
    pass  # TODO

stock_stats = df.select(parse_stockstats_expressions(["close", "open", "close_5_sma", "close_5_sma_trix", ...]))

Given polars' current constraints, this is perhaps the cleanest API. However, I don't like the magical nature of this, and parsing is error-prone. I much prefer the more explicit function call notation from above, especially as it makes it trivial to jump to the source and documentation for each indicator.
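For a sense of what the parsing side might involve, here is a rough sketch of just the name-splitting step. It is pure Python and builds no expressions. `parse_indicator_name`, `KNOWN_INDICATORS`, and the `(window, indicator)` step format are all assumptions made up for this example, not an existing API.

```python
from typing import List, Optional, Tuple

# Hypothetical indicator vocabulary; a real library would have many more entries.
KNOWN_INDICATORS = {"sma", "trix"}

Step = Tuple[Optional[int], str]  # (window or None, indicator name)


def parse_indicator_name(name: str, columns: List[str]) -> Tuple[str, List[Step]]:
    """Split e.g. "close_5_sma_trix" into ("close", [(5, "sma"), (None, "trix")])."""
    # Pick the longest known column that prefixes the name, so that a column
    # such as "change_over_time" is not mistaken for "change" plus indicators.
    candidates = [c for c in columns if name == c or name.startswith(c + "_")]
    if not candidates:
        raise ValueError(f"no known column at the start of {name!r}")
    base = max(candidates, key=len)

    rest = name[len(base):]
    tokens = rest.split("_")[1:] if rest else []

    steps: List[Step] = []
    window: Optional[int] = None
    for token in tokens:
        if token.isdigit():
            if window is not None:
                raise ValueError(f"two consecutive windows in {name!r}")
            window = int(token)
        elif token in KNOWN_INDICATORS:
            # An indicator consumes the pending window, if any.
            steps.append((window, token))
            window = None
        else:
            raise ValueError(f"unknown token {token!r} in {name!r}")
    if window is not None:
        raise ValueError(f"window {window} in {name!r} is not followed by an indicator")
    return base, steps


print(parse_indicator_name("close_5_sma_trix", ["close", "open"]))
# ('close', [(5, 'sma'), (None, 'trix')])
```

Even this small version needs the list of real column names to resolve underscores, which is a good illustration of why the string-based route is fragile.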

Conclusion

Thanks for the help everyone! I'm absolutely loving polars, and would be willing to contribute if this is a path we choose to go down.

vultix • Jul 16 '22 22:07

Here is an additional area where things break: I can't do sma(close, 10).over("ticker") without getting this error:

`keep_name`, `suffix`, `prefix` should be last expression

vultix • Jul 16 '22 23:07

Also, this fails without returning anything or giving an error:

pl.DataFrame({'a': [0], 'b': [1]}).select(pl.all().prefix('_mean').suffix('avg_'))
Polars: 0.16.14
Index type: UInt32
Platform: Windows-10-10.0.22621-SP0
Python: 3.7.10 (default, Feb 26 2021, 13:06:18) [MSC v.1916 64 bit (AMD64)]
---Optional dependencies---
numpy: 1.20.3
pandas: 1.2.5
pyarrow: 4.0.1
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
matplotlib: 3.1.1
xlsx2csv: <not installed>
xlsxwriter: <not installed>

dah33 • Mar 25 '23 20:03