Additional operators for `update_by` requested
To support production use cases, we need the following operators (also found in #4424):
- median
- percentile
- var
- cor
- count_neg, count_pos
- cum_std
But also needed are the following (supported by pandas / Polars):
- last
- rank
- pct_change
Other count operations like count_null, count_nan, etc. would be useful.
As we have done in other cases, null values should be ignored, and NaN values are included -- typically resulting in poisoning.
I looked through the pandas docs and found a few more operations that we should really support. Below is an attempt at a more comprehensive and carefully curated list.
As has been the case for other operations:
- `null` values are ignored in calculations.
- `NaN` values are included in calculations. Typically, this means that `NaN` poisons results, so the operator will return `NaN` after seeing a `NaN`.
- `+0.0` and `-0.0` are considered to be the same and equivalent.
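The null and NaN rules above can be illustrated with a minimal sketch in plain Python. This is not engine code: `None` stands in for a null value, and the function names `cum_sum` and `count_zero` are used here only for illustration. It also assumes the output stays null until the first non-null input arrives.

```python
from typing import Iterable, List, Optional


def cum_sum(values: Iterable[Optional[float]]) -> List[Optional[float]]:
    """Cumulative sum: None (null) is skipped, NaN poisons everything after it."""
    out: List[Optional[float]] = []
    total: Optional[float] = None  # assumption: null until the first non-null value
    for v in values:
        if v is not None:
            # NaN needs no special case: NaN + x is NaN, so it propagates.
            total = v if total is None else total + v
        out.append(total)
    return out


def count_zero(values: Iterable[Optional[float]]) -> int:
    """Count zeros, treating +0.0 and -0.0 as the same value; nulls and NaN never match."""
    return sum(1 for v in values if v is not None and v == 0)
```

For example, `cum_sum([1.0, None, 2.0, float("nan"), 3.0])` carries `1.0` across the null, reaches `3.0`, and then returns `NaN` for every remaining row.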
Operators have a few different contexts:
- agg
- update_by cumulative
- update_by window / rolling
Missing cumulative operators:
- [ ] `cum_avg`
- [ ] `cum_wavg`
- [ ] `cum_std`
- [ ] `cum_count` https://github.com/deephaven/deephaven-core/pull/6270
- [ ] `cum_formula` (?)
- [ ] `cum_group`
New operators (singleton)
- [ ] `delta_pct` (Naming seems more consistent with the existing `delta` than the originally proposed `pct_change`)
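To make the proposal concrete, here is a rough sketch of what `delta_pct` might compute, written in plain Python. The semantics are assumptions, since this issue does not define them: the result is `(current - previous) / previous`, the first row is null, a null in either the current or the previous row yields null, and division by zero follows IEEE 754 rather than raising. The helper `_ieee_div` is hypothetical.

```python
import math
from typing import List, Optional, Sequence


def _ieee_div(num: float, den: float) -> float:
    """Divide like IEEE 754 doubles instead of raising on a zero divisor."""
    if den != 0:
        return num / den
    if num == 0 or math.isnan(num):
        return math.nan
    return math.copysign(math.inf, num) * math.copysign(1.0, den)


def delta_pct(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fractional change from the previous row (assumed semantics, see lead-in)."""
    out: List[Optional[float]] = []
    prev: Optional[float] = None
    for v in values:
        if v is None or prev is None:
            out.append(None)  # assumption: a null on either side gives null
        else:
            out.append(_ieee_div(v - prev, prev))
        prev = v  # the previous row, even if it was null
    return out
```

Whether a null previous row should instead fall back to the last non-null value is a design choice that would need to be settled alongside the existing `delta` behavior.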
New operators (agg, cumulative, and rolling):
- [ ] `count_neg` (+) https://github.com/deephaven/deephaven-core/pull/6358
- [ ] `count_pos` (+) https://github.com/deephaven/deephaven-core/pull/6358
- [ ] `count_zero` (+) https://github.com/deephaven/deephaven-core/pull/6358
- [ ] `count_null` (+) https://github.com/deephaven/deephaven-core/pull/6358
- [ ] `count_nan` (+) https://github.com/deephaven/deephaven-core/pull/6358
- [ ] `count_inf` (+) https://github.com/deephaven/deephaven-core/pull/6358
- [ ] `count_finite` (+) https://github.com/deephaven/deephaven-core/pull/6358
- [ ] `count_non_zero` (+) https://github.com/deephaven/deephaven-core/pull/6358
- [ ] `count_non_negative` (+) https://github.com/deephaven/deephaven-core/pull/6358
- [ ] `count_non_positive` (+) https://github.com/deephaven/deephaven-core/pull/6358
- [ ] `first` (!)
- [ ] `last` (!)
- [ ] `offset` (!)
- [ ] `median`
- [ ] `rank`
- [ ] `percentile` (`pct` may be a name more consistent with agg)
- [ ] `abs_sum`
- [ ] `abs_avg`
- [ ] `abs_wavg` (+)
- [ ] `wstd`
- [ ] `ste`
- [ ] `wste`
- [ ] `var`
- [ ] `wvar`
- [ ] `tstat`
- [ ] `wtstat`
- [ ] `skew` (*) (+)
- [ ] `kurtosis` (*) (+)
- [ ] `cov` (*)
- [ ] `cor`
Don't Do Operators? (Present in agg)
These are present in agg, but they may not be worth adding to the other cases until there is demand. They need some discussion.
- [ ] `distinct`
- [ ] `unique`
- [ ] `sorted_first`
- [ ] `sorted_last`
(?) There will be some debate about whether this method should be implemented, because of efficiency concerns.
(*) May involve some tricky, careful numerics to compute good values. Need to be careful in defining the calculation.
(+) Not yet implemented in Numerics.ftl
(!) There has been some discussion around these operations with @rcaudy and @chipkent. `cum_first`/`cum_last` are the same as `first_by`/`last_by`, so there is an argument to not include them. `offset` is proposed as a way to get a value at a specific index or time offset instead of having a first/last operator. For time offsets, there needs to be a way to disambiguate if there are multiple values with the same time offset. `offset` would not be supported by agg, but `first` and `last` would.
Details on computing skewness and excess kurtosis can be found at:
- https://www.macroption.com/skewness-formula/
- https://en.wikipedia.org/wiki/Skewness#Sample_skewness
- https://en.wikipedia.org/wiki/Kurtosis#Sample_kurtosis
We want the sample skewness and sample excess kurtosis. The formulae used by Excel, SAS, etc. have probably been well vetted.
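As a reference point for the numerics discussion, here is a minimal two-pass sketch of the adjusted sample skewness and sample excess kurtosis in plain Python, using the same form as Excel's `SKEW` and `KURT`. The function names are illustrative, `None` stands in for null, and returning `NaN` when there are too few values or zero variance is an assumption rather than a settled decision. A streaming or rolling implementation would need a more careful update scheme than this.

```python
import math
from typing import Iterable, List, Optional, Tuple


def _standardized(values: Iterable[Optional[float]]) -> Tuple[int, Optional[List[float]]]:
    """Return (n, z-scores using the sample std dev), skipping nulls (None)."""
    xs = [float(v) for v in values if v is not None]
    n = len(xs)
    if n < 2:
        return n, None
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) * (x - mean) for x in xs) / (n - 1))
    if s == 0:
        return n, None  # zero variance: the statistics are undefined
    return n, [(x - mean) / s for x in xs]


def sample_skewness(values: Iterable[Optional[float]]) -> float:
    """G1 = n / ((n-1)(n-2)) * sum(z^3)."""
    n, z = _standardized(values)
    if z is None or n < 3:
        return math.nan
    return n / ((n - 1) * (n - 2)) * sum(v * v * v for v in z)


def sample_excess_kurtosis(values: Iterable[Optional[float]]) -> float:
    """G2 = n(n+1) / ((n-1)(n-2)(n-3)) * sum(z^4) - 3(n-1)^2 / ((n-2)(n-3))."""
    n, z = _standardized(values)
    if z is None or n < 4:
        return math.nan
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * sum(v * v * v * v for v in z) - correction
```

For the values 1 through 5 this gives a skewness of 0 (up to rounding) and an excess kurtosis of -1.2. A `NaN` input propagates through the mean and standard deviation, so the result is `NaN`, consistent with the poisoning rule.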
Details on computing the sample covariance can be found at:
- https://en.wikipedia.org/wiki/Covariance#Calculating_the_sample_covariance
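For comparison, a minimal two-pass sketch of the sample covariance (with the `n - 1` divisor) and the matching correlation in plain Python. The names `sample_cov` and `sample_cor` are illustrative only. Dropping a pair when either side is null, and returning `NaN` for fewer than two pairs or zero variance, are assumptions.

```python
import math
from typing import Optional, Sequence, Tuple


def _paired_sums(xs: Sequence[Optional[float]],
                 ys: Sequence[Optional[float]]) -> Tuple[int, float, float, float]:
    """Return (n, Sxy, Sxx, Syy) over pairs where neither value is null (None)."""
    pairs = [(float(x), float(y)) for x, y in zip(xs, ys)
             if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return n, math.nan, math.nan, math.nan
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) * (x - mx) for x, _ in pairs)
    syy = sum((y - my) * (y - my) for _, y in pairs)
    return n, sxy, sxx, syy


def sample_cov(xs: Sequence[Optional[float]], ys: Sequence[Optional[float]]) -> float:
    """Sample covariance: Sxy / (n - 1)."""
    n, sxy, _, _ = _paired_sums(xs, ys)
    return sxy / (n - 1) if n >= 2 else math.nan


def sample_cor(xs: Sequence[Optional[float]], ys: Sequence[Optional[float]]) -> float:
    """Pearson correlation: Sxy / sqrt(Sxx * Syy); the (n - 1) factors cancel."""
    n, sxy, sxx, syy = _paired_sums(xs, ys)
    if n < 2:
        return math.nan
    denom = math.sqrt(sxx) * math.sqrt(syy)
    return sxy / denom if denom != 0 else math.nan
```

With `xs = [1, 2, 3, 4]` and `ys = [2, 4, 6, 8]`, the covariance is 10 / 3 and the correlation is 1 (up to rounding).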
Hi @chipkent, are there any plans to introduce this functionality?