[UNIT TEST TRACKER] tsfresh
Tracker for the unit tests. The tests should cover pl.Series and pl.Expr, for both eager and lazy execution (where implemented).
- [x] absolute_energy
- [x] absolute_maximum
- [x] absolute_sum_of_changes
- [x] approximate_entropy
- [x] autocorrelation
- [ ] autoregressive_coefficients
- [x] benford_correlation
- [x] binned_entropy
- [x] c3
- [x] change_quantiles
- [x] cid_ce
- [x] count_above
- [x] count_above_mean
- [x] count_below
- [x] count_below_mean
- [x] energy_ratios
- [x] first_location_of_maximum
- [x] first_location_of_minimum
- [x] has_duplicate
- [x] has_duplicate_max
- [x] has_duplicate_min
- [x] index_mass_quantile
- [x] large_standard_deviation
- [x] last_location_of_maximum
- [x] last_location_of_minimum
- [x] longest_strike_above_mean
- [x] longest_strike_below_mean
- [x] mean_abs_change
- [x] mean_change
- [x] mean_n_absolute_max
- [x] mean_second_derivative_central
- [x] number_crossings
- [ ] number_cwt_peaks
- [x] number_peaks
- [x] percent_reoccuring_values
- [x] percent_reocurring_points
- [x] permutation_entropy
- [x] range_count
- [x] ratio_beyond_r_sigma
- [x] ratio_n_unique_to_length
- [x] root_mean_square
- [x] sample_entropy
- [ ] spkt_welch_density
- [x] sum_reocurring_points
- [x] sum_reocurring_values
- [x] symmetry_looking
- [x] time_reversal_asymmetry_statistic
- [x] variation_coefficient
- [x] var_gt_std
- [ ] cwt_coefficients
- [x] fourier_entropy
- [ ] friedrich_coefficients
- [x] lempel_ziv_complexity
- [x] linear_trend
- [ ] partial_autocorrelation
Additional Features:
- [x] range_over_mean (This and range_change are tested but tests are non-exhaustive.)
- [x] range_change
- [x] longest_winning_streak (special case of longest_streak_above)
- [x] longest_losing_streak (special case of longest_streak_below)
- [x] streak_length_stats
- [x] longest_streak_above
- [x] longest_streak_below
- [x] max_abs_change
Currently working on:
- variation_coefficient
- var_gt_std
- large_standard_deviation
- range_count
- ratio_beyond_r_sigma
- ratio_n_unique_to_length
- root_mean_square
- variation_coefficient
- var_gt_std
- large_standard_deviation
- range_count
- mean_change
- mean_abs_change
The features above have been submitted in a pull request.
The entropy ones are done in my PR https://github.com/neocortexdb/functime/pull/91
I'm going to start on:
- linear_trend.
I will work on it in a new branch and make a separate PR for ease of review.
As discussed in discord, we will follow:
https://en.wikipedia.org/wiki/IEEE_754
In summary:
- 0/0 -> np.nan
- ∞×0 -> np.nan
- x > 0, x/0 -> np.inf
- x < 0, x/0 -> -np.inf (np.NINF was removed in NumPy 2.0)
After we are done with the unit tests, we will double-check our conformity to IEEE 754.
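As a quick sanity check of the conventions listed above, here is a minimal sketch using NumPy float64 scalars. It only illustrates the four rules; it is not taken from the functime test suite, and the variable names are made up for the example.

```python
import numpy as np

# NumPy emits RuntimeWarnings for these operations; silence them so the
# IEEE 754 results can be inspected directly.
with np.errstate(divide="ignore", invalid="ignore"):
    zero = np.float64(0.0)
    zero_over_zero = zero / zero                 # 0/0
    inf_times_zero = np.float64(np.inf) * zero   # inf * 0
    pos_over_zero = np.float64(1.5) / zero       # x > 0, x/0
    neg_over_zero = np.float64(-1.5) / zero      # x < 0, x/0

print(zero_over_zero, inf_times_zero, pos_over_zero, neg_over_zero)
```

Note that plain Python floats raise ZeroDivisionError on division by zero, so expected values in tests are easier to build from NumPy scalars or arrays.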
I have made a PR for linear_trend.
- number_crossings
- number_cwt_peaks
- autoregressive_coefficients
Did number_crossings today @claysmyth, sorry I forgot to add it here.
- friedrich_coefficients
- streak_length_stats
@abstractqqq did you have expected behaviour in mind for streak_length_stats where there are no streaks?
Right now it will filter to an empty Series/DataFrame at:
y = y.filter(y.struct.field("values")).struct.field("lengths")
Then, the outputs are:
- if the input is a Series and there are no streaks: return None for every dictionary value, i.e.
{"min": None, "max": None, etc.}
(The mode needs a correction for this case, where the input is a Series and there are no streaks, but the error handling for this is fairly straightforward.)
- if the input is a DataFrame/LazyFrame and there are no streaks: return an expression which will evaluate to an empty DataFrame.
Is this behaviour okay? I'm thinking probably not, as the point of these panel transformations is to return a primitive for each panel.
I think it makes more sense to also return a single-row DataFrame when there are no streaks, but I am struggling to make this compatible with lazy execution.
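To make the eager no-streak case concrete, here is a plain-Python sketch. It assumes a streak is a run of consecutive values strictly above a threshold, which may not match the actual definition in functime, and `streak_length_stats_sketch` is a hypothetical name used only for illustration.

```python
from itertools import groupby
from statistics import mean


def streak_length_stats_sketch(values, threshold):
    """Illustrative only: stats over lengths of runs with value > threshold."""
    # groupby yields consecutive runs sharing the same key (True/False);
    # keep only the lengths of runs where the key is True.
    lengths = [
        sum(1 for _ in run)
        for is_above, run in groupby(values, key=lambda v: v > threshold)
        if is_above
    ]
    if not lengths:
        # No streaks: every statistic is undefined, so return None for each.
        return {"min": None, "max": None, "mean": None}
    return {"min": min(lengths), "max": max(lengths), "mean": mean(lengths)}


with_streaks = streak_length_stats_sketch([1, 5, 6, 1, 7], threshold=2)
no_streaks = streak_length_stats_sketch([1, 1, 1], threshold=2)
```

The second call is the case in question: the filtered list of lengths is empty, so the function still returns one dictionary, just with None values.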
Thank you for looking into this. This is my take:
- For eager (Series), I think it is fine to return a dict with Nones, but maybe we fill min with 0.
- For lazy, I do not intend users to call this function themselves. In the namespace, we should create a method like extract_streak_stats that extracts the min, max, avg... from the struct and fills nulls accordingly. So I think it is better to keep it as a struct. Maybe we add an underscore prefix to the function to signal that we don't want users to call it from the tsfresh.py module. I think the namespace is the right place.
Nice, thanks, that is useful. I have made the small changes you proposed for Series, and I'll add them in a PR along with tests.
For LazyFrames with no streaks, this will currently return an empty struct column.
shape: (0, 1)
┌───────────┐
│ min       │
│ ---       │
│ struct[8] │
╞═══════════╡
└───────────┘
Is that okay?
I'm concerned because, if you perform this transformation alongside any other column transformations, the output shapes will differ, so you won't be able to safely combine it with others. Filling nulls won't solve the issue here. But since users won't be calling it themselves, you may already have something in mind, such as the threshold argument being a value that is guaranteed to be in the series, so that there is always at least a one-length streak.
@MathieuCayssol In my latest branch, I gated augmented_dickey_fuller behind a UseAtOwnRisk warning. I posted about this on discord. Let's remove augmented_dickey_fuller from the list of features we want to test now.
I think returning an empty row is OK as long as something like pl.col("a").tse.streak_length_stats().struct.field("mean") returns a null.
I guess you are already working on this so I won't touch the code. Thanks for looking into this!
- autoregressive_coefficients
- benford_correlation2