functime [UNIT TEST TRACKER] tsfresh

Tracker for the unit tests. Each test should cover pl.Series and pl.Expr, in both eager and lazy mode (where implemented).

[x] absolute_energy
[x] absolute_maximum
[x] absolute_sum_of_changes
[x] approximate_entropy
[x] autocorrelation
[ ] autoregressive_coefficients
[x] benford_correlation
[x] binned_entropy
[x] c3
[x] change_quantiles
[x] cid_ce
[x] count_above
[x] count_above_mean
[x] count_below
[x] count_below_mean
[x] energy_ratios
[x] first_location_of_maximum
[x] first_location_of_minimum
[x] has_duplicate
[x] has_duplicate_max
[x] has_duplicate_min
[x] index_mass_quantile
[x] large_standard_deviation
[x] last_location_of_maximum
[x] last_location_of_minimum
[x] longest_strike_above_mean
[x] longest_strike_below_mean
[x] mean_abs_change
[x] mean_change
[x] mean_n_absolute_max
[x] mean_second_derivative_central
[x] number_crossings
[ ] number_cwt_peaks
[x] number_peaks
[x] percent_reoccuring_values
[x] percent_reocurring_points
[x] permutation_entropy
[x] range_count
[x] ratio_beyond_r_sigma
[x] ratio_n_unique_to_length
[x] root_mean_square
[x] sample_entropy
[ ] spkt_welch_density
[x] sum_reocurring_points
[x] sum_reocurring_values
[x] symmetry_looking
[x] time_reversal_asymmetry_statistic
[x] variation_coefficient
[x] var_gt_std
[ ] cwt_coefficients
[x] fourier_entropy
[ ] friedrich_coefficients
[x] lempel_ziv_complexity
[x] linear_trend
[ ] partial_autocorrelation

Additional Features:

[x] - range_over_mean (This and range_change are tested but tests are non-exhaustive.)
[x] - range_change
[x] - longest_winning_streak (special case of longest_streak_above)
[x] - longest_losing_streak (special case of longest_streak_below)
[x] - streak_length_stats
[x] - longest_streak_above
[x] - longest_streak_below
[x] - max_abs_change

Oct 11 '23 19:10 MathieuCayssol

Currently working on:

variation_coefficient
var_gt_std
large_standard_deviation
range_count

Oct 16 '23 18:10 TomBurdge

ratio_beyond_r_sigma
ratio_n_unique_to_length
root_mean_square

Oct 16 '23 20:10 MathieuCayssol

variation_coefficient
var_gt_std
large_standard_deviation
range_count
mean_change
mean_abs_change

Submitted as a pull request.

Oct 17 '23 10:10 TomBurdge

The entropy ones are done in my PR https://github.com/neocortexdb/functime/pull/91

Oct 18 '23 06:10 abstractqqq

I'm going to start on:

linear_trend.

I will work on it in a new branch and make a separate PR for ease of review.

Oct 18 '23 13:10 TomBurdge

As discussed on Discord, we will follow:

https://en.wikipedia.org/wiki/IEEE_754

In summary:

0/0 -> np.nan
∞×0 -> np.nan
x > 0, x/0 -> np.inf
x < 0, x/0 -> np.NINF

After we are done with the unit tests, we will double-check our conformity to IEEE 754.
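The conventions above can be checked with a small NumPy sketch. The `ieee_divide` helper and the `results` dictionary are illustrative names only. Note that `np.NINF` was removed in NumPy 2.0, so the sketch uses `-np.inf`, which works on both old and new versions.

```python
import numpy as np


def ieee_divide(x, y):
    """Divide two floats with IEEE 754 semantics, silencing NumPy's warnings."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.divide(np.float64(x), np.float64(y))


with np.errstate(invalid="ignore"):
    inf_times_zero = np.float64(np.inf) * np.float64(0.0)

results = {
    "0/0": ieee_divide(0.0, 0.0),        # expected: nan
    "inf*0": inf_times_zero,             # expected: nan
    "x/0, x>0": ieee_divide(1.0, 0.0),   # expected: +inf
    "x/0, x<0": ieee_divide(-1.0, 0.0),  # expected: -inf
}
```

Plain Python floats raise ZeroDivisionError on division by zero, so a unit test that wants to verify these cases has to go through NumPy (or Polars) values rather than built-in floats.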

Oct 18 '23 16:10 MathieuCayssol

I have made a PR for linear_trend.

Oct 19 '23 11:10 TomBurdge

number_crossings
number_cwt_peaks
autoregressive_coefficients

Oct 20 '23 16:10 claysmyth

Did number_crossings today @claysmyth, sorry I forgot to add it here.

Oct 20 '23 17:10 plaales

friedrich_coefficients

Oct 21 '23 11:10 plaales

streak_length_stats

Oct 22 '23 18:10 TomBurdge

@abstractqqq did you have expected behaviour in mind for streak_length_stats where there are no streaks?

Right now it will filter to an empty Series/DataFrame at: y = y.filter(y.struct.field("values")).struct.field("lengths")

Then, the outputs are:

if input is Series and no streaks: return None as all dictionary values i.e. {"min":None, "max":None, etc.}

(It needs a correction for mode in this case where input type Series and there are no streaks, but the error handling for this is fairly straightforward).

if input is DataFrame/LazyFrame and there are no streaks: return an expression which will evaluate to an empty DataFrame.

Is this behaviour okay? I'm thinking probably not, as the point of these panel transformations is to return a primitive for each panel.

I think it makes more sense to return a single-row DataFrame even when there are no streaks, but I am struggling to make this compatible with lazy execution.

Oct 23 '23 15:10 TomBurdge

Thank you for looking into this. This is my take:

For eager (Series), I think it is fine to return a dict with Nones, but maybe we fill min with 0.
For lazy, I do not intend for users to call this function themselves. Instead, in the namespace we should create a method like extract_streak_stats that extracts the min, max, avg, ... from the struct and fills nulls accordingly. So I think it is better to keep it as a struct. Maybe we add an underscore prefix to the function to signal that users should not call it from the tsfresh.py module. I think the namespace is the right place.
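To make the eager proposal concrete, here is a minimal pure-Python sketch, not functime's implementation. The function name `streak_stats_sketch`, the definition of a streak as a run of consecutive values above `threshold`, and the reduced set of statistics are all assumptions made for illustration.

```python
from itertools import groupby
from statistics import mean, median


def streak_stats_sketch(values, threshold=0.0):
    """Summarise the lengths of runs of consecutive values above `threshold`."""
    flags = [v > threshold for v in values]
    # Keep only the runs where the flag is True and record each run's length.
    lengths = [sum(1 for _ in run) for is_above, run in groupby(flags) if is_above]
    if not lengths:
        # No streaks: every statistic is None, except min, which is filled with 0.
        return {"min": 0, "max": None, "mean": None, "median": None}
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


with_streaks = streak_stats_sketch([1, 1, -1, 1, 1, 1])  # runs of length 2 and 3
no_streaks = streak_stats_sketch([-1, -2, -3])           # no value above 0
```

The key point is that the caller always gets a dictionary with the same keys, whether or not any streak exists.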

Oct 23 '23 17:10 abstractqqq

Nice, thanks, that is useful. I have made the small changes you proposed for Series. I'll add this in a PR along with tests.

For LazyFrames with no streaks, this will currently return an empty struct column.

shape: (0, 1)
┌───────────┐
│ min       │
│ ---       │
│ struct[8] │
╞═══════════╡
└───────────┘

Is that okay?

I'm concerned because, if you perform this transformation alongside any other column transformations, the output shapes will differ, so you won't be able to safely combine it with others. Filling nulls won't solve the issue here. But since the user won't be calling it directly, you may already have something in mind, such as the threshold argument being a value that is guaranteed to be in the series, so that there is always at least a one-length streak.

Oct 24 '23 09:10 TomBurdge

@MathieuCayssol In my latest branch, I gated augmented_dickey_fuller behind a UseAtOwnRisk warning. I posted about this on discord. Let's remove augmented_dickey_fuller from the list of features we want to test now.

Oct 25 '23 04:10 abstractqqq

I think returning an empty row is OK as long as something like pl.col("a").tse.streak_length_stats().struct.field("mean") returns a null.

I guess you are already working on this so I won't touch the code. Thanks for looking into this!

Oct 25 '23 04:10 abstractqqq

autoregressive_coefficients

Oct 25 '23 15:10 TomBurdge

benford_correlation2

Oct 26 '23 15:10 vienneraphael