[UNIT TEST TRACKER] tsfresh

Open MathieuCayssol opened this issue 2 years ago • 18 comments

Tracker for the unit tests. The tests should cover pl.Series and pl.Expr, for both eager and lazy execution (if implemented).

  • [x] absolute_energy
  • [x] absolute_maximum
  • [x] absolute_sum_of_changes
  • [x] approximate_entropy
  • [x] autocorrelation
  • [ ] autoregressive_coefficients
  • [x] benford_correlation
  • [x] binned_entropy
  • [x] c3
  • [x] change_quantiles
  • [x] cid_ce
  • [x] count_above
  • [x] count_above_mean
  • [x] count_below
  • [x] count_below_mean
  • [x] energy_ratios
  • [x] first_location_of_maximum
  • [x] first_location_of_minimum
  • [x] has_duplicate
  • [x] has_duplicate_max
  • [x] has_duplicate_min
  • [x] index_mass_quantile
  • [x] large_standard_deviation
  • [x] last_location_of_maximum
  • [x] last_location_of_minimum
  • [x] longest_strike_above_mean
  • [x] longest_strike_below_mean
  • [x] mean_abs_change
  • [x] mean_change
  • [x] mean_n_absolute_max
  • [x] mean_second_derivative_central
  • [x] number_crossings
  • [ ] number_cwt_peaks
  • [x] number_peaks
  • [x] percent_reoccuring_values
  • [x] percent_reocurring_points
  • [x] permutation_entropy
  • [x] range_count
  • [x] ratio_beyond_r_sigma
  • [x] ratio_n_unique_to_length
  • [x] root_mean_square
  • [x] sample_entropy
  • [ ] spkt_welch_density
  • [x] sum_reocurring_points
  • [x] sum_reocurring_values
  • [x] symmetry_looking
  • [x] time_reversal_asymmetry_statistic
  • [x] variation_coefficient
  • [x] var_gt_std
  • [ ] cwt_coefficients
  • [x] fourier_entropy
  • [ ] friedrich_coefficients
  • [x] lempel_ziv_complexity
  • [x] linear_trend
  • [ ] partial_autocorrelation

Additional Features:

  • [x] - range_over_mean (This and range_change are tested but tests are non-exhaustive.)
  • [x] - range_change
  • [x] - longest_winning_streak (special case of longest_streak_above)
  • [x] - longest_losing_streak (special case of longest_streak_below)
  • [x] - streak_length_stats
  • [x] - longest_streak_above
  • [x] - longest_streak_below
  • [x] - max_abs_change

MathieuCayssol avatar Oct 11 '23 19:10 MathieuCayssol

currently working on:

  • variation_coefficient,
  • var_gt_std,
  • large_standard_deviation
  • range_count

TomBurdge avatar Oct 16 '23 18:10 TomBurdge

  • ratio_beyond_r_sigma
  • ratio_n_unique_to_length
  • root_mean_square

MathieuCayssol avatar Oct 16 '23 20:10 MathieuCayssol

  • variation_coefficient
  • var_gt_std
  • large_standard_deviation
  • range_count
  • mean_change
  • mean_abs_change

submitted for pull request.

TomBurdge avatar Oct 17 '23 10:10 TomBurdge

The entropy ones are done in my PR https://github.com/neocortexdb/functime/pull/91

abstractqqq avatar Oct 18 '23 06:10 abstractqqq

I'm going to start on:

  • linear_trend.

I will work on it in a new branch and make a separate PR for ease of review.

TomBurdge avatar Oct 18 '23 13:10 TomBurdge

As discussed in discord, we will follow:

https://en.wikipedia.org/wiki/IEEE_754

In summary:

  • 0/0 -> np.nan
  • ∞×0 -> np.nan
  • x > 0, x/0 -> np.inf
  • x < 0, x/0 -> np.NINF

After we are done with the unit tests, we will double-check our conformity to IEEE 754.
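
As a rough illustration, the four cases above can be checked with NumPy scalars. The helper names `ieee_divide` and `ieee_multiply` are made up for this sketch. It writes `-np.inf` rather than `np.NINF`, because the `np.NINF` alias was removed in NumPy 2.0.

```python
import numpy as np


def ieee_divide(numerator, denominator):
    """Divide two floats under IEEE 754 rules, silencing NumPy's warnings."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.float64(numerator) / np.float64(denominator)


def ieee_multiply(left, right):
    """Multiply two floats under IEEE 754 rules, silencing NumPy's warnings."""
    with np.errstate(invalid="ignore"):
        return np.float64(left) * np.float64(right)


zero_over_zero = ieee_divide(0.0, 0.0)       # 0/0 case
inf_times_zero = ieee_multiply(np.inf, 0.0)  # inf * 0 case
pos_over_zero = ieee_divide(1.5, 0.0)        # x > 0, x/0 case
neg_over_zero = ieee_divide(-1.5, 0.0)       # x < 0, x/0 case

assert np.isnan(zero_over_zero)
assert np.isnan(inf_times_zero)
assert pos_over_zero == np.inf
assert neg_over_zero == -np.inf
```

Note that plain Python floats raise `ZeroDivisionError` on division by zero, so such checks need NumPy (or Polars) values rather than built-in floats.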

MathieuCayssol avatar Oct 18 '23 16:10 MathieuCayssol

I have made a PR for linear_trend.

TomBurdge avatar Oct 19 '23 11:10 TomBurdge

  • number_crossings
  • number_cwt_peaks
  • autoregressive_coefficients

claysmyth avatar Oct 20 '23 16:10 claysmyth

Did number_crossings today @claysmyth, sorry I forgot to add it here.

plaales avatar Oct 20 '23 17:10 plaales

  • friedrich_coefficients

plaales avatar Oct 21 '23 11:10 plaales

  • streak_length_stats

TomBurdge avatar Oct 22 '23 18:10 TomBurdge

@abstractqqq did you have expected behaviour in mind for streak_length_stats where there are no streaks?

Right now it will filter to an empty Series/DataFrame at: y = y.filter(y.struct.field("values")).struct.field("lengths")

Then, the outputs are:

  • if input is Series and no streaks: return None as all dictionary values i.e. {"min":None, "max":None, etc.}

(It needs a correction for mode in the case where the input type is Series and there are no streaks, but the error handling for this is fairly straightforward.)

  • if input is DataFrame/LazyFrame and there are no streaks: return an expression which will evaluate to an empty DataFrame.

Is this behaviour okay? I'm thinking probably not, as the point of these panel transformations is to return a primitive for each panel.

I think it makes more sense to return a single row DataFrame also when there are no streaks, but I am struggling to make this compatible with lazy execution.

TomBurdge avatar Oct 23 '23 15:10 TomBurdge

Thank you for looking into this. This is my take:

  1. For eager (Series), I think it is fine to return a dict with Nones, but maybe we fill min to be 0.
  2. For lazy, I do not intend users to call this function themselves. In the namespace, we should create a method like extract_streak_stats that extracts the min, max, avg... from the struct and fills nulls accordingly. So I think it is better to keep it as a struct. Maybe we add an underscore prefix to the function to signal to users that we don't want them to call it from the tsfresh.py module. I think the namespace is the right place.
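
To make the eager proposal concrete, here is a pure-Python sketch of the "dict with Nones" behaviour. It is only an illustration: `streak_stats_sketch`, its definition of a streak (a run of consecutive values above `threshold`), the reduced set of keys and the `fill_min_with_zero` flag are all assumptions made for this example, not the functime code.

```python
from itertools import groupby
from statistics import mean


def streak_stats_sketch(values, threshold, fill_min_with_zero=False):
    """Summarise lengths of runs where consecutive values exceed threshold."""
    # groupby splits the input into runs of equal key; keep the "above" runs.
    lengths = [
        sum(1 for _ in run)
        for is_above, run in groupby(values, key=lambda v: v > threshold)
        if is_above
    ]
    if not lengths:
        # No streaks: return one value per key instead of an empty result.
        stats = {"min": None, "max": None, "mean": None}
        if fill_min_with_zero:
            stats["min"] = 0
        return stats
    return {"min": min(lengths), "max": max(lengths), "mean": mean(lengths)}


with_streaks = streak_stats_sketch([1, 5, 6, 1, 7], threshold=2)
no_streaks = streak_stats_sketch([1, 1, 1], threshold=2)
no_streaks_filled = streak_stats_sketch([1, 1, 1], threshold=2, fill_min_with_zero=True)
```

Returning a fixed set of keys in every case keeps the output shape stable, which is the property the unit tests for the no-streak case would want to pin down.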

abstractqqq avatar Oct 23 '23 17:10 abstractqqq

Nice, thanks, that is useful. I have made the small changes you propose for Series, and I'll add this in a PR along with tests.

For LazyFrames with no streaks, this will currently return an empty struct column.

shape: (0, 1)
┌───────────┐
│ min       │
│ ---       │
│ struct[8] │
╞═══════════╡
└───────────┘

Is that okay?

I'm concerned because, if you perform this transformation alongside any other column transformations, the shapes of the outputs will differ, so you won't be able to safely combine it with any others. Filling nulls won't solve the issue here. But, since the user won't be calling this themselves, you may already have something in mind, such as the threshold argument being a value that is guaranteed to be in the series, so that there is always at least a one-length streak.

TomBurdge avatar Oct 24 '23 09:10 TomBurdge

@MathieuCayssol In my latest branch, I gated augmented_dickey_fuller behind a UseAtOwnRisk warning. I posted about this on discord. Let's remove augmented_dickey_fuller from the list of features we want to test now.

abstractqqq avatar Oct 25 '23 04:10 abstractqqq

I think returning an empty row is OK as long as something like pl.col("a").tse.streak_length_stats().struct.field("mean") then returns a null.

I guess you are already working on this so I won't touch the code. Thanks for looking into this!

abstractqqq avatar Oct 25 '23 04:10 abstractqqq

  • autoregressive_coefficients

TomBurdge avatar Oct 25 '23 15:10 TomBurdge

  • benford_correlation2

vienneraphael avatar Oct 26 '23 15:10 vienneraphael