
Scalability of feature extraction

Open rth opened this issue 6 years ago • 8 comments

Thank you for this package!

Somewhat related to #356: as far as I understand, the input format is currently a pandas.DataFrame; the data is then aggregated by date and uid, and finally features are extracted from the individual time series. All functions in tsfresh/feature_extraction/feature_calculators.py appear to take either 1D arrays or pd.Series as input.

What I am wondering is whether computing features from 2D arrays (with axes ('uid', 'time')) would be computationally more efficient.
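For illustration, a minimal sketch (with made-up uid/time/value column names) of pivoting the long-format input into such a 2D (uid, time) array:

```python
import pandas as pd

# Long format as tsfresh consumes it: one row per (uid, time) observation
df = pd.DataFrame({
    "uid":   [1, 1, 1, 2, 2, 2],
    "time":  [0, 1, 2, 0, 1, 2],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Pivot into a 2D array with axes ('uid', 'time')
wide = df.pivot(index="uid", columns="time", values="value")
X = wide.to_numpy()  # shape (n_uids, n_timestamps)
assert X.shape == (2, 3)
```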

To investigate this, I have run a few performance benchmarks comparing

  • A. applying tsfresh directly to extract features from a columnar pd.DataFrame

against first aggregating the DataFrame by user and date to convert it into a labelled array (with xarray, for convenience),

  • B. then computing individual features manually with _do_extraction_on_chunk
  • C. re-implementing a few metrics in a vectorized fashion and applying them to the xarray directly

The results are available in the following notebooks:

The conclusion appears to be that feature extraction on 2D arrays would be 10-30x faster. Parallelization is disabled, and these benchmarks were run on a 2-core laptop for the following features: {'abs_energy': None, 'absolute_sum_of_changes': None}. So even the default multi-threading in BLAS for numpy operations would not explain this difference.
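As a rough sketch of the difference being measured (not the benchmark code itself), the two features above computed per-series in a loop versus in one pass over the 2D array:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 500))  # axes ('uid', 'time')

# Per-series loop, as the 1D feature calculators are applied today
loop_energy = np.array([np.dot(x, x) for x in X])               # abs_energy
loop_changes = np.array([np.abs(np.diff(x)).sum() for x in X])  # absolute_sum_of_changes

# Single vectorized pass over the whole 2D array
vec_energy = np.einsum("ij,ij->i", X, X)
vec_changes = np.abs(np.diff(X, axis=1)).sum(axis=1)

# Both formulations compute the same values
assert np.allclose(loop_energy, vec_energy)
assert np.allclose(loop_changes, vec_changes)
```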

In the end I'm trying to run feature extraction from tsfresh on ~5 GB of data (i.e. 10x larger than the larger benchmark) and in my rough estimation this would take half a day for all features. Of course dask parallelization can help, but I'm wondering if there is a computationally more efficient way of doing it before parallelization (that would also work with xarray).

Am I missing something? What do you think? I imagine that in the use case for which this package was developed, a pandas.DataFrame made more sense than e.g. xarray for storing time series data? Thank you.

rth avatar Feb 23 '18 16:02 rth

Hi @rth,

thanks for the issue and the benchmarks, it's always good to push the boundaries. From my point of view, you showed in your benchmark that it is possible to speed up the feature calculation for two very specific features.

I am curious to see if the calculation of the other features can be sped up similarly. I have not yet worked with xarrays.

When performing your benchmark, you should keep in mind that apart from the pure calculation of features, tsfresh also contains logic around the extraction that eats up runtime. So, it is natural that tsfresh will show a longer runtime.

What I am wondering about is whether computing features from 2D arrays (with axis ('uid', 'time')) would be computationally more efficient

this implies that all time series in the dataset have the same length, right? tsfresh supports datasets where the time series have different lengths.

I imagine in the use case for which this package was developed pandas.DataFrame made more sense than e.g. xarray to store time series data? Thank you.

there is no pressing need to use pandas DataFrames as the time series container. We developed the package that way because we were used to working with pandas DataFrames.

MaxBenChrist avatar Feb 24 '18 19:02 MaxBenChrist

Thanks for your response @MaxBenChrist

From my point of view, you showed in your benchmark that it is possible to speed up the feature calculation for two very specific features. I am curious to see if the calculation of the other features can be sped up similarly.

Well, those two were somewhat randomly chosen. The main difference is between doing an operation on a 2D array vs a for loop over 1D arrays. I imagine that in some cases the former would be faster due to better vectorization in the BLAS used by numpy. Though currently that would not directly work for metrics that require statsmodels, as the corresponding functions only support 1D arrays.
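For such 1D-only functions, a thin wrapper like np.apply_along_axis would at least keep the 2D interface, though it still loops in Python and would gain no BLAS speedup. A sketch, using a hypothetical 1D-only calculator as a stand-in for a statsmodels-backed function:

```python
import numpy as np

def lag1_autocovariance(x):
    """Hypothetical 1D-only feature calculator (stand-in for a
    statsmodels-backed function that rejects 2D input)."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / len(x))

X = np.arange(12, dtype=float).reshape(3, 4)  # axes ('uid', 'time')

# Apply the 1D function row by row over the 2D array
feats = np.apply_along_axis(lag1_autocovariance, 1, X)
assert feats.shape == (3,)
```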

I have not yet worked with xarrays.

Well, xarray was used mostly for convenience (to keep labels associated with array dimensions); the same could be done with plain numpy arrays.

tsfresh also contains logic around the extraction that eats up runtime. So, it is natural that tsfresh will show a longer runtime

Yes, and the idea could be to do the corresponding validation etc. once per dataset instead of once per 1D array. But I have not looked into how this could work with e.g. feature selection in your pipeline.

this implies that all time series in the dataset have the same length, right?

Not necessarily. It implies that all time series are sampled on the same time grid. Ideally it would also be uniformly sampled (tsfresh also makes this assumption in a few places). To handle series of different lengths, one would need to handle missing values (e.g. via NaNs) in the feature extraction metrics.
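A sketch of what NaN-padding onto a common time grid could look like, using nan-aware numpy reductions:

```python
import numpy as np

# Two series of different lengths, NaN-padded onto a common time grid
X = np.array([
    [1.0, 2.0, 3.0, np.nan],   # length-3 series
    [4.0, 5.0, 6.0, 7.0],      # length-4 series
])

# NaN-aware variants of the vectorized feature calculators
abs_energy = np.nansum(X * X, axis=1)
abs_sum_changes = np.nansum(np.abs(np.diff(X, axis=1)), axis=1)

assert np.allclose(abs_energy, [14.0, 126.0])
assert np.allclose(abs_sum_changes, [2.0, 3.0])
```

Note this only behaves correctly for trailing padding; an interior NaN would make np.diff drop two changes silently, which is exactly the kind of per-calculator care the NaN handling would require.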

Anyway, these are just some thoughts; I'm not sure this would make sense for the general use case of tsfresh.

rth avatar Feb 26 '18 14:02 rth

The main difference is between doing an operation on a 2D array vs a for loop on 1D arrays

FWIW, that's also more similar to the time series representation in tslearn...

rth avatar Feb 26 '18 15:02 rth

Thanks for the detailed answer @rth

Not necessarily. It implies that all time series are sampled on the same time grid. Ideally it would also be uniformly sampled (tsfresh also makes this assumption in a few places). To handle series of different length one would need to handle missing values (e.g. via NaNs) in the feature extraction metrics.

I am curious about the overhead and overall runtime costs if each feature calculator is checking for NaNs.

Well, those two were somewhat randomly chosen. The main difference is between doing an operation on a 2D array vs a for loop over 1D arrays. I imagine that in some cases the former would be faster due to better vectorization in the BLAS used by numpy. Though currently that would not directly work for metrics that require statsmodels, as the corresponding functions only support 1D arrays.

So, in essence we need to identify features that benefit from being calculated under BLAS on two dimensional arrays.
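One way to triage which calculators benefit would be a small per-feature timing harness along these lines (abs_energy shown as an example; absolute timings are machine-dependent):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 500))  # axes ('uid', 'time')

def loop_energy():
    # Per-series loop, as the 1D calculators are applied today
    return np.array([np.dot(x, x) for x in X])

def vec_energy():
    # One vectorized pass over the 2D array
    return np.einsum("ij,ij->i", X, X)

# Both variants must agree before their runtimes are comparable
assert np.allclose(loop_energy(), vec_energy())

t_loop = timeit.timeit(loop_energy, number=10)
t_vec = timeit.timeit(vec_energy, number=10)
print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.3f}s")
```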

Are you interested in working on this?

MaxBenChrist avatar Mar 01 '18 17:03 MaxBenChrist

I am curious about the overhead and overall runtime costs if each feature calculator is checking for NaNs.

Yes, I imagine it could be significant, which would defeat the purpose a bit.

So, in essence we need to identify features that benefit from being calculated under BLAS on two dimensional arrays.

True. But I also realised that in my use case some features were taking much more time than others (even those not marked with high_comp_cost=True). In particular, change_quantiles was taking more time than all the other features combined. I tried to re-run these benchmarks on a smaller subset of the data here (it's the 3rd notebook in the gist), but I cannot reproduce it. I guess it might be data size / dataset dependent.

Are you interested in working on this?

Unfortunately, I won't have the bandwidth to work on it anymore, but I hope these benchmarks could help the person who would.

rth avatar Mar 08 '18 09:03 rth

@rth have you spent any time on implementing the feature extractors for xarray? I think you are onto something here.

mmann1123 avatar Nov 17 '19 21:11 mmann1123

No, I haven't looked into this since 2018.

rth avatar Nov 17 '19 21:11 rth

Just if anyone is interested (slightly related): I posted about the scaling behavior of tsfresh here: https://nils-braun.github.io/execution-time/

nils-braun avatar May 03 '20 13:05 nils-braun