tsfresh
Scalability of feature extraction
Thank you for this package!
Somewhat related to #356. As far as I understand, currently the input format is a pandas.DataFrame; the data is then aggregated by date and uid, and finally features are extracted from the individual time series. All functions in tsfresh/feature_extraction/feature_calculators.py appear to take as input either 1D arrays or pd.Series.
What I am wondering is whether computing features from 2D arrays (with axes ('uid', 'time')) would be computationally more efficient.
I have done a few performance benchmarks to investigate this by comparing:
- A. applying tsfresh directly to extract features from a columnar pd.DataFrame;
- B. aggregating the DataFrame by user and date to convert it to a labelled array (with xarray, for convenience), then computing individual features manually with _do_extraction_on_chunk;
- C. the same aggregation, but re-implementing a few metrics in a vectorized fashion and applying them to the xarray directly.
The results are available in the following notebooks,
- tsfresh_benchmark.ipynb, run on a small sample_dataset.csv.gz (0.2 M rows, 10 MB CSV uncompressed)
- tsfresh_benchmark_large.ipynb (second notebook in the gist; 14 M rows, 700 MB CSV uncompressed); I can provide the data if necessary.
The conclusion appears to be that feature extraction on 2D arrays would be 10-30x faster. Parallelization is disabled, and these benchmarks were run on a 2-CPU-core laptop for the following features: {'abs_energy': None, 'absolute_sum_of_changes': None}. So even the default multi-threading in BLAS for numpy operations would not explain this difference.
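To illustrate the kind of difference being measured, here is a minimal sketch (not the benchmark code itself; data and shapes are made up) comparing a per-series loop with a single vectorized 2D computation for the two benchmarked features:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))  # rows are uids, columns are time steps

# Per-series computation, as when each 1D chunk is handled separately
loop_energy = np.array([np.dot(x, x) for x in X])

# The same abs_energy computed once over the whole 2D array
vec_energy = np.sum(X * X, axis=1)

# absolute_sum_of_changes, also vectorized along the time axis
vec_changes = np.sum(np.abs(np.diff(X, axis=1)), axis=1)

assert np.allclose(loop_energy, vec_energy)
```

The per-series loop pays Python-level overhead once per uid, while the 2D version pays it once per dataset.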
In the end I'm trying to run feature extraction from tsfresh on ~5 GB of data (i.e. 10x larger than the larger benchmark) and in my rough estimation this would take half a day for all features. Of course dask parallelization can help, but I'm wondering if there is a computationally more efficient way of doing it before parallelization (that would also work with xarray).
Am I missing something? What do you think? I imagine in the use case for which this package was developed pandas.DataFrame made more sense than e.g. xarray to store time series data? Thank you.
Hi @rth,
thanks for the issue and the benchmarks, it's always good to push the boundaries. From my point of view, you showed in your benchmark that it is possible to speed up the feature calculation for two very specific features.
I am curious to see if the calculation of the other features can be sped up similarly. I have not yet worked with xarray.
When performing your benchmark, you should keep in mind that apart from the pure calculation of features, tsfresh also contains logic around the extraction that eats up runtime. So, it is natural that tsfresh will show a longer runtime.
What I am wondering about is whether computing features from 2D arrays (with axes ('uid', 'time')) would be computationally more efficient
this implies that all time series in the dataset have the same length, right? tsfresh supports datasets where the time series have different lengths.
I imagine in the use case for which this package was developed pandas.DataFrame made more sense than e.g. xarray to store time series data? Thank you.
there is no pressing need to use pandas dataframes as the time series container; we developed the package that way because we were used to working with pandas dataframes.
Thanks for your response @MaxBenChrist
From my point of view, you showed in your benchmark that it is possible to speed up the feature calculation for two very specific features. I am curious to see if the calculation of the other features can be sped up similarly.
Well, those two were somewhat randomly chosen. The main difference is between doing an operation on a 2D array vs. a for loop over 1D arrays. I imagine in some cases the former would be faster due to better vectorization in BLAS as used by numpy. Though, currently that wouldn't directly work for feature calculators that require statsmodels, as the corresponding functions only support 1D arrays.
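As a sketch of that limitation (the `feature_1d` function below is a hypothetical stand-in, not an actual statsmodels call): a 1D-only calculator can still be mapped over the rows of a 2D array, but `np.apply_along_axis` loops in Python, so the BLAS-level speedup is lost:

```python
import numpy as np

def feature_1d(x):
    # stand-in for a calculator that only accepts 1D input
    return np.abs(np.diff(x)).sum()

X = np.arange(12, dtype=float).reshape(3, 4)

# Mapping the 1D-only function row by row: still a Python-level loop
per_row = np.apply_along_axis(feature_1d, 1, X)

# The vectorized equivalent works on the whole 2D array at once
vectorized = np.abs(np.diff(X, axis=1)).sum(axis=1)

assert np.allclose(per_row, vectorized)
```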
I have not yet worked with xarrays.
Well, xarray was used mostly for convenience (to keep labels associated with array dimensions); the same could be done with plain numpy arrays.
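For reference, the conversion from the long (columnar) format to such a labelled 2D array can be done with a pivot; the column names here are made up for illustration:

```python
import pandas as pd
import xarray as xr

# Long-format input, as tsfresh consumes it (column names are hypothetical)
df = pd.DataFrame({
    "uid":   [1, 1, 2, 2],
    "time":  [0, 1, 0, 1],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Pivot to a 2D ('uid', 'time') array; xarray keeps the axis labels
da = xr.DataArray(df.pivot(index="uid", columns="time", values="value"))
assert da.dims == ("uid", "time")
```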
tsfresh also contains logic around the extraction that eat up runtime. So, it is natural that tsfresh will show a longer runtime
Yes, and the idea could be to do the corresponding validation etc. once per dataset instead of once per 1D array. But I have not looked into how this would interact with e.g. feature selection in your pipeline.
this implies that all time series in the dataset have the same length, right?
Not necessarily. It implies that all time series are sampled on the same time grid. Ideally it would also be uniformly sampled (tsfresh also makes this assumption in a few places). To handle series of different length one would need to handle missing values (e.g. via NaNs) in the feature extraction metrics.
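As a sketch of that NaN-padding approach (made-up data, not tsfresh code), a NaN-aware reduction recovers the per-series feature on a ragged dataset:

```python
import numpy as np

# Two series of different lengths, padded with NaN to a common time grid
a = np.array([1.0, 2.0, 4.0])
b = np.array([3.0, 1.0])
n = max(len(a), len(b))
X = np.full((2, n), np.nan)
X[0, :len(a)] = a
X[1, :len(b)] = b

# NaN-aware abs_energy along the time axis
energy = np.nansum(X * X, axis=1)
# matches the per-series values: [21.0, 10.0]
```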
Anyway, these are just some thoughts; I'm not sure this would make sense in the general use case of tsfresh.
The main difference is between doing an operation on a 2D array vs a for loop on 1D arrays
FWIW, that's also more similar to the time series representation in tslearn...
Thanks for the detailed answer @rth
Not necessarily. It implies that all time series are sampled on the same time grid. Ideally it would also be uniformly sampled (tsfresh also makes this assumption in a few places). To handle series of different length one would need to handle missing values (e.g. via NaNs) in the feature extraction metrics.
I am curious about the overhead and overall runtime costs if each feature calculator is checking for NaNs.
Well, those two were somewhat randomly chosen. The main difference is between doing an operation on a 2D array vs. a for loop over 1D arrays. I imagine in some cases the former would be faster due to better vectorization in BLAS as used by numpy. Though, currently that wouldn't directly work for feature calculators that require statsmodels, as the corresponding functions only support 1D arrays.
So, in essence, we need to identify features that benefit from being calculated under BLAS on two-dimensional arrays.
Are you interested in working on this?
I am curious about the overhead and overall runtime costs if each feature calculator is checking for NaNs.
Yes, I imagine it could be significant, which would defeat the purpose a bit.
So, in essence, we need to identify features that benefit from being calculated under BLAS on two-dimensional arrays.
True. But I also realised that in my use case some features were taking much more time than others (even those not marked with high_comp_cost=True). In particular, change_quantiles was taking more time than all the other features combined.
I tried to re-run these benchmarks on a smaller subset of the data here (it's the 3rd notebook in the gist), but I cannot reproduce it. I guess it might be data-size or dataset dependent.
Are you interested in working on this?
Unfortunately, I won't have the bandwidth to work on this anymore, but I hope these benchmarks can help whoever does.
@rth have you spent any time on implementing the feature extractors for xarray? I think you are onto something here.
No, I haven't looked into this since 2018.
Just if anyone is interested (slightly related): I posted about the scaling behavior of tsfresh here: https://nils-braun.github.io/execution-time/