tsfresh Efficient rolling window of time series (e.g. while using temporary results)


Open xgdgsc opened this issue 7 years ago • 13 comments

I want to extract features from a rolling window over a table whose columns are several time series, and then make predictions based on the time series in that window. As far as I understand the docs, I currently have to extract the time series and tile them as in the example. Because the windows overlap, this creates a lot of duplicate data and doesn't seem memory efficient. Is there a rolling window API, or a better way to do this?

Thanks!

Nov 30 '16 14:11 xgdgsc

We have not yet implemented a rolling window API to do that efficiently.

To implement such an API, one would have to decide for every feature calculator whether it can reuse the result of the last window for the current window. For some features this is easy (maximum, mean, ...), but for others it is not trivially possible (median, wavelet coefficients, ...).
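A minimal sketch of what "reusing the result of the last window" can look like for the two easy cases named above, mean and maximum. This is plain Python and not part of tsfresh; the function name `rolling_mean_and_max` and the sample values are made up for illustration. The mean is updated from a running sum, and the maximum from a deque of candidate indices, so neither recomputes the whole window.

```python
from collections import deque


def rolling_mean_and_max(values, window):
    """Yield (mean, max) for every full window, updating state incrementally."""
    running_sum = 0.0
    # Indices of candidate maxima; their values are strictly decreasing,
    # so the front of the deque is always the maximum of the current window.
    candidates = deque()
    for i, x in enumerate(values):
        running_sum += x
        # Drop candidates that can never be the maximum again.
        while candidates and values[candidates[-1]] <= x:
            candidates.pop()
        candidates.append(i)
        if i >= window:
            # Remove the value that just left the window from the sum.
            running_sum -= values[i - window]
        if candidates[0] <= i - window:
            # The old maximum has left the window.
            candidates.popleft()
        if i >= window - 1:
            yield running_sum / window, values[candidates[0]]


series = [11, 9, 67, 45, 3, 8]
incremental = list(rolling_mean_and_max(series, 3))

# Naive reference: recompute each window from scratch.
naive = [
    (sum(series[i:i + 3]) / 3, max(series[i:i + 3]))
    for i in range(len(series) - 2)
]
```

Both lists agree for this input. A median has no comparably simple update rule, which is why it falls into the harder group of features.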

Nov 30 '16 19:11 MaxBenChrist

This is pretty critical for all the problems I'm working on as well.

Dec 02 '16 03:12 ClimbsRocks

For most use cases that involve forecasting time series, this can reduce the time needed to calculate the features.

But as stated above, for a class of features it is mathematically impossible to use auxiliary results from the last window.

Dec 02 '16 14:12 MaxBenChrist

If one of you wants to implement this, I would be glad to help with the design decisions. I will probably not have time for this during the next few months.

Dec 02 '16 14:12 MaxBenChrist

Maybe we should provide a wrapper for the translation of timestamp, value combinations into rolling window time series.

It is pretty straightforward to implement with http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html applied to a groupby. (http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.core.groupby.DataFrameGroupBy.shift.html)

Afterwards one has to drop NaNs. I already did this a few times. I have to check if I can dig up some snippets.

Mar 16 '17 12:03 MaxBenChrist

Let's tackle this. The following code will do the transformation:

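import numpy as np
import pandas as pd

# Toy data: two ids (10, 500) and two kinds ("a", "b"), four time steps each.
cid = np.repeat([10, 500] * 2, 4)
csort = [1, 2, 3, 4] * 4
cval = [11, 9, 67, 45] * 4
ckind = np.repeat(["a", "b"], 8)
df = pd.DataFrame({"id": cid, "sort": csort, "val": cval, "kind": ckind})

# Number of shifts = length of the longest series. (The original used
# max(cid), which gives the same result but loops over many empty shifts.)
n = df.groupby(["id", "kind"]).size().max()
lst_df = []

for i in range(n):
    # Shift each (id, kind) series up by i rows; the tail becomes NaN.
    df_temp = df.groupby(["id", "kind"]).shift(-i)
    # Encode the shift in the id so every sub-series gets its own id.
    df_temp["id"] = "id=" + df.id.map(str) + ", shift={}".format(i)
    df_temp["kind"] = df.kind
    # Drop the rows that were shifted past the end of their series.
    df_temp.dropna(inplace=True)
    lst_df.append(df_temp)

df_ready = pd.concat(lst_df).reset_index()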

Mar 16 '17 12:03 MaxBenChrist

Before

In [21]: df
Out[21]:
     id kind  sort  val
0    10    a     1   11
1    10    a     2    9
2    10    a     3   67
3    10    a     4   45
4   500    a     1   11
5   500    a     2    9
6   500    a     3   67
7   500    a     4   45
8    10    b     1   11
9    10    b     2    9
10   10    b     3   67
11   10    b     4   45
12  500    b     1   11
13  500    b     2    9
14  500    b     3   67
15  500    b     4   45

and afterwards

    index  sort   val               id kind
0       0   1.0  11.0   id=10, shift=0    a
1       1   2.0   9.0   id=10, shift=0    a
2       2   3.0  67.0   id=10, shift=0    a
3       3   4.0  45.0   id=10, shift=0    a
4       4   1.0  11.0  id=500, shift=0    a
5       5   2.0   9.0  id=500, shift=0    a
6       6   3.0  67.0  id=500, shift=0    a
7       7   4.0  45.0  id=500, shift=0    a
8       8   1.0  11.0   id=10, shift=0    b
9       9   2.0   9.0   id=10, shift=0    b
10     10   3.0  67.0   id=10, shift=0    b
11     11   4.0  45.0   id=10, shift=0    b
12     12   1.0  11.0  id=500, shift=0    b
13     13   2.0   9.0  id=500, shift=0    b
14     14   3.0  67.0  id=500, shift=0    b
15     15   4.0  45.0  id=500, shift=0    b
16      0   2.0   9.0   id=10, shift=1    a
17      1   3.0  67.0   id=10, shift=1    a
18      2   4.0  45.0   id=10, shift=1    a
19      4   2.0   9.0  id=500, shift=1    a
20      5   3.0  67.0  id=500, shift=1    a
21      6   4.0  45.0  id=500, shift=1    a
22      8   2.0   9.0   id=10, shift=1    b
23      9   3.0  67.0   id=10, shift=1    b
24     10   4.0  45.0   id=10, shift=1    b
25     12   2.0   9.0  id=500, shift=1    b
26     13   3.0  67.0  id=500, shift=1    b
27     14   4.0  45.0  id=500, shift=1    b
28      0   3.0  67.0   id=10, shift=2    a
29      1   4.0  45.0   id=10, shift=2    a
30      4   3.0  67.0  id=500, shift=2    a
31      5   4.0  45.0  id=500, shift=2    a
32      8   3.0  67.0   id=10, shift=2    b
33      9   4.0  45.0   id=10, shift=2    b
34     12   3.0  67.0  id=500, shift=2    b
35     13   4.0  45.0  id=500, shift=2    b
36      0   4.0  45.0   id=10, shift=3    a
37      4   4.0  45.0  id=500, shift=3    a
38      8   4.0  45.0   id=10, shift=3    b
39     12   4.0  45.0  id=500, shift=3    b
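To make the result easier to read: each derived id ("id=..., shift=i") is the part of the original series that starts i steps later, and it can be treated as an independent series from here on. The sketch below rebuilds a single-series version of the example and computes a few per-series aggregates with plain pandas. It only illustrates the data layout; the names `rolled` and `features` are made up, and tsfresh's own feature extraction is not used here.

```python
import pandas as pd

# One id and one kind, with the same four values as in the example above.
df = pd.DataFrame({
    "id": [10] * 4,
    "kind": ["a"] * 4,
    "sort": [1, 2, 3, 4],
    "val": [11, 9, 67, 45],
})

# Build the shifted sub-series the same way as the snippet above.
n = df.groupby(["id", "kind"]).size().max()
parts = []
for i in range(n):
    part = df.groupby(["id", "kind"]).shift(-i)
    part["id"] = "id=" + df["id"].map(str) + ", shift={}".format(i)
    part["kind"] = df["kind"]
    parts.append(part.dropna())
rolled = pd.concat(parts).reset_index(drop=True)

# Every derived id is now its own series, so features are a plain group-by.
features = rolled.groupby("id")["val"].agg(["max", "mean", "count"])
```

Here "id=10, shift=0" holds all four values, while "id=10, shift=3" holds only the last one, so the sub-series get shorter as the shift grows.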

Mar 16 '17 12:03 MaxBenChrist

I don't have the time to write unit tests and think about where to put this. @jneuff, @moritzgelb or @nils-braun, can one of you add the snippet to a PR?

Mar 16 '17 12:03 MaxBenChrist

I can tackle this tomorrow or on Saturday :-) If someone is faster, no problem

Mar 16 '17 12:03 nils-braun

OK, I have started a branch and am working on this. It still needs some documentation, but it will be ready to go this weekend!

Mar 18 '17 11:03 nils-braun

This should now be possible in the HEAD version :-) Still needs an example notebook, but you can already read it here: http://tsfresh.readthedocs.io/en/latest/text/rolling.html

Apr 01 '17 21:04 nils-braun

I'll leave this issue open, as we may implement a more efficient solution later.

Apr 01 '17 21:04 nils-braun

This is awesome! Thanks for the great work on this, team. You've now allowed me to use tsfresh with an entire new class of projects.

Apr 04 '17 15:04 ClimbsRocks
