featuretools

Generate features on multi time series

Open jacktang opened this issue 2 years ago • 4 comments

Hello,

I am evaluating some frameworks for automated feature engineering, and I have read some of the featuretools notebooks. I am really attracted by the EntitySet concept. Here are some questions about featuretools and time series datasets. Say there are three datasets, df_A, df_B and df_C, each with a column named datetime. df_A contains one record per hour, df_B contains 4 records per day, and df_C contains 2-6 records per day. My questions are:

  1. How can I set the relationships between those datasets? By date?
  2. Does featuretools support generating features from multiple time series?

Finally, thanks for the great framework!

jacktang, Aug 10 '22 13:08

Hi @jacktang,

Would you be able to clarify how you see the relationships between these different dataframes as well as the multiple time series in the problem you're trying to solve? A few scenarios I could imagine:

  • There's a one-to-many relationship between the dataframes you presented. For example, each observation in the dataframe with 4 records per day would correspond to 6 records in the hourly dataframe. There might be another variable that connects the dataframes (say a user id) that could be used as a foreign key between them, or you would have to come up with your own grouping.
  • Each of the dataframes is a completely different time series for modeling, and you want to connect them with a fourth parent dataframe that groups the information on a larger time span (daily, for example).

Depending on which option makes sense for you, you would set up the EntitySet differently and use Primitives differently, but it might be helpful to look at our guides on how to handle time in Featuretools and on what time series functionality exists in Featuretools.

tamargrey, Aug 10 '22 14:08

Hi @tamargrey,

The problem scenario is wind turbine power generation analysis. For each wind turbine, I have df_wind_speed, df_power_generated and df_weather_forecast dataframes. All of them contain a datetime column, but the values are not exactly aligned and the sampling frequencies are not the same.

I've read the resources you mentioned, especially the time series notebook, which is very useful for a single time series dataset. But I did not find a way to set the relationships among these dataframes; the only thing they have in common is the date. So I am going to extract one dataframe named df_date, which contains only a date column, add a date column to df_wind_speed, df_power_generated and df_weather_forecast, and set the relationships on date. Is this the right way to handle multiple time series with featuretools? Or should I interpolate the values to get exactly matching datetimes and connect them that way?

jacktang, Aug 11 '22 03:08

@jacktang Thanks for explaining. I think having a df_date dataframe makes a lot of sense, and what you're describing should let you leverage Featuretools with multiple time series.

One limitation you may run into is that you'd have to make date your index, and then you could not also set it as the time index. To avoid that, I might suggest that df_date contain two columns, date (your time index) and id (your index), and that the child dataframes each get a date_id column that serves as the foreign key.

So if an id of 1 covered the time span from January 1st 2022 to January 2nd 2022, all rows in the child dataframes with datetimes between 1-1-22 and 1-2-22 would get a date_id of 1. That way you could run dfs with df_date as your target dataframe and get aggregations across the various date_ids, perform groupbys, and so on.
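One possible way to assign those date_id values with pandas is sketched below. It assumes each id covers exactly one calendar day; the helper name `add_date_id` and the sample readings are hypothetical and not part of Featuretools.

```python
import pandas as pd

# Made-up irregular readings for one child dataframe.
df_wind_speed = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 13:30", "2022-01-02 06:15",
    ]),
    "wind_speed": [5.1, 6.3, 4.8],
})

# Parent dataframe: one row per calendar day, ids starting at 1.
days = pd.date_range("2022-01-01", "2022-01-02", freq="D")
df_date = pd.DataFrame({"id": range(1, len(days) + 1), "date": days})


def add_date_id(child, df_date, time_col="datetime"):
    """Return a copy of `child` with a `date_id` foreign key column.

    Each timestamp is truncated to midnight and looked up in df_date.
    """
    date_to_id = df_date.set_index("date")["id"]
    out = child.copy()
    out["date_id"] = out[time_col].dt.normalize().map(date_to_id)
    return out


df_wind_speed = add_date_id(df_wind_speed, df_date)
print(df_wind_speed)
```

The same helper could be applied to df_power_generated and df_weather_forecast, since it only relies on the datetime column. Timestamps falling on a day missing from df_date would get a missing date_id, so df_date should span the full range of the data.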

tamargrey, Aug 11 '22 13:08

@tamargrey Thanks for the suggestions! I will try it next week.

jacktang, Aug 12 '22 01:08