featuretools
Generate features on multi time series
Hello,
I am evaluating several frameworks for automated feature engineering and have read some of the Featuretools notebooks. I am really attracted by the EntitySet concept, and I have some questions about using Featuretools with time series datasets. Say there are three datasets, df_A, df_B and df_C, each with a column named datetime. df_A contains one record per hour, df_B contains 4 records per day and df_C contains 2-6 records per day. My questions are:
- How can I set up the relationships between those datasets? By date?
- Does Featuretools support generating features from multiple time series?
Finally, thanks for the great framework!
Hi @jacktang,
Would you be able to clarify how you see the relationships between these different dataframes as well as the multiple time series in the problem you're trying to solve? A few scenarios I could imagine:
- There's a one-to-many relationship between the dataframes you presented. For example, each record in the dataframe with 4 records per day would correspond to 6 records in the hourly one. There might be another variable that connects the dataframes (say a user id) that could be used as a foreign key between them, or you would have to come up with your own grouping.
- Each of the dataframes is a completely different time series for modeling, and you want to connect them with a fourth parent dataframe that groups the information over a larger time span (daily, for example).
Depending on which option makes sense for you, you would set up the EntitySet differently and use primitives differently, but it might be helpful to look at our guides on how to handle time in Featuretools and on what time series functionality exists in Featuretools.
Hi @tamargrey,
The problem scenario is wind turbine power generation analysis. For each wind turbine, I have df_wind_speed, df_power_generated and df_weather_forecast dataframes. All of them contain a datetime column, but the timestamps are not exactly aligned and the sampling frequencies are not the same.
I've read the resources you mentioned, especially the time series notebook, which is very useful for a single time series dataset. However, I did not find a way to set relationships among these dataframes; the only thing they have in common is the date. So I am going to extract one dataframe named df_date, which contains only a date column, add a date column to df_wind_speed, df_power_generated and df_weather_forecast, and set the relationships on date. Is this the right way to handle multiple time series with Featuretools? Or should I interpolate the values to get exactly matching datetimes and connect them that way?
@jacktang Thanks for explaining. I think having a df_date dataframe makes a lot of sense, and what you're describing should let you leverage Featuretools with multiple time series.
One limitation you may run into is that you'd have to make date your index and would not also be able to set it as a time index. To avoid that, I'd suggest that df_date contain two columns, date (your time index) and id (your index), and that the child dataframes all get a date_id column that serves as the foreign key.
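A minimal pandas sketch of that layout, assuming one df_date row per calendar day. The sample values and the speed_ms and power_kw column names are made up for illustration; only df_date, id, date, date_id and datetime come from the discussion above.

```python
import pandas as pd

# Hypothetical sample data; real frames would come from the turbine logs.
df_wind_speed = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2022-01-01 00:00", "2022-01-01 01:00", "2022-01-02 00:00"]
    ),
    "speed_ms": [5.0, 7.0, 6.5],
})
df_power_generated = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2022-01-01 03:10", "2022-01-02 09:40", "2022-01-02 15:40"]
    ),
    "power_kw": [120.0, 150.0, 90.0],
})
children = [df_wind_speed, df_power_generated]

# Parent dataframe: one row per calendar day seen in any child.
all_days = pd.concat([c["datetime"].dt.normalize() for c in children])
days = pd.DatetimeIndex(all_days.unique()).sort_values()
df_date = pd.DataFrame({"id": range(1, len(days) + 1), "date": days})

# Foreign key: map each child's timestamp to the id of the day it falls on.
day_to_id = pd.Series(df_date["id"].to_numpy(), index=df_date["date"])
for child in children:
    child["date_id"] = child["datetime"].dt.normalize().map(day_to_id)
```

Bucketing by calendar day is only one possible choice; any other span (for example 6-hour windows) would work the same way as long as every child row maps to exactly one id.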
So if an id of 1 covered the time span from January 1st 2022 to January 2nd 2022, all dates in the child dataframes between 1-1-22 and 1-2-22 would get a date_id of 1. That way you could run dfs with df_date as your target dataframe, get aggregations across the various date_ids, and perform groupbys and whatnot.
@tamargrey Thanks for the suggestions! I will try it next week.