featuretools
Generate stacked transform features within a single table if max_depth is greater than 1
- As a user, I wish I could use Featuretools to generate stacked transform features within a single table. Currently, DFS will not generate any stacked transform features within a single table, regardless of the setting of the max_depth parameter. Allowing features to be stacked within a table when max_depth is greater than one could be useful in some situations.
One potential use case would be to generate features that capture interactions between transform features. For example, to determine whether a given datetime falls during lunch time on a weekend, one could perform a boolean multiply on the features generated by the primitives IsLunchTime and IsWeekend. Currently, DFS generates only the boolean features from IsLunchTime and IsWeekend; it does not automatically generate the stacked boolean multiplication feature.
Users can manually define these types of stacked transform features, but it could be beneficial for DFS to handle this.
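To make the use case concrete, here is a minimal plain-Python sketch of the stacked feature. It does not use the Featuretools API: the functions is_lunch_time, is_weekend, and is_weekend_lunch are hypothetical stand-ins for the primitives, and the lunch hour of 12 is an assumption made only for this example.

```python
from datetime import datetime


# Hypothetical stand-in for the IsLunchTime primitive.
# The lunch hour of 12 is an assumption for this sketch.
def is_lunch_time(ts: datetime, lunch_hour: int = 12) -> bool:
    return ts.hour == lunch_hour


# Hypothetical stand-in for the IsWeekend primitive.
def is_weekend(ts: datetime) -> bool:
    # Monday is 0, so Saturday and Sunday are 5 and 6.
    return ts.weekday() >= 5


def is_weekend_lunch(ts: datetime) -> bool:
    # The stacked feature: a boolean multiply (logical AND) of the two
    # depth-1 transform features above.
    return is_lunch_time(ts) and is_weekend(ts)


timestamps = [
    datetime(2023, 1, 7, 12, 30),  # Saturday, lunch time
    datetime(2023, 1, 7, 18, 0),   # Saturday, evening
    datetime(2023, 1, 9, 12, 15),  # Monday, lunch time
]
stacked = [is_weekend_lunch(ts) for ts in timestamps]
print(stacked)
```

Only the first timestamp satisfies both conditions, which is the interaction that neither depth-1 feature captures on its own.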
One thing of note is that you can apply primitives in different orders and get different sets of features, and I think that can be really apparent with transform stacking. For example, if you had no boolean columns but included the primitives IsNull and And, you would not get any use of And if it gets applied before IsNull. And I think we made changes at some point to sort the inputted primitives, so that if users put in the same primitives in different orders, they don't get different sets of features.
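The ordering effect can be reproduced with a small toy model. This is only a sketch and not how DFS is implemented: the PRIMITIVES table, the single_pass function, and the type names are all hypothetical simplifications introduced for illustration.

```python
from itertools import combinations

# Toy model: each primitive takes `arity` inputs of one type and
# produces one output type. Hypothetical simplification.
PRIMITIVES = {
    "IsNull": {"input": "numeric", "arity": 1, "output": "boolean"},
    "And": {"input": "boolean", "arity": 2, "output": "boolean"},
}


def single_pass(columns, primitive_order):
    """Apply each primitive once, in the given order, to everything
    generated so far. Returns the names of the new features."""
    features = dict(columns)  # feature name -> type
    for prim in primitive_order:
        spec = PRIMITIVES[prim]
        candidates = sorted(
            name for name, ftype in features.items() if ftype == spec["input"]
        )
        new = {}
        for combo in combinations(candidates, spec["arity"]):
            new[f"{prim}({', '.join(combo)})"] = spec["output"]
        features.update(new)
    return sorted(set(features) - set(columns))


columns = {"a": "numeric", "b": "numeric"}  # no boolean columns

and_first = single_pass(columns, ["And", "IsNull"])
isnull_first = single_pass(columns, ["IsNull", "And"])
print(and_first)
print(isnull_first)
```

In this toy, applying And first finds no boolean inputs and produces nothing, while applying IsNull first lets And combine the two new boolean features. Sorting the primitive names before the pass makes the result independent of the order the user supplied, although in this toy an alphabetical sort happens to pick the order in which And goes unused.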
Another thing to worry about with transform stacking is that you can stack transform primitives indefinitely, and that can lead you down a really long hole of unhelpful features getting stacked upon each other. But I think that kind of exists with agg primitives too, so it's probably not a huge issue.
My thought is that this would work in "passes", with the number of passes being equal to the max_depth setting. The first pass would generate the features that we currently generate. Then, if max_depth is set to more than 1, we would make another pass and generate additional features based on the features from the first pass, continuing this process until max_depth is reached, at which point we would stop.
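A rough sketch of the pass-based idea, using a toy single-table model rather than the real DFS code. Everything here is hypothetical: the PRIMITIVES table, the generate_features function, and the rule that each new feature must use at least one input from the previous pass (an assumption added so that a pass only produces genuinely deeper features).

```python
from itertools import combinations

# Toy primitives: `arity` inputs of one type, one output type.
# Hypothetical simplification of real primitive signatures.
PRIMITIVES = {
    "IsNull": {"input": "numeric", "arity": 1, "output": "boolean"},
    "And": {"input": "boolean", "arity": 2, "output": "boolean"},
}


def generate_features(columns, max_depth):
    """Run one pass per depth level; each pass builds only on features
    that existed when the pass started."""
    features = dict(columns)              # feature name -> type
    depth_of = {name: 0 for name in columns}
    for depth in range(1, max_depth + 1):
        new = {}
        for prim in sorted(PRIMITIVES):   # sorted for a deterministic result
            spec = PRIMITIVES[prim]
            candidates = sorted(
                n for n, t in features.items() if t == spec["input"]
            )
            for combo in combinations(candidates, spec["arity"]):
                # Assumption: require an input from the previous pass,
                # so this pass only adds features of exactly this depth.
                if max(depth_of[n] for n in combo) != depth - 1:
                    continue
                new[f"{prim}({', '.join(combo)})"] = spec["output"]
        if not new:
            break                         # nothing left to stack
        for name, ftype in new.items():
            features[name] = ftype
            depth_of[name] = depth
    return sorted(set(features) - set(columns))


columns = {"a": "numeric", "b": "numeric"}
for d in (1, 2, 3, 4):
    print(d, len(generate_features(columns, d)))
```

Even with two columns and two primitives, the feature count in this toy keeps climbing with each extra pass, which is the growth concern raised in this thread.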
I think the number of features generated could be a real concern here though, and we would need to think through what this looks like in a multi-table setup. I'm also sure I'm over-simplifying this in my mind.