featuretools
Include relationship variables in Feature.get_dependencies
If we have a relationship `log.session_id -> sessions.id` and a feature `sessions: MEAN(log.value)`, there is an implicit dependency on the features `sessions: id` and `log: session_id`, because to calculate the aggregation we need to be able to join using them.
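To make the implicit dependency concrete, here is a minimal plain-Python sketch of computing `MEAN(log.value)` for each session. The data is made up for illustration and this is not how featuretools computes it internally; the point is only that both `log.session_id` and `sessions.id` are read even though neither appears in the feature's name.

```python
from collections import defaultdict
from statistics import mean

# Toy data (illustrative only): log.session_id -> sessions.id
log = [
    {"id": 0, "session_id": 10, "value": 1.0},
    {"id": 1, "session_id": 10, "value": 3.0},
    {"id": 2, "session_id": 11, "value": 5.0},
    {"id": 3, "session_id": 11, "value": 7.0},
]
sessions = [{"id": 10}, {"id": 11}, {"id": 12}]

# Group the child values by the relationship variable log.session_id.
values_by_session = defaultdict(list)
for row in log:
    values_by_session[row["session_id"]].append(row["value"])

# Join the aggregated values onto the parent using sessions.id.
feature_matrix = {}
for session in sessions:
    values = values_by_session.get(session["id"])
    feature_matrix[session["id"]] = mean(values) if values else None
```

Dropping either `session_id` from `log` or `id` from `sessions` makes the aggregation impossible, which is why they behave like dependencies.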
In `FeatureSetCalculator` we currently keep all ID columns in intermediate feature matrices because we don't know whether any features depend on them:
https://github.com/Featuretools/featuretools/blob/1bc5f971af86fb309f87d7f0bb7db0e6fe29d604/featuretools/computational_backends/feature_set_calculator.py#L630-L636
If `get_dependencies` included the variables which are needed for joining (as `IdentityFeature`s), then these features would be included in the `feature_trie` and we could keep only the columns which are actually necessary.
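A rough sketch of the idea, using toy classes rather than the real featuretools ones: `Identity`, `Mean`, the `include_join_keys` flag, and `needed_columns` are all hypothetical names invented for this illustration. Once the join keys are reported as dependencies, the set of columns to keep can be derived from the features instead of retaining every ID column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """Toy stand-in for an identity feature on one variable of an entity."""
    entity: str
    variable: str


@dataclass(frozen=True)
class Mean:
    """Toy stand-in for an aggregation such as sessions: MEAN(log.value)."""
    parent_entity: str
    child_entity: str
    child_variable: str
    parent_key: str
    child_key: str

    def get_dependencies(self, include_join_keys=False):
        # The explicit dependency: the variable being aggregated.
        deps = [Identity(self.child_entity, self.child_variable)]
        if include_join_keys:
            # The implicit dependencies: both sides of the relationship.
            deps.append(Identity(self.parent_entity, self.parent_key))
            deps.append(Identity(self.child_entity, self.child_key))
        return deps


def needed_columns(features, entity):
    """Hypothetical helper: variables of `entity` that some feature depends on."""
    return {
        dep.variable
        for feature in features
        for dep in feature.get_dependencies(include_join_keys=True)
        if dep.entity == entity
    }


feature = Mean("sessions", "log", "value", "id", "session_id")
log_columns = {"id", "session_id", "device_id", "value"}

# Keep only the columns some feature actually needs.
kept = log_columns & needed_columns([feature], "log")
```

In this toy setup `kept` contains only `session_id` and `value`; the unrelated ID column `device_id` is no longer carried along.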
This is only tangential, but we could also get rid of the first element of the 3-tuples stored in `feature_trie` and simplify its construction somewhat. Since every node would now have at least one feature, if the first of the two sets is empty then the full entity is not needed. I.e., this would now always be true:
    needs_full_entity, full_entity_features, _ = node.value
    assert needs_full_entity == (len(full_entity_features) > 0)
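One possible shape for the simplified node value, as a sketch only: `NodeValue` and the field name `other_features` are hypothetical and not taken from the featuretools code. The boolean is no longer stored and is derived from the first set instead.

```python
from typing import FrozenSet, NamedTuple


class NodeValue(NamedTuple):
    """Hypothetical 2-tuple replacing the current 3-tuple node value."""
    full_entity_features: FrozenSet[str]
    other_features: FrozenSet[str]

    @property
    def needs_full_entity(self) -> bool:
        # Derived rather than stored: true exactly when the first set is non-empty.
        return len(self.full_entity_features) > 0


with_full = NodeValue(frozenset({"MEAN(log.value)"}), frozenset())
without_full = NodeValue(frozenset(), frozenset({"session_id"}))
```

Unpacking still works (`full_entity_features, other_features = node_value`), so call sites that ignore the boolean would only need to drop one name.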