Include relationship variables in Feature.get_dependencies

Open CJStadler opened this issue 5 years ago • 0 comments

If we have a relationship log.session_id -> sessions.id and a feature sessions: MEAN(log.value), there is an implicit dependency on the features sessions: id and log: session_id, because calculating the aggregation requires joining on them.

In FeatureSetCalculator we currently keep all ID columns in intermediate feature matrices because we don't know whether there are any features which depend on them: https://github.com/Featuretools/featuretools/blob/1bc5f971af86fb309f87d7f0bb7db0e6fe29d604/featuretools/computational_backends/feature_set_calculator.py#L630-L636

If get_dependencies included the variables needed for joining (as IdentityFeatures), then these features would be included in the feature_trie, and we could keep only the columns that are actually necessary.
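
To make the proposal concrete, here is a rough sketch using toy stand-in classes rather than the real Featuretools ones. `Relationship`, `IdentityFeature`, `AggregationFeature`, the `include_join_keys` flag, and `needed_columns` are all hypothetical names invented for illustration; only the idea (join keys reported as identity-feature dependencies) comes from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass


# Toy stand-ins, NOT the Featuretools classes.
@dataclass(frozen=True)
class Relationship:
    child_entity: str
    child_variable: str
    parent_entity: str
    parent_variable: str


@dataclass(frozen=True)
class IdentityFeature:
    entity: str
    variable: str

    def get_dependencies(self):
        # A raw column depends on nothing else.
        return []


@dataclass(frozen=True)
class AggregationFeature:
    entity: str  # parent entity the feature is defined on
    primitive: str
    base: IdentityFeature
    relationship: Relationship

    def get_dependencies(self, include_join_keys=True):
        deps = [self.base]
        if include_join_keys:
            # Hypothetical behaviour: also report both sides of the join.
            r = self.relationship
            deps.append(IdentityFeature(r.child_entity, r.child_variable))
            deps.append(IdentityFeature(r.parent_entity, r.parent_variable))
        return deps


def needed_columns(features):
    """Collect, per entity, the raw columns reachable through dependencies."""
    columns = defaultdict(set)
    stack = list(features)
    while stack:
        f = stack.pop()
        if isinstance(f, IdentityFeature):
            columns[f.entity].add(f.variable)
        stack.extend(f.get_dependencies())
    return dict(columns)


rel = Relationship("log", "session_id", "sessions", "id")
mean_value = AggregationFeature(
    "sessions", "MEAN", IdentityFeature("log", "value"), rel
)
deps = mean_value.get_dependencies()
```

With the join keys reported this way, `needed_columns([mean_value])` contains only `value` and `session_id` for `log` and `id` for `sessions`, so any other ID column could be dropped from the intermediate matrices.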

This is only tangential, but we could also get rid of the first element of the 3-tuples stored in feature_trie and simplify its construction somewhat. Since every node would now have at least one feature, if the first of the two sets is empty then the full entity is not needed. I.e. this would now always be true:

needs_full_entity, full_entity_features, _ = node.value
assert needs_full_entity == (len(full_entity_features) > 0)
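
One possible shape for the simplified node value, as a sketch only: store the two sets and derive the flag instead of storing it. `NodeValue` and the field name `not_full_entity_features` are assumptions made for this example (the snippet above leaves the third element unnamed), and this is not the existing feature_trie code.

```python
from typing import NamedTuple, Set


class NodeValue(NamedTuple):
    # Hypothetical 2-tuple replacing the current 3-tuple.
    full_entity_features: Set[str]
    not_full_entity_features: Set[str]  # assumed name for the second set

    @property
    def needs_full_entity(self) -> bool:
        # Derived rather than stored, so it cannot drift out of sync.
        return len(self.full_entity_features) > 0


with_full = NodeValue({"MEAN(log.value)"}, set())
without_full = NodeValue(set(), {"id"})
```

Here `with_full.needs_full_entity` is true and `without_full.needs_full_entity` is false, which is exactly the invariant asserted above.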

CJStadler, Jul 22 '19 19:07