featuretools
Gain more control over the application of where_primitives
Cannot create or control the application of where_primitives independently of agg_primitives.
Use case: Say I have a column named spend. I create a seed_feature that buckets the spend column, and I add those buckets as its interesting_values. I might want to create the feature min(spend) over the whole column. But within the bucket [10000,15000], I might not want to create min(spend where spend_bucket == 10000_15000). How can I get this kind of control, where the application of a primitive is restricted only when a where clause is in effect?
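To make the use case concrete, here is a minimal plain-Python sketch (no featuretools involved) of the two quantities being discussed. The sample values, the bucket width, the half-open bucket convention, and the helper name spend_bucket are all assumptions for illustration only.

```python
# Illustrative spend values; not taken from any real dataset.
spend = [2500, 9000, 10500, 12000, 14999, 18000]


def spend_bucket(value, width=5000):
    """Label a value with its half-open bucket [lo, hi), e.g. '10000_15000'."""
    lo = (value // width) * width
    return f"{lo}_{lo + width}"


# Feature that is wanted: min(spend) over the whole column.
min_spend = min(spend)

# Feature that is NOT wanted: min(spend where spend_bucket == 10000_15000).
min_spend_in_bucket = min(v for v in spend if spend_bucket(v) == "10000_15000")
```

Within a single bucket the conditional minimum can never fall below the bucket's lower edge, which is one plausible reason such a feature may not be worth generating.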
Feature Request Description
The source of the problem is that where_primitives are tightly coupled with agg_primitives.
- where_primitives that are not also specified under agg_primitives don't get used, and hence result in warnings.warn(warning_msg, UnusedPrimitiveWarning)
- You can only control primitive application at the level of the primitive name, so any attempt to control the application of a primitive applies to both where_primitives and agg_primitives
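The first point can be pictured with a small sketch. This is not featuretools' actual implementation: the function unused_where_primitives and the stand-in warning class are hypothetical, and they only mimic the described behaviour that a where primitive missing from agg_primitives goes unused.

```python
import warnings


class UnusedPrimitiveWarning(UserWarning):
    """Stand-in for the featuretools warning of the same name."""


def unused_where_primitives(agg_primitives, where_primitives):
    """Return where primitives that would go unused under the described coupling."""
    unused = sorted(set(where_primitives) - set(agg_primitives))
    if unused:
        warnings.warn(f"Unused where primitives: {unused}", UnusedPrimitiveWarning)
    return unused
```

For example, calling it with agg_primitives=["min"] and where_primitives=["min", "count"] would flag "count", since it has no matching aggregation primitive.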
Currently, primitive application can be controlled via the ignore_entities, ignore_variables, and primitive_options arguments.
ignore_entities and ignore_variables control primitive application for the entire DFS run, and thus cannot restrict a primitive only when a where clause is in effect.
primitive_options provides more granular control over primitive application through its include_entities, ignore_entities, include_variables, ignore_variables, include_groupby_entities, ignore_groupby_entities, include_groupby_variables, and ignore_groupby_variables keys. But none of these options makes it possible to control the application of a primitive only when the where clause is in effect.
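As a sketch of why the existing keys fall short, the dictionary below uses ignore_variables to stop min from being applied to spend. The entity name transactions is a hypothetical assumption, and the dictionary shape is my reading of the 0.25-era primitive_options format rather than something stated in this thread.

```python
# Hypothetical entity name "transactions"; "spend" is the column from the use case.
primitive_options = {
    "min": {
        # Maps entity id -> variables the primitive must not be applied to.
        "ignore_variables": {"transactions": ["spend"]},
    },
}

# This would be passed as ft.dfs(..., primitive_options=primitive_options).
# The key is only the primitive name, so there is no place to say
# "ignore spend for min only when a where clause is in effect".
```

Because the option is keyed by the primitive name alone, it would presumably remove min(spend) everywhere, not just the where-conditioned variant.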
Expected Output
- where_primitives should not necessarily be a subset of agg_primitives
- Have a separate where_primitive_options argument
Output of featuretools.show_info()
Featuretools version: 0.25.0
Featuretools installation directory: /home/nitin/miniconda3/envs/featuretools/lib/python3.9/site-packages/featuretools
SYSTEM INFO
python: 3.9.4.final.0
python-bits: 64
OS: Linux
OS-release: 5.4.0-74-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IN
LOCALE: en_IN.ISO8859-1
INSTALLED VERSIONS
numpy: 1.20.3
pandas: 1.2.4
tqdm: 4.61.1
PyYAML: 5.4.1
cloudpickle: 1.6.0
dask: 2021.6.0
distributed: 2021.6.0
psutil: 5.8.0
pip: 21.1.2
setuptools: 49.6.0.post20210108
Thanks for the request @nitinmnsn, this would be good to have!
I think removing the requirement that where_primitives needs to be a subset of agg_primitives is good.
I think instead of adding a where_primitive_options parameter we could update primitive_options to have some where-specific filters. Users could avoid overlap between agg and where primitive options that share the same primitive by instantiating the primitive to differentiate them. Would that be sufficient?
Example:
where_mode = ft.primitives.Mode()
ft.dfs(...
agg_primitives=['mode'],
where_primitives=[where_mode],
primitive_options={'mode': ....,
where_mode: ...}
...)
Yes, that would be perfect!
- Removing the requirement that where_primitives needs to be a subset of agg_primitives would allow us to operate at just the string name of the primitives, even in the primitive_options dictionary
- When the same primitive is used in both agg_primitives and where_primitives, the distinction can be made as you suggest: "Users could avoid overlap between agg and where primitive options that share the same primitive by instantiating the primitive to differentiate them"
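The direction agreed on above can be sketched in plain Python. Nothing here is featuretools code: the Primitive stand-in class, the options_for helper, the entity and variable names, and the lookup order (exact instance first, then string name) are all hypothetical assumptions about how such a scheme might behave.

```python
class Primitive:
    """Minimal stand-in for a primitive; instances hash by identity."""

    def __init__(self, name):
        self.name = name


def options_for(primitive, primitive_options):
    """Prefer options keyed by this exact instance, else fall back to its name."""
    if primitive in primitive_options:
        return primitive_options[primitive]
    return primitive_options.get(primitive.name, {})


agg_mode = Primitive("mode")
where_mode = Primitive("mode")  # separate instance, used only for where clauses

primitive_options = {
    "mode": {"ignore_variables": {"transactions": ["customer_id"]}},
    where_mode: {"ignore_variables": {"transactions": ["spend"]}},
}

agg_opts = options_for(agg_mode, primitive_options)      # resolved via the name "mode"
where_opts = options_for(where_mode, primitive_options)  # resolved via the instance
```

Keying on the instance keeps a single primitive_options argument while still letting the aggregation and where uses of the same primitive carry different filters.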