
Gain more control over the application of where_primitives

Open nitinmnsn opened this issue 3 years ago • 2 comments

where_primitives cannot be created or controlled independently of agg_primitives.

Use case: Say I have a spend column. I create a seed_feature that buckets the spend column, and I add those buckets as its interesting_values. I might want the feature min(spend) over the whole column, but within the bucket [10000, 15000] I might not want min(spend where spend_bucket == 10000_15000). How can I control which primitives are applied only when a where clause is in effect?
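To make the use case concrete, here is a minimal plain-Python sketch of the two features in question. It makes no featuretools calls. The spend values and the helper `spend_bucket` are made up for illustration; only the bucket label `10000_15000` comes from the description above.

```python
# Illustrative data only; these spend values are made up.
spend = [4000, 12000, 14500, 9000, 11000]


def spend_bucket(value):
    """Hypothetical stand-in for the seed feature: label the [10000, 15000] bucket."""
    return "10000_15000" if 10000 <= value <= 15000 else "other"


# Aggregation over the whole column: min(spend). This feature is wanted.
min_spend = min(spend)

# The same aggregation under a where clause:
# min(spend where spend_bucket == 10000_15000). This feature is not wanted.
min_spend_in_bucket = min(v for v in spend if spend_bucket(v) == "10000_15000")

print(min_spend, min_spend_in_bucket)
```

The request is for a way to keep the first feature while suppressing the second, without dropping the primitive entirely.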

Feature Request Description

The source of the problem is that where_primitives are tightly coupled with agg_primitives.

  1. where_primitives that are not also specified under agg_primitives don't get used and result in warnings.warn(warning_msg, UnusedPrimitiveWarning).
  2. Primitive application can only be controlled at the level of the primitive name, so any attempt to control a primitive applies to both where_primitives and agg_primitives.
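The first point can be pictured with a toy model. This is not featuretools' implementation: the function `usable_where_primitives`, the warning text, and the local `UnusedPrimitiveWarning` class are hypothetical stand-ins that only mimic the behavior described above.

```python
import warnings


class UnusedPrimitiveWarning(UserWarning):
    """Local stand-in for the warning class named in this issue."""


def usable_where_primitives(agg_primitives, where_primitives):
    """Toy model of the coupling: a where primitive is applied only if it is
    also listed as an agg primitive; the rest are reported as unused."""
    used = [p for p in where_primitives if p in agg_primitives]
    unused = [p for p in where_primitives if p not in agg_primitives]
    if unused:
        warnings.warn(f"where primitives not used: {unused}", UnusedPrimitiveWarning)
    return used


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # "count" is in both lists, "min" is only a where primitive.
    result = usable_where_primitives(["count", "sum"], ["count", "min"])

print(result, [str(w.message) for w in caught])
```

In this model, "min" cannot be used under a where clause unless it is also requested as a plain aggregation, which is the coupling the request asks to remove.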

Currently, primitive application can be controlled via the ignore_entities, ignore_variables and primitive_options arguments. ignore_entities and ignore_variables control primitive application for the entire DFS run, so they cannot restrict a primitive only when a where clause is in effect. primitive_options provides more granular control through its include_entities, ignore_entities, include_variables, ignore_variables, include_groupby_entities, ignore_groupby_entities, include_groupby_variables and ignore_groupby_variables keys. But none of these options makes it possible to control the application of a primitive only when the where clause is in effect.
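As a sketch of the limitation, the options dictionary below is keyed by primitive name. The entity name "transactions" and the variable name "spend" are assumptions for illustration, and the nested shape of `ignore_variables` (entity name mapped to a list of variable names) follows my reading of the primitive options guide.

```python
# Entity name "transactions" and variable name "spend" are assumptions
# made for this illustration; they do not come from the issue.
primitive_options = {
    "min": {"ignore_variables": {"transactions": ["spend"]}},
}

# Intended use (not executed here):
# ft.dfs(..., agg_primitives=["min"], where_primitives=["min"],
#        primitive_options=primitive_options)
print(primitive_options)
```

Because the only key is the name "min", the restriction cannot distinguish min(spend) from min(spend where ...): it affects both or neither.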

Expected Output

  1. where_primitives should not have to be a subset of agg_primitives
  2. Have a separate where_primitive_options argument

Output of featuretools.show_info()

Featuretools version: 0.25.0
Featuretools installation directory: /home/nitin/miniconda3/envs/featuretools/lib/python3.9/site-packages/featuretools

SYSTEM INFO

python: 3.9.4.final.0
python-bits: 64
OS: Linux
OS-release: 5.4.0-74-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IN
LOCALE: en_IN.ISO8859-1

INSTALLED VERSIONS

numpy: 1.20.3
pandas: 1.2.4
tqdm: 4.61.1
PyYAML: 5.4.1
cloudpickle: 1.6.0
dask: 2021.6.0
distributed: 2021.6.0
psutil: 5.8.0
pip: 21.1.2
setuptools: 49.6.0.post20210108

nitinmnsn avatar Jun 30 '21 06:06 nitinmnsn

Thanks for the request @nitinmnsn, this would be good to have!

I think removing the requirement that where_primitives be a subset of agg_primitives is good. Instead of adding a where_primitive_options parameter, I think we could update primitive_options to have some where-specific filters. When the same primitive is used for both agg and where, users could avoid overlapping options by instantiating the primitive to differentiate the two. Would that be sufficient?

Example:

where_mode = ft.primitives.Mode()
ft.dfs(...,
       agg_primitives=['mode'],
       where_primitives=[where_mode],
       primitive_options={'mode': ...,
                          where_mode: ...},
       ...)
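The idea of telling the two apart by key can be sketched in plain Python. Everything here is hypothetical: the `Mode` class is a local stand-in rather than `ft.primitives.Mode`, `lookup_options` is an invented helper, and the entity names are made up. It only shows that a string key and an instance key can coexist in one options dictionary.

```python
class Mode:
    """Local stand-in for a primitive class; not featuretools' Mode."""
    name = "mode"


def lookup_options(primitive, primitive_options):
    """Hypothetical lookup: return the options stored under this exact key
    (a string name or a primitive instance), or an empty dict if absent."""
    return primitive_options.get(primitive, {})


where_mode = Mode()

# Entity names "customers" and "sessions" are made up for illustration.
primitive_options = {
    "mode": {"ignore_entities": ["customers"]},      # options for the agg primitive
    where_mode: {"ignore_entities": ["sessions"]},   # options for the where primitive
}

print(lookup_options("mode", primitive_options))
print(lookup_options(where_mode, primitive_options))
```

Since the instance hashes by identity, its entry never collides with the string entry, so each use of the primitive can carry its own options.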

rwedge avatar Jul 01 '21 19:07 rwedge

Yes, that would be perfect!

  1. Removing the requirement that where_primitives be a subset of agg_primitives would allow us to work with just the string names of the primitives, even in the primitive_options dictionary.
  2. When the same primitive is used in both agg_primitives and where_primitives, the distinction can be made as you suggest: "Users could avoid overlap between agg and where primitive options that share the same primitive by instantiating the primitive to differentiate them."

nitinmnsn avatar Jul 01 '21 21:07 nitinmnsn