
allow where_primitives to function independently of agg_primitives

Open nitinmnsn opened this issue 3 years ago • 4 comments

I am using the official customer churn prediction example from here.

For quick experimentation, I added a cell between cell 19 and cell 20 to subset cutoff_times so that it includes only two msno values (IDs), like so:

cutoff_times_=cutoff_times.iloc[[33,34,21,22],:].reset_index(drop=True)

cutoff_times_ = cutoff_times_.rename(columns={'cutoff_time':'time'})

Then, in cell 20, I notice that no where-clause features are generated for the primitives in set(where_primitives) - set(agg_primitives). I also get warnings.warn(warning_msg, UnusedPrimitiveWarning) for every primitive that is in the where_primitives list but not in the agg_primitives list.
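The set difference described above can be checked without running DFS. The following is a minimal sketch in plain Python; the helper name unused_where_primitives is hypothetical (it is not part of Featuretools), and it assumes primitives are passed as string names. It is applied to the same agg_primitives / where_primitives pairs used in the four calls in this report.

```python
def unused_where_primitives(agg_primitives, where_primitives):
    """Return the where primitives that are missing from agg_primitives.

    Hypothetical helper for illustration only; not a Featuretools API.
    Order follows where_primitives.
    """
    agg = set(agg_primitives)
    return [p for p in where_primitives if p not in agg]


# The (agg_primitives, where_primitives) pairs from the four DFS calls in this issue.
examples = [
    ([], ["max"]),
    (["sum"], ["max", "min"]),
    (["sum", "min"], ["max", "min"]),
    (["sum", "min", "max"], ["max", "min", "sum", "std"]),
]

for agg, where in examples:
    print(agg, where, "->", unused_where_primitives(agg, where))
```

For these four pairs the helper returns ['max'], ['max', 'min'], ['max'] and ['std'], which are the same primitives named in the UnusedPrimitiveWarning messages in this report.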

Here are a few examples (I set max_depth to 10 to rule out insufficient depth as the cause):

feature_defs,_ = ft.dfs(entityset=es, target_entity='members',
                      agg_primitives = [],
                      trans_primitives = ['month'],
                        cutoff_time_in_index = True,
                      cutoff_time = cutoff_times_,
                      where_primitives = ['max'],
                      max_depth=10, features_only=False)

output: 
/home/nitin/miniconda3/envs/featuretools/lib/python3.9/site-packages/featuretools/synthesis/dfs.py:307: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  where_primitives: ['max']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)

feature_defs,_ = ft.dfs(entityset=es, target_entity='members',
                      agg_primitives = ['sum'],
                      trans_primitives = ['month'],
                        cutoff_time_in_index = True,
                      cutoff_time = cutoff_times_,
                      where_primitives = ['max','min'],
                      max_depth=10, features_only=False)
output:
/home/nitin/miniconda3/envs/featuretools/lib/python3.9/site-packages/featuretools/synthesis/dfs.py:307: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  where_primitives: ['max', 'min']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)

feature_defs,_ = ft.dfs(entityset=es, target_entity='members',
                      agg_primitives = ['sum','min'],
                      trans_primitives = ['month'],
                        cutoff_time_in_index = True,
                      cutoff_time = cutoff_times_,
                      where_primitives = ['max','min'],
                      max_depth=10, features_only=False)

output:
/home/nitin/miniconda3/envs/featuretools/lib/python3.9/site-packages/featuretools/synthesis/dfs.py:307: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  where_primitives: ['max']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)

feature_defs,_ = ft.dfs(entityset=es, target_entity='members',
                      agg_primitives = ['sum','min','max'],
                      trans_primitives = ['month'],
                        cutoff_time_in_index = True,
                      cutoff_time = cutoff_times_,
                      where_primitives = ['max','min','sum','std'],
                      max_depth=10, features_only=False)
output:
/home/nitin/miniconda3/envs/featuretools/lib/python3.9/site-packages/featuretools/synthesis/dfs.py:307: UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  where_primitives: ['std']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)

Featuretools version: 0.25.0
Featuretools installation directory: /home/nitin/miniconda3/envs/featuretools/lib/python3.9/site-packages/featuretools

SYSTEM INFO

python: 3.9.4.final.0
python-bits: 64
OS: Linux
OS-release: 5.4.0-74-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IN
LOCALE: en_IN.ISO8859-1

INSTALLED VERSIONS

numpy: 1.20.3
pandas: 1.2.4
tqdm: 4.61.1
PyYAML: 5.4.1
cloudpickle: 1.6.0
dask: 2021.6.0
distributed: 2021.6.0
psutil: 5.8.0
pip: 21.1.2
setuptools: 49.6.0.post20210108

nitinmnsn avatar Jun 30 '21 06:06 nitinmnsn

Thanks for the question! To avoid the warning, every primitive in where_primitives must also be included in agg_primitives. Interesting values must also be set, as is done in cell 17 of the notebook example. In cell 20, the agg_primitives parameter is not set, so the default set of aggregation primitives is applied during DFS. All the where primitives in that DFS call are included in that default set. For reference, here is a quick reproducible example of the warning.

import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)
es['products']['brand'].interesting_values = ['A']

fm, fd = ft.dfs(
    entityset=es,
    target_entity='sessions',
    agg_primitives=[],
    trans_primitives=['month'],
    where_primitives=['max'],
)
UnusedPrimitiveWarning: Some specified primitives were not used during DFS:
  where_primitives: ['max']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.
  warnings.warn(warning_msg, UnusedPrimitiveWarning)

When you remove agg_primitives from the DFS call, the default set of aggregation primitives is applied. The where primitive used here ('max') is part of that default set, so the warning no longer appears.

fm, fd = ft.dfs(
    entityset=es,
    target_entity='sessions',
    trans_primitives=['month'],
    where_primitives=['max'],
)

The full list of default aggregation primitives is given in the docstring for featuretools.dfs:

agg_primitives (list[str or AggregationPrimitive], optional): List of Aggregation
    Feature types to apply.

        Default: ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"]
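As an illustration of the rule above, here is a rough sketch of a helper that builds an agg_primitives list guaranteed to contain every where primitive. The helper name with_where_primitives is hypothetical and not part of Featuretools. It assumes primitives are given as string names, and the default list is copied from the docstring quoted above for Featuretools 0.25.0, so it may differ in other versions.

```python
# Default list as quoted from the featuretools.dfs docstring (version 0.25.0).
DEFAULT_AGG_PRIMITIVES = [
    "sum", "std", "max", "skew", "min", "mean",
    "count", "percent_true", "num_unique", "mode",
]


def with_where_primitives(agg_primitives, where_primitives):
    """Return agg_primitives extended with any missing where primitives.

    Hypothetical helper for illustration only; not a Featuretools API.
    None means "use the defaults"; an empty list means "no aggregations".
    """
    if agg_primitives is None:
        result = list(DEFAULT_AGG_PRIMITIVES)
    else:
        result = list(agg_primitives)
    for primitive in where_primitives:
        if primitive not in result:
            result.append(primitive)
    return result


print(with_where_primitives([], ["max"]))
print(with_where_primitives(None, ["max"]))
```

The returned list could then be passed as agg_primitives to ft.dfs. Note the trade-off: each added primitive is then also applied as an ordinary aggregation without a where clause, so this only avoids the warning and does not restrict a primitive to where-clause features.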

Let me know if this helps.

jeff-hernandez avatar Jun 30 '21 22:06 jeff-hernandez

Thank you for your help, Jeff.

I figured as much: where_primitives needs to be a subset of agg_primitives.

Do you also think that this limits the freedom with which featuretools can be used? I use featuretools extensively (everyone involved in creating featuretools has my respect, gratitude and love. You guys rock!). I come across scenarios where I want to apply a primitive only together with a certain where clause so often that I think it would be useful to have this additional dimension of control over how primitives are applied.

nitinmnsn avatar Jun 30 '21 23:06 nitinmnsn

Thanks for clarifying! I think that would be a great request. Is this related to #1513?

jeff-hernandez avatar Jul 01 '21 19:07 jeff-hernandez

Yes. That is right. The two issues are:

  1. No way to control the where_primitives application #1513
  2. where_primitives needs to be a subset of agg_primitives #1514

Hence I opened separate issues.

nitinmnsn avatar Jul 01 '21 20:07 nitinmnsn