
[Feature Creation] Decision tree creates a new feature by combining numerous variables

Open Morgan-Sell opened this issue 2 years ago • 16 comments

Closes #107

Notes from #107:

New variables are created by combining user-indicated variables with decision trees. Example: if the user passes 3 variables to the transformer, a new feature will be created by fitting a decision tree on these three variables and the target.

To think about: Should we make the transformer so that it combines variables in groups of 2s and 3s, etc? Say the user passes 5 variables, should we create features combining all possible groups of 2s, all possible groups of 3s, all possible groups of 4s and all 5?

Need to think a bit. I know that we do combine a few variables with trees to create new ones, particularly for use in linear models. But I have not seen this brute-force combining of everything with everything, just for the sake of it, in organisations where models will be used to score customers. So maybe not ideal. It also increases computational cost, which is not in the spirit of feature-engine.

Morgan-Sell avatar May 12 '22 00:05 Morgan-Sell

hello @solegalli,

A few questions:

  • Did we make a final decision on whether the class should create new features from all the possible permutations of the user-selected variables? I guess we could create an all_permutations init param. Although, I am not convinced of the param's value. I can't think of a use case; however, my experience is limited.
  • Should the class allow the user to choose from all of the sklearn decision-tree init params - e.g. max_depth and min_samples_leaf - to prevent overfitting? Or do we want to limit the user to 1 or 2 params?
  • I plan to limit the variables to numerical, unless the categorical variables are encoded. Do you agree?
  • Should the class apply to both regression and classification?

Morgan-Sell avatar May 12 '22 01:05 Morgan-Sell

An idea would be:

These parameters in the init:

  • variables = variable list (as always)
  • output_features = None, integer, list of integers, or tuple

So if I pass three variables in the list: [var1, var2, var3] and:

  • 1 in the output_features: return new features based on the predictions of a tree fitted on each variable individually; 3 new features in this example
  • 2: make all possible combinations of 2 variables: (var1, var2), (var1, var3), (var2, var3); 3 new features in this example
  • 3: make all possible combinations of 3 variables: in this case only 1 possible combination, (var1, var2, var3); 1 new feature in this example
  • 4 or greater: raise an error, since the group size exceeds the number of variables in the list

If I pass a list, say [1, 2], then we return the output of 1 and 2 as above. If None, then return all possible 1s, 2s, and 3s in this case; if the list contained more variables, it would also include 4s, 5s, and so on.

Alternatively, the user can pass a tuple of tuples, e.g. (var1, (var1, var2), (var1, var2, var3)), indicating exactly how to combine the variables.
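
To make that concrete, here is a quick sketch of how output_features could expand into variable groups. The helper name and exact validation are only illustrative, not the final API:

```python
from itertools import combinations

def _get_variable_combinations(variables, output_features):
    """Return the tuples of variables to combine with a decision tree (sketch only)."""
    if output_features is None:
        sizes = range(1, len(variables) + 1)   # all 1s, 2s, ..., up to len(variables)
    elif isinstance(output_features, int):
        sizes = [output_features]               # a single combination size
    elif isinstance(output_features, list):
        sizes = output_features                 # e.g. [1, 2] -> all 1s and all 2s
    else:
        # tuple of tuples: user-defined groups, wrap single variables in a 1-tuple
        return [combo if isinstance(combo, tuple) else (combo,) for combo in output_features]

    if max(sizes) > len(variables):
        raise ValueError("output_features cannot exceed the number of variables.")

    return [combo for size in sizes for combo in combinations(variables, size)]

# Example: three variables and output_features=2 -> the three pairs described above.
print(_get_variable_combinations(["var1", "var2", "var3"], 2))
# [('var1', 'var2'), ('var1', 'var3'), ('var2', 'var3')]
```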

solegalli avatar May 12 '22 08:05 solegalli

hello @solegalli,

I hope you're enjoying the vacay!

When/why would a person apply the decision-tree transformer to one variable?

Morgan-Sell avatar May 15 '22 00:05 Morgan-Sell

hello @solegalli,

The transformer is generating new variables. I have created a few unit tests. I've written some of the docstrings. Before I progress, would you please review/counsel me? We both know I need it ;)

A few questions:

  • Which decision tree params should we include to mitigate the risk of overfitting? Currently, the class only accepts max_depth.
  • I was surprised that the BaseCreation class does not create self.variables_. Typically, we use the function _find_or_check_numerical_variables() to create/return self.variables_. It seems redundant to call this function again in the new transformer, given that it is already called in the BaseCreation class.
  • Is the error "ValueError: variables must a list of strings or integers comprise of distinct variables. Got None instead" caused by not having self.variables_? Shouldn't this attribute be created/inherited from the BaseCreation class?
  • Do you see a more efficient approach to saving the fitted estimators and generating the new features?

Lastly, I included a couple of TODO comments.

Thanks!

Morgan-Sell avatar May 18 '22 23:05 Morgan-Sell

hi @solegalli,

how are you?! I guess you're busy writing your new book! Do you have any thoughts on my earlier comments?

Hugs!

Morgan-Sell avatar Jun 04 '22 16:06 Morgan-Sell

No worries, @solegalli! Bills are known to impact one's priorities ;)

I apologize for the lack of notes in the init()! My typos didn't help!

The user has more optionality with the output_features parameter than normal init params. I want to ensure we put them on the right path early in their feature-engine adventure. Consequently, I created all of those checks.

I'm still working on incorporating all of your comments. Also, I will simplify the code by merging self.variables_combination_indices_ and self._fitted_estimators_ into a list of tuples.
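
For reference, here is a rough sketch of what that merged structure could look like. The helper and attribute names are placeholders, and a classifier would replace the regressor for classification targets:

```python
from sklearn.tree import DecisionTreeRegressor

def _fit_trees(X, y, variable_combinations, max_depth=3):
    """Fit one tree per combination and keep (combination, estimator) together."""
    estimators_ = []
    for combo in variable_combinations:
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X[list(combo)], y)
        estimators_.append((combo, tree))  # one tuple holds both pieces of information
    return estimators_

# transform() could then iterate once over this list to add the new columns:
# for combo, tree in self.estimators_:
#     X_new[f"tree({', '.join(combo)})"] = tree.predict(X[list(combo)])
```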

Morgan-Sell avatar Jun 09 '22 01:06 Morgan-Sell

> When/why would a person apply the decision-tree transformer to one variable?

Just catching up with old questions: if the variable has a non-linear relationship with a target, transforming that variable with a decision tree would, at least, create a monotonic relationship between the transformed variable and the target. Useful in cases where you want to stick with linear models. This is a good point to add to the user guide. Thanks for asking!
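
For the user guide, a toy illustration of that single-variable case (just a sketch, not feature-engine code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(500, 1))
y = x.ravel() ** 2 + rng.normal(scale=0.5, size=500)  # non-linear (quadratic) relationship

tree = DecisionTreeRegressor(max_depth=3).fit(x, y)
x_tree = tree.predict(x)  # the new feature: the mean target within each leaf

# Because the prediction is the mean of y per leaf, higher values of the new
# feature correspond to higher average target values: a monotonic relationship
# that a linear model can exploit, unlike the original U-shaped variable.
```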

solegalli avatar Jun 09 '22 06:06 solegalli

> The user has more optionality with the output_features parameter than normal init params.

Great! Thank you so much! give me a shout when it's done :)

solegalli avatar Jun 09 '22 06:06 solegalli

hi @solegalli,

In one of your comments, you said that decision trees can handle categorical variables. I thought sklearn's decision tree can handle categorical variables only if they are one-hot encoded; consequently, all variables must be numerical. If so, should the class find only numerical variables?

See sklearn docs. Excerpt: "Able to handle both numerical and categorical data. However scikit-learn implementation does not support categorical variables for now."

Either way, I think we should find the variables (either numerical or numerical/categorical) in the init(). This will allow the class to check whether the user provided the correct values for output_features when the param is a tuple and the user provides variable names. Otherwise, the code ends up comparing output_features against None.

Morgan-Sell avatar Jun 11 '22 00:06 Morgan-Sell

I think this is just too complicated. We offer a lot of functionality to encode categorical variables. It is very easy for the user to add a categorical encoder ahead of using this transformer. And they can encode the variables however they like. So I would not go down this route.

I would stick with sklearn functionality. They only support numerical variables.
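
Sketching the intended workflow; the transformer's import path and init below are assumptions based on this thread, not a released API:

```python
from sklearn.pipeline import Pipeline
from feature_engine.encoding import OrdinalEncoder
from feature_engine.creation import DecisionTreeFeatures  # assumed location of the new class

pipe = Pipeline([
    # encode the categorical variables however the user prefers
    ("encoder", OrdinalEncoder(encoding_method="ordered")),
    # then combine the (now numerical) variables with decision trees
    ("tree_features", DecisionTreeFeatures(variables=["var1", "var2", "var3"],
                                           output_features=2)),
])
# pipe.fit(X_train, y_train); X_new = pipe.transform(X_test)
```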

solegalli avatar Jun 12 '22 06:06 solegalli

hello @solegalli,

I think I'm heading down a rabbit hole. Before I do, I would like to hear your thoughts.

The init() performs a few checks to confirm that output_features is compatible with variables. E.g., Is the size of the variable combination compatible with the number of variables provided in the init()?

When variables is not None, the code is working. However, when variables is None, output_features and variables cannot be compared, because _find_or_check_numerical_variables has not yet been executed.

Should we implement the output_features checks in fit() so the code only performs the checks once, i.e., after self.variables_ is created? If so, would the best approach be to create a _check_output_features_is_permitted() method that performs all necessary checks?

Morgan-Sell avatar Jun 12 '22 20:06 Morgan-Sell

Hi @Morgan-Sell

I made a PR: https://github.com/Morgan-Sell/feature_engine/pull/10

with some changes. I think I've answered your question with the changes in the logic. If in doubt, simpler is better ;)

solegalli avatar Jun 29 '22 12:06 solegalli

Actually, I don't think I've done a great job, now that I am reading your question more carefully.

Let's move all the tests for output_features to the fit method.

Maybe create a hidden method called _validate_strategy() that contains all relevant checks, and then run _validate_strategy() in fit, after _find_or_check_numerical_variables().
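
For concreteness, a minimal sketch of that ordering; the variable-resolution step below stands in for _find_or_check_numerical_variables, and _validate_strategy is hypothetical until implemented:

```python
class DecisionTreeFeatures:
    """Sketch only: shows the order of operations in fit(), not the real class."""

    def __init__(self, variables=None, output_features=None, max_depth=3):
        self.variables = variables
        self.output_features = output_features
        self.max_depth = max_depth

    def fit(self, X, y):
        # 1) resolve the variables, whether the user passed a list or None
        self.variables_ = (
            self.variables
            if self.variables is not None
            else X.select_dtypes("number").columns.tolist()
        )
        # 2) only now can output_features be validated against the resolved variables
        self._validate_strategy()
        # 3) fit one decision tree per variable combination (omitted here)
        return self

    def _validate_strategy(self):
        if isinstance(self.output_features, int) and self.output_features > len(self.variables_):
            raise ValueError("output_features cannot be larger than the number of variables.")
```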

Good catch, well done @Morgan-Sell

Thank you very much.

solegalli avatar Jun 29 '22 12:06 solegalli

Hi @Morgan-Sell

I've seen you made a lot of commits. Is this work in progress? Do you still need to update the tests? They are all failing :_(

I am on holidays from Thursday till August. So if you don't hear from me... you know why ;)

Cheers

solegalli avatar Jul 05 '22 12:07 solegalli

hi @solegalli.

Ahh.... a month-long vacation! Hopefully, the US will adopt such traditions one day ;)

I'm still working on this class. I do have one question.

The following test is failing:

FAILED tests/test_creation/test_check_estimator_creation.py::test_check_estimator_from_sklearn[estimator6] - ValueError: No numerical variables found in this dataframe. Please check variable format with pandas dtypes.

Do you know if sklearn's check_estimator tests a dataframe without numerical variables? The check_estimator docs have limited information.

If so, then DecisionTreeFeatures should raise an error. Other feature-engine classes skip certain sklearn checks, which makes sense given that not all of sklearn's checks are appropriate for every class. How do we choose which tests to omit?
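
One mechanism sklearn itself offers is marking specific estimator checks as expected failures via the _xfail_checks tag in _more_tags(); I'm not sure whether feature-engine skips checks this way or through its own test parametrisation, and the check name below is only a placeholder:

```python
class DecisionTreeFeatures:
    ...

    def _more_tags(self):
        # mark individual sklearn estimator checks as expected failures;
        # the check name here is just an example, not the one actually failing
        return {
            "_xfail_checks": {
                "check_fit2d_1feature": (
                    "the transformer requires at least one numerical variable"
                ),
            }
        }
```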

Morgan-Sell avatar Jul 06 '22 23:07 Morgan-Sell

Hi @solegalli,

I'm embarrassed to say this, but I'm stumped by these errors. Hopefully, we can discuss the errors when you return.

Morgan-Sell avatar Jul 08 '22 00:07 Morgan-Sell