feature_engine
Information value for feature selection using weight of evidence (WoE)
Information value is a variable-selection method used with binary classifiers. Information value summarizes how much knowing a variable, e.g. var_A, helps in predicting the dependent variable.
feature-engine includes a WoE Encoder. The InformationValue class will leverage the WoEEncoder class.
Closes #280
Notes from #280:
https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
https://www.listendata.com/2019/08/WOE-IV-Continuous-Dependent.html
http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/
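From those references, the WoE of a category is ln(%events / %non-events), and IV sums (%events − %non-events) × WoE over categories. A minimal pandas sketch of that formula (the function name and data are invented here, not feature-engine's API):

```python
import numpy as np
import pandas as pd

def information_value(X: pd.Series, y: pd.Series) -> float:
    """IV = sum over categories of (%events - %non-events) * WoE."""
    df = pd.DataFrame({"cat": X, "target": y})
    grouped = df.groupby("cat")["target"]
    pos = grouped.sum()            # events per category
    neg = grouped.count() - pos    # non-events per category
    p_pos = pos / pos.sum()        # share of all events
    p_neg = neg / neg.sum()        # share of all non-events
    woe = np.log(p_pos / p_neg)    # note: undefined if a category has 0 events
    return float(((p_pos - p_neg) * woe).sum())

X = pd.Series(["a", "a", "b", "b", "b", "c", "c", "c"])
y = pd.Series([1, 0, 1, 0, 0, 1, 0, 0])
iv = information_value(X, y)  # ≈ 0.0924
```

Note the caveat in the comment: a category with zero events or zero non-events makes the log blow up, which real implementations have to guard against.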
hola @solegalli,
I have a couple of questions about syntax related to `self` and the use of internal methods:
- If a method returns a variable that will be assigned to a class attribute, should the syntax be kept as such? Or is it better to create the attribute inside the method and end the method with `return self`? See the comparison below.
```python
# Option A: _method() returns a value and the caller assigns the attribute
self.attribute = self._method(data)

# Option B: _method() assigns the attribute itself and returns self
def _method(self, data):
    # code
    # code
    self.attribute = temp_variable
    return self
```

Which is the preferred syntax? Option A or B?
- Should `self` always precede a method when the method is called within another method? Imagine we have a method called `_validation_method()` that accepts variable `X`, and `_validation_method()` is called in the `fit()` method. Should `_validation_method()` always be called using the syntax `self._validation_method(X)`? Or can we call the method using `_validation_method(X)`, i.e., sans `self`?
- Does using `self` give the method access to the class's attributes?
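A minimal sketch answering both questions with a toy class (all names invented here): a helper must be called as `self._method(...)` because it lives on the instance, and `self` is exactly what gives it access to the instance's attributes. Both Option A and Option B are valid Python; scikit-learn-style `fit()` methods conventionally set attributes and `return self`.

```python
class Demo:
    def fit(self, data):
        # Option A: helper returns a value, the caller assigns the attribute.
        self.attribute_a_ = self._compute(data)
        # Option B: helper sets the attribute itself and returns self.
        self._compute_and_store(data)
        return self  # returning self from fit() enables method chaining

    def _compute(self, data):
        return sum(data)

    def _compute_and_store(self, data):
        # 'self' gives this method access to (and lets it create) attributes.
        self.attribute_b_ = max(data)
        return self

demo = Demo().fit([1, 2, 3])
# demo.attribute_a_ == 6, demo.attribute_b_ == 3
```

Calling bare `_compute(data)` inside `fit()` would raise a `NameError`, because the name only exists as an attribute of the class, not as a local or global variable.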
hi @solegalli,
I believe this class is now ready for your virtuoso programming skills ;)
I'm confused as to why the tests for CountFrequencyEncoder and OrdinalEncoder are failing. Once you're settled, it would be great to dive into the details of some of these master/meta unit tests.
Hi @Morgan-Sell
FYI: https://github.com/Morgan-Sell/feature_engine/pull/12
Amazing, @solegalli! How do you see these improvements? Show me the way, Sra Miagi! ;)
I'll work on the unit tests.
One question: how is `self.performance_drifts_` derived? It's not from any of the 3 parent classes. Is it from a grandparent class?
I don't think we need `performance_drifts_` here. Do we?
I think this is still pending.
ToDo:
- [ ] remove this file from pr
- [ ] tests
- [ ] docs: index, api, user_guide
Hi @solegalli,
I'll work on the steps outlined above.
Questions/responses to the above comments:
- I don't think this class requires `performance_drifts_`, as this attribute is used with shuffled features; that's why I was a bit confused. If so, do we need to set a default value for `threshold`?
- Is drop_psi_features.py the file that needs to be removed from the PR?
- Yes, if I understand correctly we select features with IV beyond a threshold.
- Yes.
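The selection rule being discussed, sketched with invented values (the attribute names mirror the conversation, the numbers are made up):

```python
# Keep features whose information value exceeds the threshold;
# collect the rest in features_to_drop_, as other selectors do.
information_values_ = {"var_a": 0.35, "var_b": 0.01, "var_c": 0.12}
threshold = 0.1

features_to_drop_ = [f for f, iv in information_values_.items() if iv <= threshold]
selected = [f for f in information_values_ if f not in features_to_drop_]
# features_to_drop_ == ["var_b"]; selected == ["var_a", "var_c"]
```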
why are we deleting PSI selector?
hola @solegalli, I've returned from Black Rock City ;)
why are we deleting PSI selector?
Two messages prior, I asked, "Is the drop_psi_features.py the file that needs to be removed from the PR?". You responded "Yes"; therefore, I deleted the file.
I'm guessing that I misunderstood something. What would you like to be done?
I have another question regarding the unit tests. I'm predominantly using `df_enc` for the unit tests. Given that it's a made-up dataframe, each variable's information value is very low. Therefore, I use unrealistic thresholds, e.g. -1, to test the transformer's functionality.
Given that the unit tests assess the transformer's algorithmic functionality, is it OK for these transformers' init params to be impractical?
hola @solegalli,
Are there any additional unit tests that we should create for this class?
Hi @solegalli,
Two questions:
Question 1: Most material that explains information value (IV) has a table that states if IV is greater than 0.5 then the variable's predictive power is "suspicious, too good to be true".
When a variable has an IV score > 0.5, does a data scientist drop that variable? If so, should this transformer have two threshold init params, a floor and a ceiling?
Question 2: I've read that IV "loosely" assumes that the independent and dependent variables have a linear relationship. Is this correct? And, if so, is IV only to be used with logistic regression?
I see. By "this file should be removed from the PR" I meant, we should not include changes to this file in this PR. Because here we are working on a completely different class. The way to "remove" a file from a PR is to checkout the version of the file in main. If you delete it, it will be deleted from main when we merge.
Regarding testing, the tests aim to test the logic, and the logic needs to make "sense". Now, we may be OK adding "impracticable" thresholds, but if later on we choose to restrict the threshold parameters to what "makes sense", we would have to re-do all the tests. All of this is me thinking out loud, sorry. So I guess it would make more sense to have tests that test not only the logic but also how the transformer would be used in real life.
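A sketch of what such a "real life" test could look like, with invented data: build a dataset where one variable is clearly associated with the target and another is pure noise, then assert that a realistic threshold (0.02 is the "not useful for prediction" cutoff cited in the linked articles) separates them. All names here are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 2000
y = rng.randint(0, 2, n)
# "predictive": category "a" is far more common when y == 1
predictive = np.where(y == 1,
                      rng.choice(["a", "b"], n, p=[0.8, 0.2]),
                      rng.choice(["a", "b"], n, p=[0.2, 0.8]))
# "noise": categories drawn independently of y
noise = rng.choice(["x", "y"], n)

def iv(col):
    df = pd.DataFrame({"cat": col, "t": y})
    g = df.groupby("cat")["t"]
    pos, neg = g.sum(), g.count() - g.sum()
    p, q = pos / pos.sum(), neg / neg.sum()
    return float(((p - q) * np.log(p / q)).sum())

assert iv(predictive) > 0.02   # realistic threshold keeps the signal...
assert iv(noise) < 0.02        # ...and drops the noise
```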
I've read that IV "loosely" assumes that the independent and dependent variables have a linear relationship.
Hi @Morgan-Sell
Q1) I think one threshold is enough. If IV > 0.5, it is for the user to know what to do. My view at least.
Q2) Yes, this method is intended for linear models.
Thank you!
Thank you for the thorough feedback, @solegalli!
As always, I have follow-up questions/comments.
Q1:
Q1) I think one threshold is enough. If IV > 0.5 it is for the user to know what to do. My view at least
This makes sense. In practice, however, the user will always have to execute `sel.information_values_` to check the IVs. This doesn't seem seamless. How would the user implement the transformer when using a scikit-learn Pipeline? How would the transformer be used in a deployed model?
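One way this can still be seamless, sketched under invented names (`IVSelector` and its internals are not feature-engine's API): a selector that stores `features_to_drop_` at fit time and drops those columns in `transform()` needs no manual inspection at predict time, and because it follows the fit/transform protocol it slots into a scikit-learn Pipeline like any other step.

```python
import numpy as np
import pandas as pd

class IVSelector:
    def __init__(self, threshold=0.2):
        self.threshold = threshold

    def fit(self, X, y):
        # learn IV per column, record which columns fall below the threshold
        self.information_values_ = {c: self._iv(X[c], y) for c in X.columns}
        self.features_to_drop_ = [c for c, v in self.information_values_.items()
                                  if v < self.threshold]
        return self

    def transform(self, X):
        # downstream steps never see the dropped columns
        return X.drop(columns=self.features_to_drop_)

    @staticmethod
    def _iv(col, y):
        df = pd.DataFrame({"cat": col, "t": y})
        g = df.groupby("cat")["t"]
        pos, neg = g.sum(), g.count() - g.sum()
        p, q = pos / pos.sum(), neg / neg.sum()
        return float(((p - q) * np.log(p / q)).sum())

X = pd.DataFrame({"signal": ["a"] * 4 + ["b"] * 4, "noise": ["x", "y"] * 4})
y = pd.Series([1, 1, 1, 0, 0, 0, 0, 1])
Xt = IVSelector(threshold=0.1).fit(X, y).transform(X)
# Xt keeps only "signal"; "noise" has IV 0 and is dropped
```

In a pipeline this would then read `Pipeline([("iv", IVSelector()), ("clf", LogisticRegression())])` (hypothetical usage), and a deployed model applies the same drops automatically; `information_values_` is only consulted when auditing.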
Q2:
I see. By "this file should be removed from the PR" I meant, we should not include changes to this file in this PR. Because here we are working on a completely different class. The way to "remove" a file from a PR is to checkout the version of the file in main. If you delete it, it will be deleted from main when we merge.
Ok, got it. I thought it was strange to delete the class. I apologize; I should've asked before doing so. What's the proper approach for recreating the PSIDropFeatures class? I'm guessing there's something more elegant than creating a new .py file and copying/pasting the existing code.
Q3:
Also, if the user guide is "borrowing" material from some blog /article, we need to credit them.
I was going to ask which citation methodology we use. Where can I find an example?
Q4:
Finally, it's been a while, so I forgot, this method is intended only for categorical variables? what do the blogs say?
Because I am thinking that we could discretize numerical first, then add woe, to calculate the iv, and select from all features, and not just categorical.
Good thought! In the user guide, I say categorical variables have "categories or bins". I view a discretized variable as a categorical variable where the bins are categories. Is this the generally-accepted view, or just this fool's view?
Are you suggesting that we say, "Numerical variables can be used, but they must first be discretized. And, if the bins are assigned integers as labels, then `ignore_format` must be set to `False`."?
hi @solegalli,
I mentioned earlier that I'm seeing unrealistic information values. I've tried a few different OpenML datasets and the IVs are all over the place. I think there may be an error in the variables being returned by `self._calculate_woe()` and passed to `self._calculate_iv()`. Before I make crazy changes, I think it's best for you to check my logic. I added comments to the code base.
I used the table from this Medium article as the foundation of my logic.
hi @solegalli, I know you had a couple of deadlines. At the same time, I thought I'd check in to see if you had any thoughts on my comments/questions above. It seems like we want to increase feature-engine's feature-selection capabilities ;)
This PR is the next one on my list. Thanks for the patience :)
Hola @solegalli, any thoughts? ;)
Hi @Morgan-Sell
I am continuing the conversation here: https://github.com/feature-engine/feature_engine/pull/542