feature_engine
Information value for feature selection using weight of evidence (WoE)
Information value is a variable-selection method used with binary classifiers. Information value summarizes how much knowing a variable, e.g. var_A, helps in predicting the dependent variable.
feature-engine includes a WoE Encoder. The InformationValue class will leverage the WoEEncoder class.
Closes #280
Notes from #280:
https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
https://www.listendata.com/2019/08/WOE-IV-Continuous-Dependent.html
http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/
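From those references, the WoE of a category is ln(%events / %non-events), and IV sums (%events − %non-events) × WoE over categories. A minimal pandas sketch of that formula (the function name and data are invented here, not feature-engine's API):

```python
import numpy as np
import pandas as pd

def information_value(X: pd.Series, y: pd.Series) -> float:
    """IV = sum over categories of (%events - %non-events) * WoE."""
    df = pd.DataFrame({"cat": X, "target": y})
    grouped = df.groupby("cat")["target"]
    pos = grouped.sum()            # events per category
    neg = grouped.count() - pos    # non-events per category
    p_pos = pos / pos.sum()        # share of all events
    p_neg = neg / neg.sum()        # share of all non-events
    woe = np.log(p_pos / p_neg)    # note: undefined if a category has 0 events
    return float(((p_pos - p_neg) * woe).sum())

X = pd.Series(["a", "a", "b", "b", "b", "c", "c", "c"])
y = pd.Series([1, 0, 1, 0, 0, 1, 0, 0])
iv = information_value(X, y)  # ≈ 0.0924
```

Note the caveat in the comment: a category with zero events or zero non-events makes the log blow up, which real implementations have to guard against.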
hola @solegalli,
I have a couple of questions about syntax related to `self` and the use of internal methods:
- If a method returns a variable that will be assigned to a class attribute, should the syntax be kept as such? Or is it better to create the attribute inside the method and end the method with `return self`? See the comparison below.
```python
# Option A: _method() returns a value and the caller assigns the attribute
self.attribute = self._method(data)

# Option B: _method() assigns the attribute itself and returns self
def _method(self, data):
    # code
    # code
    self.attribute = temp_variable
    return self
```

Which is the preferred syntax? Option A or B?
- Should `self` always precede a method when the method is called within another method? Imagine we have a method called `_validation_method()` that accepts variable `X`, and `_validation_method()` is called in the `fit()` method. Should `_validation_method()` always be called using the syntax `self._validation_method(X)`? Or can we call the method using `_validation_method(X)`, i.e., sans `self`?
- Does using `self` give the method access to the class's attributes?
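A minimal sketch answering both questions with a toy class (all names invented here): a helper must be called as `self._method(...)` because it lives on the instance, and `self` is exactly what gives it access to the instance's attributes. Both Option A and Option B are valid Python; scikit-learn-style `fit()` methods conventionally set attributes and `return self`.

```python
class Demo:
    def fit(self, data):
        # Option A: helper returns a value, the caller assigns the attribute.
        self.attribute_a_ = self._compute(data)
        # Option B: helper sets the attribute itself and returns self.
        self._compute_and_store(data)
        return self  # returning self from fit() enables method chaining

    def _compute(self, data):
        return sum(data)

    def _compute_and_store(self, data):
        # 'self' gives this method access to (and lets it create) attributes.
        self.attribute_b_ = max(data)
        return self

demo = Demo().fit([1, 2, 3])
# demo.attribute_a_ == 6, demo.attribute_b_ == 3
```

Calling bare `_compute(data)` inside `fit()` would raise a `NameError`, because the name only exists as an attribute of the class, not as a local or global variable.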
hi @solegalli,
I believe this class is now ready for your virtuoso programming skills ;)
I'm confused as to why the tests for CountFrequencyEncoder and OrdinalEncoder are failing. Once you're settled, it would be great to dive into the details of some of these master/meta unit tests.
Hi @Morgan-Sell
FYI: https://github.com/Morgan-Sell/feature_engine/pull/12
Amazing, @solegalli! How do you see these improvements? Show me the way, Sra Miagi! ;)
I'll work on the unit tests.
One question: how is `self.performance_drifts_` derived? It's not from any of the 3 parent classes. Is it from a grandparent class?
I don't think we need `performance_drifts_` here. Do we?
I think this is still pending.
ToDo:
- [ ] remove this file from pr
- [ ] tests
- [ ] docs: index, api, user_guide
Hi @solegalli,
I'll work on the steps outlined above.
Questions/responses to the above comments:
- I don't think this class requires `performance_drifts_`, as this attribute is used with shuffled features; that's why I was a bit confused. If so, do we need to set a default value for `threshold`?
- Is drop_psi_features.py the file that needs to be removed from the PR?
- Yes, if I understand correctly we select features with IV beyond a threshold.
- Yes.
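The selection rule being discussed, sketched with invented values (the attribute names mirror the conversation, the numbers are made up):

```python
# Keep features whose information value exceeds the threshold;
# collect the rest in features_to_drop_, as other selectors do.
information_values_ = {"var_a": 0.35, "var_b": 0.01, "var_c": 0.12}
threshold = 0.1

features_to_drop_ = [f for f, iv in information_values_.items() if iv <= threshold]
selected = [f for f in information_values_ if f not in features_to_drop_]
# features_to_drop_ == ["var_b"]; selected == ["var_a", "var_c"]
```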
why are we deleting PSI selector?
hola @solegalli, I've returned from Black Rock City ;)
why are we deleting PSI selector?
Two messages prior, I asked, "Is the drop_psi_features.py the file that needs to be removed from the PR?". You responded "Yes"; therefore, I deleted the file.
I'm guessing that I misunderstood something. What would you like to be done?
I have another question regarding the unit tests. I'm predominantly using `df_enc` for the unit tests. Given that it's a made-up dataframe, each variable's information value is very low. Therefore, I use unrealistic thresholds, e.g. -1, to test the transformer's functionality.
Given that the unit tests assess the transformer's algorithmic functionality, is it OK for these transformers' init params to be impractical?
hola @solegalli,
Are there any additional unit tests that we should create for this class?
Hi @solegalli,
Two questions:
Question 1: Most material that explains information value (IV) has a table that states if IV is greater than 0.5 then the variable's predictive power is "suspicious, too good to be true".
When a variable has an IV score > 0.5, does a data scientist drop that variable? If so, should this transformer have two threshold init params, a floor and a ceiling?
Question 2: I've read that IV "loosely" assumes that the independent and dependent variables have a linear relationship. Is this correct? And, if so, is IV only to be used with logistic regression?
I see. By "this file should be removed from the PR" I meant, we should not include changes to this file in this PR. Because here we are working on a completely different class. The way to "remove" a file from a PR is to checkout the version of the file in main. If you delete it, it will be deleted from main when we merge.
Regarding testing, the tests aim to test the logic, and the logic needs to make "sense". Now, we may be OK adding "impracticable" thresholds, but if later on we choose to restrict the threshold parameters to what "makes sense", we would have to re-do all the tests. All of this is me thinking out loud, sorry. So I guess it would make more sense to have tests that test not only the logic but also how the transformer would be used in real life.
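A sketch of what such a "real life" test could look like, with invented data: build a dataset where one variable is clearly associated with the target and another is pure noise, then assert that a realistic threshold (0.02 is the "not useful for prediction" cutoff cited in the linked articles) separates them. All names here are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 2000
y = rng.randint(0, 2, n)
# "predictive": category "a" is far more common when y == 1
predictive = np.where(y == 1,
                      rng.choice(["a", "b"], n, p=[0.8, 0.2]),
                      rng.choice(["a", "b"], n, p=[0.2, 0.8]))
# "noise": categories drawn independently of y
noise = rng.choice(["x", "y"], n)

def iv(col):
    df = pd.DataFrame({"cat": col, "t": y})
    g = df.groupby("cat")["t"]
    pos, neg = g.sum(), g.count() - g.sum()
    p, q = pos / pos.sum(), neg / neg.sum()
    return float(((p - q) * np.log(p / q)).sum())

assert iv(predictive) > 0.02   # realistic threshold keeps the signal...
assert iv(noise) < 0.02        # ...and drops the noise
```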
I've read that IV "loosely" assumes that the independent and dependent variables have a linear relationship.
Hi @Morgan-Sell
Q1) I think one threshold is enough. If IV > 0.5, it is for the user to know what to do. My view at least.
Q2) Yes, this method is intended for linear models.
Thank you!
Thank you for the thorough feedback, @solegalli!
As always, I have follow-up questions/comments.
Q1:
Q1) I think one threshold is enough. If IV > 0.5 it is for the user to know what to do. My view at least
This makes sense. In practice, however, the user will always have to execute `sel.information_values_` to check the IVs. This doesn't seem seamless. How would the user implement the transformer when using a scikit-learn Pipeline? How would the transformer be used in a deployed model?
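One way this can still be seamless, sketched under invented names (`IVSelector` and its internals are not feature-engine's API): a selector that stores `features_to_drop_` at fit time and drops those columns in `transform()` needs no manual inspection at predict time, and because it follows the fit/transform protocol it slots into a scikit-learn Pipeline like any other step.

```python
import numpy as np
import pandas as pd

class IVSelector:
    def __init__(self, threshold=0.2):
        self.threshold = threshold

    def fit(self, X, y):
        # learn IV per column, record which columns fall below the threshold
        self.information_values_ = {c: self._iv(X[c], y) for c in X.columns}
        self.features_to_drop_ = [c for c, v in self.information_values_.items()
                                  if v < self.threshold]
        return self

    def transform(self, X):
        # downstream steps never see the dropped columns
        return X.drop(columns=self.features_to_drop_)

    @staticmethod
    def _iv(col, y):
        df = pd.DataFrame({"cat": col, "t": y})
        g = df.groupby("cat")["t"]
        pos, neg = g.sum(), g.count() - g.sum()
        p, q = pos / pos.sum(), neg / neg.sum()
        return float(((p - q) * np.log(p / q)).sum())

X = pd.DataFrame({"signal": ["a"] * 4 + ["b"] * 4, "noise": ["x", "y"] * 4})
y = pd.Series([1, 1, 1, 0, 0, 0, 0, 1])
Xt = IVSelector(threshold=0.1).fit(X, y).transform(X)
# Xt keeps only "signal"; "noise" has IV 0 and is dropped
```

In a pipeline this would then read `Pipeline([("iv", IVSelector()), ("clf", LogisticRegression())])` (hypothetical usage), and a deployed model applies the same drops automatically; `information_values_` is only consulted when auditing.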
Q2:
I see. By "this file should be removed from the PR" I meant, we should not include changes to this file in this PR. Because here we are working on a completely different class. The way to "remove" a file from a PR is to checkout the version of the file in main. If you delete it, it will be deleted from main when we merge.
Ok, got it. I thought it was strange to delete the class. I apologize; I should've asked before doing so. What's the proper approach for recreating the PSIDropFeatures class? I'm guessing there's something more elegant than creating a new .py file and copying/pasting the existing code.
Q3:
Also, if the user guide is "borrowing" material from some blog /article, we need to credit them.
I was going to ask which citation methodology we use. Where can I find an example?
Q4:
Finally, it's been a while, so I forgot, this method is intended only for categorical variables? what do the blogs say?
Because I am thinking that we could discretize numerical first, then add woe, to calculate the iv, and select from all features, and not just categorical.
Good thought! In the user guide, I say categorical variables have "categories or bins". I view a discretized variable as a categorical variable where the bins are categories. Is this the generally-accepted view, or just this fool's view?
Are you suggesting that we say, "Numerical variables can be used, but they must first be discretized. And, if the bins are assigned integers as labels, then `ignore_format` must be set to `False`."?
hi @solegalli,
I mentioned earlier that I'm seeing unrealistic information values. I've tried a few different OpenML datasets and the IVs are all over the place. I think there may be an error in the variables being returned by `self._calculate_woe()` and passed to `self._calculate_iv()`. Before I make crazy changes, I think it's best for you to check my logic. I added comments to the code base.
I used the table from this Medium article as the foundation of my logic.
hi @solegalli, I know you had a couple of deadlines. At the same time, I thought I'd check in to see if you had any thoughts on my comments/questions above. It seems like we want to increase feature-engine's feature-selection capabilities ;)
This PR is the next one on my list. Thanks for the patience :)
Hola @solegalli, any thoughts? ;)
Hi @Morgan-Sell
I am continuing the conversation here: https://github.com/feature-engine/feature_engine/pull/542