feature_engine
Feat/binarizer without column transformer
Issue raised here
Notes on Code
The BinaryDiscretiser class is implemented in binariser.py, located with the other discretisers, and takes a parameter threshold to determine where to split the interval.
- After the standard checks and the type checks for threshold, there's a check that the threshold satisfies min(x) < threshold < max(x) for each feature x (L167). If it does not, x isn't transformed and the user is notified of this. The remaining features are passed to a list for transformation.
- Because of the above, the transform method from the BaseDiscretiser is repeated here, iterating only through the new list of features that passed the threshold check rather than the list in self.variables_. I'm not sure if there's a cleaner way of doing this. We could also modify the self.variables_ attribute directly in the fit method instead, which might make sense since then it would contain only features that were actually transformed, and there would be no need to re-implement the transform method.
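
To make the second option concrete, here is a minimal sketch of narrowing variables_ inside fit so that transform can simply iterate over it. The class SketchBinaryDiscretiser, the skipped_ attribute, and the 0/1 output convention are all assumptions for illustration; this is not the code in the PR and it does not inherit from BaseDiscretiser.

```python
import pandas as pd


class SketchBinaryDiscretiser:
    """Hypothetical stand-in for illustration, not the class from this PR."""

    def __init__(self, threshold, variables):
        self.threshold = threshold
        self.variables = variables

    def fit(self, X):
        # Keep only the features whose observed range strictly contains
        # the threshold, i.e. min(x) < threshold < max(x).
        self.variables_ = [
            var
            for var in self.variables
            if X[var].min() < self.threshold < X[var].max()
        ]
        # Remember what was dropped so the user can still be notified.
        self.skipped_ = [v for v in self.variables if v not in self.variables_]
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables_:
            # Assumed convention: 1 at or above the threshold, 0 below it.
            X[var] = (X[var] >= self.threshold).astype(int)
        return X


df = pd.DataFrame({"a": [1.0, 5.0, 9.0], "b": [10.0, 20.0, 30.0]})
disc = SketchBinaryDiscretiser(threshold=4, variables=["a", "b"]).fit(df)
out = disc.transform(df)
```

With this layout, variables_ holds only the features that were actually transformed, and the generic loop over variables_ in transform needs no special handling for skipped features.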
Other notes
- I updated the docs, apart from the user_guide, since that might change depending on further changes to the implementation.
- I've tested this in an sklearn Pipeline and it seems to work fine, but I haven't included explicit tests for that, as they were missing for the other discretisers. Let me know if that's something you'd want.
- It might be nice to have functionality where the user can pass a set of different thresholds for each feature passed to the class (could be corresponding lists for threshold and variables parameters, or a dictionary of pairs).
- The threshold check output is written to stdout at the moment, but this should perhaps be given as a warning instead.
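
The last two ideas could be combined roughly as follows: a dictionary maps each variable to its own threshold, and the range check emits a warning rather than printing. The function name check_thresholds and the dictionary interface are hypothetical; this is only a sketch of one possible shape, using the standard library warnings module.

```python
import warnings

import pandas as pd


def check_thresholds(X, thresholds):
    """Return the variables whose range strictly contains their threshold.

    `thresholds` maps variable name -> threshold (a hypothetical interface).
    """
    usable = []
    for var, thr in thresholds.items():
        if X[var].min() < thr < X[var].max():
            usable.append(var)
        else:
            # Warn instead of printing, so callers can filter or capture it.
            warnings.warn(
                f"threshold {thr} is outside the range of '{var}'; "
                "the variable will not be transformed.",
                UserWarning,
            )
    return usable


df = pd.DataFrame({"a": [1.0, 5.0, 9.0], "b": [10.0, 20.0, 30.0]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    usable = check_thresholds(df, {"a": 4, "b": 50})
```

Using warnings.warn lets users silence or escalate the message with the usual warning filters, and makes the behaviour straightforward to assert on in tests.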
Finally, this is my first time contributing to open source, so all feedback is very welcome!