feature_engine
[NEW TRANSFORMER] exponential width discretiser
When a variable is on a logarithmic scale, it might make sense to create the intervals on a log scale instead of a linear scale.
Quote: "When the numbers span multiple magnitudes, it may be better to group by powers of 10 (or powers of any constant): 0–9, 10–99, 100–999, 1000–9999, etc. The bin widths grow exponentially"
The idea is taken from "Feature Engineering for Machine Learning", Alice Zheng, O'Reilly.
This issue is a relatively simple one.
We need to use the `EqualWidthDiscretiser()` as a template, copy the class into a new Python file within the discretisation module, and update the fit method so that it creates the bin limits on a logarithmic scale.
We could find the intervals like this: `np.floor(np.log10(X[variable]))`, and give the user the option to select between log10 and ln to begin with.
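A minimal sketch of the binning step proposed above, outside of any transformer class. The helper name `log_bin_index` and its `base` argument are hypothetical and are not part of feature_engine; the example column is made up for illustration.

```python
import numpy as np
import pandas as pd

# Log functions the user could choose from, as suggested in the issue.
LOG_FUNCS = {"log10": np.log10, "ln": np.log}


def log_bin_index(values: pd.Series, base: str = "log10") -> pd.Series:
    """Hypothetical helper: bin index = floor of the log of each value."""
    if (values <= 0).any():
        # Logarithms are undefined for zero and negative values.
        raise ValueError("log-scale binning needs strictly positive values")
    return np.floor(LOG_FUNCS[base](values)).astype(int)


X = pd.DataFrame({"income": [5, 42, 750, 1200, 98000]})
bins = log_bin_index(X["income"], base="log10")
# With log10, each bin index is the order of magnitude of the value.
```

With `base="log10"`, values in 1–9 fall in bin 0, 10–99 in bin 1, and so on, which matches the powers-of-10 grouping in the quote.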
It would work only with positive features. Maybe something like `np.floor(np.cbrt(X[column]))` (or `np.floor(np.power(X[column], 1/(2i+1)))` in general) would be more universal?
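A small sketch of the odd-root idea, for illustration only. One caveat worth noting: `np.cbrt` accepts negative inputs, but `np.power` with a negative base and a fractional exponent returns `nan`, so the general form needs explicit sign handling. The helper `signed_odd_root` is hypothetical.

```python
import numpy as np


def signed_odd_root(x: np.ndarray, i: int = 1) -> np.ndarray:
    """Hypothetical helper: real (2i+1)-th root that keeps the sign of x."""
    return np.sign(x) * np.power(np.abs(x), 1.0 / (2 * i + 1))


x = np.array([-900.0, -5.0, 0.5, 30.0, 2000.0])

# Cube root works directly on negative values.
cbrt_bins = np.floor(np.cbrt(x))

# General odd root, computed on |x| and then re-signed.
general_bins = np.floor(signed_odd_root(x, i=1))

# The naive call yields nan for the negative entries.
with np.errstate(invalid="ignore"):
    naive = np.power(x, 1.0 / 3)
```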
This transformer's idea is to work with variables in the log scale.
For bespoke limits we already have the ArbitraryDiscretiser, which at the moment takes a dictionary with limits. I guess we could adapt that discretiser to take functions as well, instead of just limits?
Discretisers usually have bins as their attribute, but with this type of transformation there is no clean way of getting them.
I think there is already a log transformer in the library; maybe adding `floor: bool = False` as a parameter to it would be more straightforward and logical? It would block the ability to inverse_transform the data, though (or inverse transform it, but with quantization errors).
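To illustrate the quantization concern, here is a plain NumPy sketch. It does not use the library's log transformer, and the `floor` parameter mentioned above is only a proposal. Flooring the log and then exponentiating recovers the lower edge of each bin, not the original value.

```python
import numpy as np

x = np.array([3.0, 47.0, 999.0])

# Forward step: log transform followed by floor.
exponents = np.floor(np.log10(x))

# Attempted inverse: only the lower bin edge can be recovered.
recovered = np.power(10.0, exponents)

# Quantization error introduced by the floor.
error = x - recovered
```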
The logic for this transformer can be obtained from cell 7 onward in this Jupyter notebook.
It does contain clear interval limits. They are real numbers, obtained from a log transform of the variable, as in the notebook.
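As a rough illustration of how explicit limits could come out of a log transform (this is not the notebook's code, and the variable names and data are assumptions), one option is to compute equal-width edges on the log-transformed variable and map them back to the original scale:

```python
import numpy as np
import pandas as pd

x = pd.Series([5, 42, 750, 1200, 98000], name="income")
n_bins = 5  # assumed number of intervals

# Equal-width edges in log space.
log_x = np.log10(x)
limits_log = np.linspace(log_x.min(), log_x.max(), n_bins + 1)

# The same edges on the original scale: their widths grow exponentially.
limits = np.power(10.0, limits_log)

# Assign each observation to an interval, working in log space so the
# endpoints match the data exactly.
binned = pd.cut(log_x, bins=limits_log, include_lowest=True, labels=False)
```

Here `limits` are real-valued boundaries that could be stored as an attribute after fitting, in the same way other discretisers keep their bins.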
@solegalli Hi, I would like to give this a try if possible.
Hi @SangamSwadiK
Go for it :)
Only thing, I am going on holidays on Thursday, so won't be able to review till August. Hope that's alright?
Great! By then I should have a proper PR, I guess. Happy holidays!
hi @solegalli and @SangamSwadiK,
Was this issue completed? I think I saw a PR created and committed for this issue.
There is no PR for this transformer. But this is another one of those where I am in two minds about whether it is useful or not.
Hola @solegalli, have you made a decision about this transformer?
Not yet.
I think I can work on that. I have some snippets that worked (they will work with negative values too), so I can write some draft PR.
Awesome :)