[NEW TRANSFORMER] exponential width discretiser

Open solegalli opened this issue 2 years ago • 10 comments

When a variable is on a logarithmic scale, it might make sense to create the intervals on a log scale instead of a linear scale.

Quote: " When the numbers span multiple magnitudes, it may be better to group by powers of 10 (or powers of any constant): 0–9, 10–99, 100–999, 1000–9999, etc. The bin widths grow exponentially "

The idea is taken from "Feature Engineering for Machine Learning", Alice Zheng, O'Reilly.

solegalli avatar May 07 '22 18:05 solegalli

This issue is a relatively simple one.

We need to use the EqualWidthDiscretiser() as a template, copy the class into a new Python file (.py) within the discretisation module, and update the fit method so that it creates the bin limits on a logarithmic scale.

We could find the intervals like this: np.floor(np.log10(X[variable])), and give the user the option to choose between log10 and ln to begin with.
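A minimal sketch of that interval computation, using plain NumPy rather than any feature_engine class. The function name `log_bin_index` and its `base` parameter are made up for illustration; the sample values deliberately avoid exact powers of the base.

```python
import numpy as np


def log_bin_index(values, base=10.0):
    """Hypothetical helper: index of the log-scale bin each value falls into."""
    values = np.asarray(values, dtype=float)
    if np.any(values <= 0):
        # The logarithm is only defined for strictly positive values.
        raise ValueError("log-scale binning needs strictly positive values")
    # floor(log_base(x)) via the change-of-base formula
    return np.floor(np.log(values) / np.log(base)).astype(int)


sample = [3, 42, 570, 8000]
print(log_bin_index(sample))              # bins by powers of 10
print(log_bin_index(sample, base=np.e))   # bins by powers of e (ln)
```

With base 10, each index k corresponds to the interval from 10**k up to, but not including, 10**(k+1), which matches the 0–9, 10–99, 100–999 grouping quoted above for values of 1 or more.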

solegalli avatar May 09 '22 07:05 solegalli

It would work only with positive features. Maybe something like np.floor(np.cbrt(X[column])) (or np.floor(np.power(X[column], 1/(2i+1))) in general) would be more universal?
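A rough sketch of the root-based alternative, assuming plain NumPy. Note that np.power returns NaN for a negative base with a fractional exponent, so the general odd root below is computed on the absolute value and the sign is restored afterwards; `floor_odd_root` is a hypothetical name.

```python
import numpy as np


def floor_odd_root(values, i=1):
    """Hypothetical helper: floor of the (2i+1)-th root, keeping the sign."""
    values = np.asarray(values, dtype=float)
    # Take the root of |x| and restore the sign, since a negative base with
    # a fractional exponent would give NaN in np.power.
    root = np.sign(values) * np.power(np.abs(values), 1.0 / (2 * i + 1))
    return np.floor(root)


values = np.array([-900.0, -30.0, -0.5, 0.0, 0.5, 30.0, 900.0])
cbrt_bins = np.floor(np.cbrt(values))      # np.cbrt handles negatives directly
fifth_root_bins = floor_odd_root(values, i=2)
print(cbrt_bins)
print(fifth_root_bins)
```

Unlike the log version, this accepts zero and negative values, at the cost of bins that no longer line up with powers of a constant.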

glevv avatar May 09 '22 18:05 glevv

The idea of this transformer is to work with variables on a log scale.

For bespoke limits we already have the ArbitraryDiscretiser, which currently takes a dictionary of limits. I guess we could adapt that discretiser so that it can take functions as well, instead of just limits?
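To make the "functions as well as limits" idea concrete, here is a small sketch in plain NumPy. It is not the ArbitraryDiscretiser API: `resolve_limits`, the variable names, and the data are all hypothetical, and the only assumption is that each dictionary entry is either a list of limits or a callable that derives limits from the column.

```python
import numpy as np


def resolve_limits(binning, data):
    """Hypothetical helper: turn every dictionary entry into explicit limits."""
    resolved = {}
    for variable, spec in binning.items():
        column = np.asarray(data[variable], dtype=float)
        # A callable computes the limits from the data; a list is used as given.
        limits = spec(column) if callable(spec) else spec
        resolved[variable] = np.asarray(limits, dtype=float)
    return resolved


data = {"income": [3.0, 42.0, 570.0, 8000.0], "age": [18, 35, 52, 70]}
binning = {
    # powers of 10 covering the observed range
    "income": lambda col: 10.0 ** np.arange(0, np.ceil(np.log10(col.max())) + 1),
    # fixed, user-supplied limits
    "age": [0, 30, 60, 120],
}
limits = resolve_limits(binning, data)
print(limits)
```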

solegalli avatar May 10 '22 06:05 solegalli

Discretisers usually have bins as their attribute, but with this type of transformation there is no clean way of getting them.

I think there is already a log transformer in the library, so maybe adding floor: bool = False as a parameter to it would be more straightforward and logical? It would block the ability to inverse_transform the data, though (or it would inverse transform it, but with quantization errors).
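The quantization error mentioned here can be illustrated with plain NumPy (this is only a sketch of a floored log transform, not the library's log transformer): once the log is floored, inverting it returns the lower edge of each bin rather than the original value.

```python
import numpy as np

values = np.array([3.0, 42.0, 570.0])

floored = np.floor(np.log10(values))   # floored log transform
recovered = 10.0 ** floored            # "inverse": roughly 1, 10 and 100
error = values - recovered             # information lost by flooring

print(recovered)
print(error)
```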

glevv avatar May 18 '22 17:05 glevv

The logic for this transformer can be obtained from cell 7 onward in this Jupyter notebook.

It does contain clear interval limits. They are real numbers, obtained from a log transform of the variable, as in the notebook.
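One way such limits could be derived, shown as a sketch only: the notebook's exact code is not reproduced here, and `log_scale_limits` is a hypothetical name. The assumption is equal-width intervals in log10 space, mapped back to the original scale.

```python
import numpy as np


def log_scale_limits(values, bins=4):
    """Hypothetical helper: equal-width limits in log10 space, mapped back."""
    log_values = np.log10(np.asarray(values, dtype=float))
    # Equally spaced edges between the smallest and largest log value
    log_edges = np.linspace(log_values.min(), log_values.max(), bins + 1)
    return 10.0 ** log_edges


values = np.array([2.0, 15.0, 40.0, 320.0, 2500.0, 20000.0])
limits = log_scale_limits(values, bins=4)

# Use only the inner limits so the first and last bins are open-ended.
labels = np.digitize(values, limits[1:-1])
print(limits)
print(labels)
```

The limits are ordinary real numbers on the original scale, so they could be stored as an attribute in the same way as the limits of an equal-width discretiser.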

solegalli avatar May 29 '22 07:05 solegalli

@solegalli Hi, I would like to give this a try if possible.

SangamSwadiK avatar Jul 05 '22 01:07 SangamSwadiK

Hi @SangamSwadiK

Go for it :)

Only thing, I am going on holidays on Thursday, so won't be able to review till August. Hope that's alright?

solegalli avatar Jul 05 '22 12:07 solegalli

Great! By then I should have a proper PR, I guess. Happy holidays!

SangamSwadiK avatar Jul 05 '22 13:07 SangamSwadiK

hi @solegalli and @SangamSwadiK,

Was this issue completed? I think I saw a PR created and committed for this issue.

Morgan-Sell avatar Aug 18 '22 22:08 Morgan-Sell

There is no PR for this transformer. But this is another one of those transformers where I am in two minds about whether it is useful or not.

solegalli avatar Aug 19 '22 08:08 solegalli

Hola @solegalli, have you made a decision about this transformer?

Morgan-Sell avatar Dec 06 '22 15:12 Morgan-Sell

Not yet.

solegalli avatar Dec 07 '22 10:12 solegalli

I think I can work on that. I have some snippets that worked (they will work with negative values too), so I can write some draft PR.

glevv avatar Jan 13 '23 07:01 glevv

Awesome :)

solegalli avatar Jan 13 '23 15:01 solegalli