scikit-optimize
[MRG] Keep order of variables in LabelEncoder
The problem
The current behavior of the LabelEncoder is to sort the variables when the mapping is built. This happens because it uses np.unique, which returns a sorted array of unique values. See https://github.com/Elementa-Engineering/scikit-optimize/blob/master/skopt/space/transformers.py#L175-L177 and https://numpy.org/doc/stable/reference/generated/numpy.unique.html
For example:
>>> from skopt.space.space import Categorical
>>> c = Categorical(("c", "b", "a"), transform="label")
>>> c.transform(["a", "b", "c"])
[0, 1, 2]
Note that the returned labels are 0, 1 and 2, which corresponds to the sorted order ("a", "b", "c") even though the specified order was ("c", "b", "a"). This can be counter-intuitive, especially when the order of the categories "means" something to the user.
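The sorting can be reproduced with np.unique alone, outside of skopt. The snippet below is only a minimal sketch of that effect; the names `categories`, `classes` and `sorted_mapping` are illustrative and are not taken from the skopt source.

```python
import numpy as np

# Categories in the order the user declared them.
categories = ("c", "b", "a")

# np.unique returns the unique values sorted, so the declared order is lost.
classes = np.unique(categories)

# Building the label mapping from the sorted array assigns labels alphabetically.
sorted_mapping = {str(label): index for index, label in enumerate(classes)}

print(classes.tolist())                               # ['a', 'b', 'c']
print([sorted_mapping[x] for x in ("a", "b", "c")])   # [0, 1, 2]
```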
Implemented Fix
This PR implements a simple fix that retains the order in which the categories were specified. The expected behavior then becomes:
>>> from skopt.space.space import Categorical
>>> c = Categorical(("c", "b", "a"), transform="label")
>>> c.transform(["a", "b", "c"])
[2, 1, 0]
The order is preserved.
The same goes for numerical values:
>>> from skopt.space.space import Categorical
>>> c = Categorical((10, 30, 20), transform="label")
>>> c.transform([10, 20, 30])
[0, 2, 1]
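For illustration, an order-preserving mapping can be sketched in plain Python as follows. This is not the actual diff of this PR: `ordered_label_mapping` is a hypothetical helper that assumes the categories are hashable, and it only shows the idea of assigning each label by first appearance instead of by sorted position.

```python
def ordered_label_mapping(categories):
    """Map each category to the index of its first appearance (hypothetical helper)."""
    mapping = {}
    for category in categories:
        # setdefault leaves an existing entry untouched, so duplicates keep
        # the label of their first occurrence.
        mapping.setdefault(category, len(mapping))
    return mapping


letters = ordered_label_mapping(("c", "b", "a"))
print([letters[x] for x in ("a", "b", "c")])   # [2, 1, 0]

numbers = ordered_label_mapping((10, 30, 20))
print([numbers[x] for x in (10, 20, 30)])      # [0, 2, 1]
```

Both outputs match the expected behavior shown above for the string and the numerical categories.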
@kernc, not sure why CI didn't fire up here, but this is ready for a review. :)
Try pushing a commit again to trigger the CI, perhaps?
Still not working! Weird!
Well, I'm at a loss here... Have you run the tests locally using pytest? Maybe the CI is straight crashing on this PR, hence no info?