ml_preprocessing

Using Encoder.oneHot like scikit-learn LabelBinarizer

Open CaptainDario opened this issue 4 years ago • 7 comments

First of all thanks for this nice package! But sadly I am already stuck at the beginning.

I am trying to use Encoder.oneHot like the LabelBinarizer from scikit-learn, but I am not sure how to achieve that, or whether it is even possible.

What I want is basically this:

from sklearn.preprocessing import LabelBinarizer

# create a one-hot encoder for my labels
y = ["a", "b", "c", ...]   # the labels I want to one-hot encode
lb = LabelBinarizer()
lb.fit(y)
o_y = lb.transform(y)

# inference of CNN
...

# use the encoder on a prediction of the CNN to get the label (string) of the class
prediction = lb.inverse_transform(predicted)

Encoder.oneHot forces me to provide a DataFrame instance to the constructor. However, from the README it is not clear to me what that DataFrame should look like (also, could you please update the link to the Black Friday data set?).

Your help would be highly appreciated!

CaptainDario Feb 10 '21 15:02

@CaptainDario Thank you for creating the issue! Indeed, the README says too little about encoding. I'd recommend looking at the live example. Although a different encoder is used there, the key idea is the same: encoders from this lib infer the labels from the provided data on their own, which is why you need to provide the data first (as a DataFrame). I think it would be a good idea to add the ability to pass labels directly to the encoders; I'll consider this for future updates of the lib.
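To illustrate what "inferring labels from the provided data" means, here is a minimal sketch in plain Python (the language of the scikit-learn snippet above, not Dart). It is not the library's implementation: `fit_labels` and `one_hot` are hypothetical helper names, the table is a plain list of rows with the header in the first row, and the label order (first appearance in the column) is an assumption made only for this example.

```python
def fit_labels(rows, feature_name):
    """Collect the distinct values of one column, in order of first appearance.

    `rows` is a list of rows whose first row is the header (an assumption
    for this sketch; it is not the ml_preprocessing data model).
    """
    header, *body = rows
    col = header.index(feature_name)
    labels = []
    for row in body:
        if row[col] not in labels:
            labels.append(row[col])
    return labels


def one_hot(value, labels):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in labels:
        raise ValueError(f"unknown label: {value!r}")
    return [1 if value == label else 0 for label in labels]


rows = [["character"], ["a"], ["b"], ["c"], ["d"], ["a"]]
labels = fit_labels(rows, "character")
encoded = [one_hot(row[0], labels) for row in rows[1:]]
```

The point of the sketch is that the label set only exists after the encoder has seen the column, which is why the data has to be passed in first.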

gyrdym Feb 11 '21 07:02

@CaptainDario And regarding the additional info in the README: I got your point, it really does need some words on encoding. I'll also fix the link.

gyrdym Feb 11 '21 07:02

Thank you for your quick help.

If I understand correctly, I need to create a DataFrame with a feature containing all my values, like this:

DataFrame([
["My Feature"],
["a"],
["b"],
["c"],
...,
["z"]
])

and then the created encoder will be able to convert new instances back to the label, right?

CaptainDario Feb 11 '21 09:02

Okay, I tried the above approach and it seems to work. However, the application crashes if the optional parameter featureNames is not given. Maybe it would be good to encode all labels/features if the parameter is unset.

But does the encoder provide a method to reverse the one-hot encoding? Something like unprocess, which takes a DataFrame like this:

final dataFrame = DataFrame([
  ["character"], ["a"], ["b"], ["c"], ["d"],
]);
final encoder = Encoder.oneHot(dataFrame, featureNames: ["character"]);

final prediction = DataFrame([[0], [0], [0], [1]]);
final decoded = encoder.unprocess(prediction);

decoded would then contain the value "d". That would be really helpful.
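A rough sketch of the decoding step being asked for, again in plain Python rather than Dart. `decode_one_hot` is a hypothetical name, not a method of the package. The sketch assumes each prediction is a row with one score per label and that the label list is in the same order the encoder used. It picks the position of the largest score, so it also accepts raw CNN outputs that are not exact 0/1 vectors.

```python
def decode_one_hot(rows, labels):
    """Map each one-hot (or score) row back to its label via argmax."""
    decoded = []
    for row in rows:
        if len(row) != len(labels):
            raise ValueError(
                f"expected {len(labels)} values per row, got {len(row)}"
            )
        # Index of the largest entry; ties resolve to the first maximum.
        index = max(range(len(row)), key=row.__getitem__)
        decoded.append(labels[index])
    return decoded


labels = ["a", "b", "c", "d"]          # order assumed to match the encoder
prediction = [
    [0, 0, 0, 1],                      # exact one-hot row
    [0.1, 0.7, 0.1, 0.1],              # softmax-like scores
]
decoded = decode_one_hot(prediction, labels)
```

Here the first row decodes to "d" and the second to "b". The only state the decoder needs is the ordered label list, so an encoder that has already inferred its labels from the data has everything required for the reverse direction.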

CaptainDario Feb 11 '21 10:02

@CaptainDario Thank you very much for such valuable feedback, I'll consider adding this functionality to the lib. Do you have any other problems with the package?

gyrdym Feb 11 '21 10:02

Otherwise the package seems to do exactly what I want. Thank you! Because I need something like an unprocess method to make progress with my app, I will try to implement it for Encoder.oneHot. Do you think encoder_impl.dart would be a suitable place to add unprocess?

CaptainDario Feb 11 '21 11:02

@CaptainDario I need to think it over; the name unprocess sounds a bit unclear to me.

gyrdym Feb 12 '21 20:02