
Add Ordinal encoder component

Open dsherry opened this issue 4 years ago • 3 comments

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

Will need to run perf tests. Ideally, we can come up with some key examples where ordinal encoding outperforms one-hot encoding.
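For reference, a minimal sketch of how the two encodings differ on a single column, using scikit-learn directly rather than any EvalML component. The toy `low`/`medium`/`high` values are made up for illustration and are not taken from a real perf test.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column with a natural order: low < medium < high.
X = np.array([["low"], ["high"], ["medium"], ["low"]], dtype=object)
order = [["low", "medium", "high"]]

# Ordinal encoding: a single output column whose codes follow the given order.
ordinal = OrdinalEncoder(categories=order)
X_ordinal = ordinal.fit_transform(X)

# One-hot encoding: one output column per category; the order is discarded.
one_hot = OneHotEncoder(categories=order)
X_one_hot = one_hot.fit_transform(X).toarray()

print(X_ordinal.ravel().tolist())  # [0.0, 2.0, 1.0, 0.0]
print(X_one_hot.shape)             # (4, 3)
```

The ordinal output keeps the feature in one column and preserves `low < medium < high`, which is the property the perf tests would need to show is useful.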

dsherry avatar Nov 02 '20 15:11 dsherry

Added "needs design" because we should write out how this would be used. Users need to be able to select which categorical features have an ordering and which don't, and that may require Woodwork support first.

dsherry avatar Nov 02 '20 16:11 dsherry

@dsherry This issue came up recently in some experiments I have been doing. In reviewing the results with @rpeck and @rwedge, we noticed that several ordinal columns were getting encoded as regular categorical columns by the EvalML OneHotEncoder, so we would get a feature such as MONTH(Created)_9 for the 9th month of the year. @rpeck suggested we should not be encoding the Ordinal columns in this manner.

Any Woodwork columns that are ordered should be specified with the Ordinal logical type. Setting a column as Ordinal in Woodwork requires the order values to be defined, and the pandas dtype is set to CategoricalDtype with ordered=True.

As a concrete example of this, the Featuretools Month primitive outputs an Ordinal column in the feature matrix with the following dtype:

CategoricalDtype(categories=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], ordered=True)

Between the Woodwork logical type and the pandas dtype ordering, it seems like there should be enough information present to determine what columns should have Ordinal encoding applied.
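A rough sketch of the detection step using only the pandas dtype (the Woodwork logical type check is left out here). The helper name `ordered_categorical_columns` and the `color` column are hypothetical, and encoding via `.cat.codes` is just one possible choice, not necessarily what an EvalML component would do.

```python
import pandas as pd

# Same dtype as the Featuretools Month primitive output shown above.
month_dtype = pd.CategoricalDtype(categories=list(range(1, 13)), ordered=True)

df = pd.DataFrame(
    {
        "MONTH(Created)": pd.Series([9, 1, 12, 9], dtype=month_dtype),
        # Hypothetical unordered categorical column for contrast.
        "color": pd.Series(["red", "blue", "red", "green"], dtype="category"),
    }
)


def ordered_categorical_columns(frame):
    """Return names of columns whose dtype is an ordered CategoricalDtype."""
    return [
        name
        for name, dtype in frame.dtypes.items()
        if isinstance(dtype, pd.CategoricalDtype) and dtype.ordered
    ]


ordinal_cols = ordered_categorical_columns(df)

# Integer codes follow the position of each value in the ordered categories.
encoded = df[ordinal_cols].apply(lambda col: col.cat.codes)

print(ordinal_cols)                         # ['MONTH(Created)']
print(encoded["MONTH(Created)"].tolist())   # [8, 0, 11, 8]
```

With this, `MONTH(Created)` stays a single numeric column instead of expanding into features like MONTH(Created)_9, while `color` would still go to the one-hot path.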

@gsheni FYI

thehomebrewnerd avatar May 17 '22 16:05 thehomebrewnerd

@chukarsten @asniyaz Can we prioritize this and add it to the next EvalML sprint? It is affecting our current work.

gsheni avatar May 17 '22 16:05 gsheni