Add Ordinal encoder component
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
Will need to run perf tests. Ideally, we can come up with some key examples where ordinal encoding outperforms one-hot encoding.
Added "needs design" because we should write out how this would be used. Users need to be able to specify which categorical features are ordered and which are not, and that may require Woodwork support first.
@dsherry This issue came up recently in some experiments I have been doing. In reviewing the results with @rpeck and @rwedge, we noticed that several ordinal columns were being encoded as regular categorical columns by the EvalML OneHotEncoder, so we would get a feature such as MONTH(Created)_9 for the 9th month of the year. @rpeck suggested we should not be encoding the Ordinal columns in this manner.
Any Woodwork columns that are ordered should be specified with the Ordinal logical type. Setting a column as Ordinal in Woodwork requires the order values to be defined, and the pandas dtype is set as CategoricalDtype with the specification that the values are ordered.
As a concrete example of this, the Featuretools Month primitive outputs an Ordinal column in the feature matrix with the following dtype:
CategoricalDtype(categories=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], ordered=True)
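For illustration, here is a minimal pandas-only sketch of the difference being described. The sample values are made up, and the column is built by hand rather than produced by Featuretools; only the dtype matches the one quoted above.

```python
import pandas as pd

# Same dtype as the one quoted above, built by hand (not via Featuretools).
month_dtype = pd.CategoricalDtype(categories=list(range(1, 13)), ordered=True)

# Made-up sample values standing in for a MONTH(Created) feature.
months = pd.Series([9, 1, 12, 9], dtype=month_dtype)

# One-hot encoding yields one column per category, e.g. "MONTH(Created)_9",
# and discards the ordering between months.
one_hot = pd.get_dummies(months, prefix="MONTH(Created)")

# The ordered dtype already carries an integer position for each value,
# which is what an ordinal encoding would preserve in a single column.
codes = months.cat.codes.tolist()
```

Here `codes` holds each value's position in the declared category order, so the twelve one-hot columns collapse to a single ordered column.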
Between the Woodwork logical type and the pandas dtype ordering, it seems like there should be enough information present to determine what columns should have Ordinal encoding applied.
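A rough sketch of how that detection might look, using only the pandas dtype and scikit-learn's OrdinalEncoder. This is not EvalML's implementation: the helper `ordered_categorical_columns` and the sample data are hypothetical, and a real component would presumably also consult the Woodwork logical type.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder


def ordered_categorical_columns(df):
    """Hypothetical helper: names of columns whose pandas dtype is an ordered categorical."""
    return [
        col
        for col in df.columns
        if isinstance(df[col].dtype, pd.CategoricalDtype) and df[col].dtype.ordered
    ]


# Made-up data: two ordered columns and one unordered categorical column.
df = pd.DataFrame(
    {
        "month": pd.Series(
            [9, 1, 12],
            dtype=pd.CategoricalDtype(categories=list(range(1, 13)), ordered=True),
        ),
        "size": pd.Series(
            ["medium", "small", "large"],
            dtype=pd.CategoricalDtype(
                categories=["small", "medium", "large"], ordered=True
            ),
        ),
        "color": pd.Series(["red", "blue", "red"], dtype="category"),
    }
)

ordinal_cols = ordered_categorical_columns(df)

# Pass the declared order to the encoder so the integer codes follow it
# instead of the default sorted order of the observed values.
categories = [list(df[col].dtype.categories) for col in ordinal_cols]
encoder = OrdinalEncoder(categories=categories)
encoded = encoder.fit_transform(df[ordinal_cols])
```

In this sketch the unordered "color" column is left out of `ordinal_cols`, so it could still be routed to one-hot encoding.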
@gsheni FYI
@chukarsten @asniyaz Can we prioritize this and add it to the next EvalML sprint? It is affecting our current work.