evalml Use our own implementation for one-hot encoding

Use our own implementation for one-hot encoding

Open angela97lin opened this issue 4 years ago • 1 comments


Following discussion from #1936 and #830, we've had to work around the scikit-learn implementation of one-hot encoding to add our own functionality. #1936 in particular adds the ability to drop an encoded feature that only has two categories, but has to work around limitations in scikit-learn's implementation.

Rolling our own implementation can help increase performance and avoid the extra convoluted logic we've added to work around scikit-learn's implementation :)

Mar 18 '21 17:03 angela97lin

Pros of rolling our own:

#1936: rather than dropping features after sklearn generates them, we could avoid computing them in the first place, which saves time and complexity
Avoid the possibility of future cryptic bugs due to a mismatch in ordering or index between sklearn's return and what our code thinks it returns
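
To make the two points above concrete, here is a minimal sketch in plain Python, not evalml's or sklearn's actual code. The function name `one_hot_encode`, the `drop_binary` flag, and the list-of-dicts input format are all hypothetical, chosen only for illustration. It decides which indicator columns to emit before encoding, so the dropped column for a two-category feature is never computed, and it builds the feature names in the same loop as the encoding plan, so names and values cannot drift out of order.

```python
def one_hot_encode(rows, columns, drop_binary=True):
    """Hypothetical sketch: one-hot encode `columns` of `rows` (a list of dicts).

    Returns (feature_names, encoded_rows), where encoded_rows is a list of
    lists of 0/1 values aligned with feature_names.
    """
    # Collect the sorted unique categories seen in each column.
    categories = {col: sorted({row[col] for row in rows}) for col in columns}

    feature_names = []
    plan = []  # (column, category) pairs, in output order
    for col in columns:
        cats = categories[col]
        if drop_binary and len(cats) == 2:
            # A two-category feature needs only one indicator column,
            # so the first category is skipped and never computed.
            cats = cats[1:]
        for cat in cats:
            plan.append((col, cat))
            feature_names.append(f"{col}_{cat}")

    # Encode each row by walking the same plan that produced the names.
    encoded = [
        [1 if row[col] == cat else 0 for col, cat in plan]
        for row in rows
    ]
    return feature_names, encoded
```

With a two-category `color` column and a three-category `size` column, this would emit a single `color_*` indicator and three `size_*` indicators. A real implementation would also need to handle missing values, unseen categories at transform time, and top-N category limits, which this sketch ignores.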

Cons:

The sklearn OHE API and behavior are good, heh, and have been vetted by the community.

An alternative @chukarsten pointed out: we could propose an enhancement to the sklearn OHE to make per-column behavior easier.

Mar 18 '21 17:03 dsherry
