Use our own implementation for one-hot encoding

Open angela97lin opened this issue 4 years ago • 1 comments

Following discussion from #1936 and #830, we've had to work around the scikit-learn implementation of one-hot encoding to add our own functionality. #1936 in particular adds the ability to drop an encoded feature that only has two categories, but has to work around limitations in scikit-learn's implementation.

Rolling our own implementation can help increase performance and avoid the extra convoluted logic we've added to work around scikit-learn's implementation :)

angela97lin avatar Mar 18 '21 17:03 angela97lin

Pros of rolling our own:

  • #1936: rather than dropping features after sklearn generates them, we avoid computing them in the first place, which saves time and complexity
  • Avoid the possibility of future cryptic bugs due to a mismatch in ordering or index between sklearn's return and what our code thinks it returns
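
To make the first point concrete, here's a minimal sketch of what "avoid computing them in the first place" could look like. This is not the evalml implementation: `one_hot_encode`, its `drop_binary` parameter, and the `name_category` column naming are all hypothetical, and it uses plain dicts and lists instead of DataFrames to stay self-contained. It also assumes the categories within a feature are sortable.

```python
def one_hot_encode(columns, drop_binary=True):
    """Hypothetical sketch: one-hot encode a dict of feature name -> list of values.

    Returns a dict of encoded column name -> list of 0/1 ints.
    """
    encoded = {}
    for name, values in columns.items():
        # Sort the unique categories so the output column order is deterministic.
        categories = sorted(set(values))
        # A two-category feature is fully described by one indicator column,
        # so skip the first category up front instead of dropping it afterwards.
        if drop_binary and len(categories) == 2:
            categories = categories[1:]
        for category in categories:
            encoded[f"{name}_{category}"] = [int(v == category) for v in values]
    return encoded


data = {
    "color": ["red", "green", "blue", "red"],
    "flag": ["yes", "no", "yes", "yes"],
}
result = one_hot_encode(data)
print(list(result))  # encoded column names, in creation order
```

Since the output names are built in the same loop that computes the values, there's no separate ordering or index to keep in sync, which is the kind of mismatch the second point is worried about.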

Cons:

  • The sklearn OHE API and behavior are good, heh, and have been vetted by the community.

An alternative @chukarsten pointed out: we could propose an enhancement to the sklearn OHE to make per-column behavior easier.

dsherry avatar Mar 18 '21 17:03 dsherry