Use our own implementation for one-hot encoding
Following discussion from #1936 and #830, we've had to work around scikit-learn's implementation of one-hot encoding to add our own functionality. #1936 in particular adds the ability to drop an encoded feature that has only two categories, but it has to work around limitations in scikit-learn's implementation.
Rolling our own implementation could improve performance and let us remove the convoluted logic we've added to work around scikit-learn's implementation :)
Pros of rolling our own:
- #1936: rather than dropping features after sklearn generates them, we avoid computing them in the first place, which saves time and complexity
- Avoid the possibility of future cryptic bugs due to a mismatch in ordering or index between sklearn's return and what our code thinks it returns
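
To make the first point concrete, here is a rough sketch of what a hand-rolled encoder could look like, assuming pandas input. The function name `one_hot_encode` and the `drop_if_binary` flag are hypothetical and not part of evalml or sklearn. The point is only that the redundant column for a two-category feature is never computed, and that output column names are built by our own code, so their order is never inferred from another library's return value.

```python
import pandas as pd


def one_hot_encode(df, columns, drop_if_binary=True):
    """Hypothetical helper, for illustration only (not evalml's API).

    Encodes each listed column into 0/1 indicator columns. When a column
    has exactly two categories and drop_if_binary is True, only one
    indicator is computed, so nothing has to be dropped afterwards.
    """
    encoded = {}
    for col in columns:
        # Sort categories so the output column order is deterministic.
        categories = sorted(df[col].dropna().unique())
        if drop_if_binary and len(categories) == 2:
            # Skip the first category up front instead of dropping it later.
            categories = categories[1:]
        for cat in categories:
            encoded[f"{col}_{cat}"] = (df[col] == cat).astype(int)
    # Columns that were not encoded pass through unchanged, in front.
    passthrough = df.drop(columns=columns)
    return pd.concat([passthrough, pd.DataFrame(encoded, index=df.index)], axis=1)


df = pd.DataFrame(
    {
        "color": ["red", "blue", "green", "red"],
        "flag": ["yes", "no", "yes", "no"],
        "x": [1, 2, 3, 4],
    }
)
out = one_hot_encode(df, ["color", "flag"])
```

This leaves out unknown-category handling, top-N category limits, and NaN handling, all of which a real component would need.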
Cons:
- The sklearn OHE API and behavior are good, heh, and have been vetted by the community.
An alternative @chukarsten pointed out: we could propose an enhancement to the sklearn OHE to make per-column behavior easier.