ibis-ml
feat: alternative OHE using vectors
Background:
Instead of a full SQL translation of the one-hot encoding algorithm, he was envisioning more of a backend-registered function, which would probably be more performant.
That requires SQL metaprogramming to first enumerate the list of unique values in the columns and then craft the CREATE statement.
Additionally, it pigeonholes you into one-hot encoding having to return a separate column per value, whereas many frameworks (e.g. Spark MLlib, if I remember correctly) return something like a 2D vector in lieu of those columns, which can be handled much more efficiently from a memory/hardware perspective.
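To make the metaprogramming step above concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not how ibis-ml or any particular backend does it; the helper names (`quote_ident`, `quote_literal`, `build_one_hot_create`) and the `penguins` table are hypothetical and exist only for illustration. It first enumerates the distinct values, then generates a CREATE TABLE ... AS SELECT statement with one indicator column per value.

```python
import sqlite3


def quote_ident(name):
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + str(name).replace('"', '""') + '"'


def quote_literal(value):
    """Render a value as a SQL string literal, doubling single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_one_hot_create(conn, table, column, target):
    """Hypothetical helper: generate a CREATE statement for one-hot columns."""
    col = quote_ident(column)
    # Step 1: enumerate the unique values present in the column.
    query = f"SELECT DISTINCT {col} FROM {quote_ident(table)} ORDER BY 1"
    values = [row[0] for row in conn.execute(query) if row[0] is not None]
    # Step 2: craft one CASE expression (one output column) per value.
    select_parts = [col]
    for value in values:
        alias = quote_ident(f"{column}_{value}")
        select_parts.append(
            f"CASE WHEN {col} = {quote_literal(value)} THEN 1 ELSE 0 END AS {alias}"
        )
    select_list = ", ".join(select_parts)
    return (
        f"CREATE TABLE {quote_ident(target)} AS "
        f"SELECT {select_list} FROM {quote_ident(table)}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE penguins (species TEXT)")
conn.executemany(
    "INSERT INTO penguins VALUES (?)",
    [("Adelie",), ("Gentoo",), ("Adelie",), ("Chinstrap",)],
)
sql = build_one_hot_create(conn, "penguins", "species", "penguins_ohe")
conn.execute(sql)
rows = conn.execute("SELECT * FROM penguins_ohe").fetchall()
column_names = [d[0] for d in conn.execute("SELECT * FROM penguins_ohe").description]
```

Note that the shape of the output table depends on the data: a new category means a new generated statement, which is the rigidity the comment above is pointing at.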
Some response:
When fitting a one-hot encoder we already have to collect all the cases so they're consistent across all applications of transform. The only difference here would be whether a one-hot encoder should return a column per case or a single column holding an array of cases. I'd argue that since the consuming tooling will want a flat array, not special-casing one-hot encoding (for now) and returning a column per case is the correct approach. Also note that ibisml already has a OneHotEncode step that does all this.
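The point about fit can be illustrated with a small pure-Python sketch. This is not the ibisml OneHotEncode implementation; the class `OneHotSketch` and its method names are made up for illustration. It shows that the categories are collected once at fit time, and that the two output layouts under discussion differ only in how transform arranges the same indicators.

```python
class OneHotSketch:
    """Illustrative encoder (hypothetical, not the ibisml API)."""

    def fit(self, values):
        # Collect the cases once so every later transform uses the same set.
        self.categories_ = sorted({v for v in values if v is not None})
        return self

    def transform_columns(self, values, prefix):
        # Column-per-case layout: one named 0/1 column for each category.
        return {
            f"{prefix}_{cat}": [int(v == cat) for v in values]
            for cat in self.categories_
        }

    def transform_vector(self, values):
        # Array layout: one column whose entries are fixed-length 0/1 vectors.
        return [[int(v == cat) for cat in self.categories_] for v in values]


train = ["a", "b", "a", "c"]
new = ["b", "d"]  # "d" was not seen during fit

encoder = OneHotSketch().fit(train)
as_columns = encoder.transform_columns(new, "x")
as_vector = encoder.transform_vector(new)

print(as_columns)  # {'x_a': [0, 0], 'x_b': [1, 0], 'x_c': [0, 0]}
print(as_vector)   # [[0, 1, 0], [0, 0, 0]]
```

In this sketch an unseen value such as "d" encodes to all zeros in both layouts; how a real step should treat unseen values is a separate design decision.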