skrub
Faster alternative to GapEncoder
Problem Description
For encoding text and high-cardinality categories, we currently have MinHashEncoder, which is only suitable when the downstream learner is based on decision trees, and GapEncoder, which gives high-quality representations but is very slow. It would be good to have something similar to the GapEncoder but faster, maybe based on an SVD or scikit-learn's NMF.
Feature Description
An encoder that works similarly to GapEncoder but is faster, possibly at the cost of less interpretable topics or slightly reduced prediction performance.
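To make the idea concrete, here is a minimal sketch of what such an encoder could look like using only scikit-learn building blocks: TF-IDF over character n-grams followed by a truncated SVD. This is an assumption about one possible design, not an existing skrub API. The helper name `make_fast_text_encoder`, the n-gram range, and the number of components are all hypothetical choices for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def make_fast_text_encoder(n_components=3, random_state=0):
    """Hypothetical helper: character n-gram TF-IDF followed by truncated SVD."""
    return make_pipeline(
        # Character n-grams within word boundaries tolerate typos and
        # morphological variants (n-gram range chosen for illustration).
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        # Project the sparse n-gram matrix onto a few dense components.
        TruncatedSVD(n_components=n_components, random_state=random_state),
    )


# Toy high-cardinality column (made-up values, for illustration only).
categories = [
    "senior software engineer",
    "software engineer II",
    "police officer",
    "police sergeant",
    "school teacher",
    "teacher, special education",
]

encoder = make_fast_text_encoder(n_components=3)
embeddings = encoder.fit_transform(categories)
print(embeddings.shape)  # one row per input string, one column per component
```

Swapping `TruncatedSVD` for `sklearn.decomposition.NMF` should also work, since TF-IDF values are non-negative, and might keep the components closer to GapEncoder-style topics. Whether either variant matches GapEncoder's prediction performance would need benchmarking.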
related: #139
closing in favor of #1121