Online GapEncoder

Open • MaxHalford opened this issue 8 months ago • 1 comment

skrub is a wonderful new project related to scikit-learn. You can see Gaël Varoquaux present it here. They have a transformer called GapEncoder: it's a way to embed fuzzy strings. This could be really powerful online, say for classifying Tweets or Twitch messages, where typos are aplenty.

We already have a way to do online TF-IDF/count vectorization, but we don't have Gamma-Poisson matrix factorization. It is doable online, though. Once we have it, we could assemble the two into a nice GapEncoder class. See paper here.
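To make the count-vectorization half concrete, here is a minimal stdlib-only sketch of incremental character n-gram counting, which is the kind of input a Gamma-Poisson factorization would consume. The names `char_ngrams`, `cosine` and `OnlineNgramCounter` are hypothetical and are not part of river or skrub; the `learn_one`/`transform_one` method names only mimic river's conventions. It shows why n-gram counts tolerate typos, not how GapEncoder itself is implemented.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of a string, padded with one space on each side."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineNgramCounter:
    """Hypothetical helper: updates n-gram document frequencies one string at a time."""

    def __init__(self, n=3):
        self.n = n
        self.n_seen = 0            # number of strings seen so far
        self.doc_freq = Counter()  # in how many strings each n-gram appeared

    def learn_one(self, text):
        self.n_seen += 1
        # Count each n-gram at most once per string.
        self.doc_freq.update(set(char_ngrams(text, self.n)))

    def transform_one(self, text):
        # Raw counts; a factorization step would map these to topic activations.
        return char_ngrams(text, self.n)


counter = OnlineNgramCounter(n=3)
for message in ["london", "londno", "paris"]:
    counter.learn_one(message)

clean = counter.transform_one("london")
typo = counter.transform_one("londno")
other = counter.transform_one("paris")
print(cosine(clean, typo), cosine(clean, other))
```

A misspelled string still shares several trigrams with the correct one, so their count vectors stay close, while an unrelated string shares none. The factorization would then compress these sparse counts into a small number of latent topics.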

This is related to #1412. It may even work well without Gamma-Poisson matrix factorization: for instance, we could use decomposition.LDA, which we already have.

MaxHalford • Nov 03 '23 09:11