[FEAT] Add LSA encoder

Open Vincent-Maladiere opened this issue 1 year ago • 6 comments

Problem Description

Latent Semantic Analysis (LSA) consists of a TfidfVectorizer followed by Singular Value Decomposition (SVD). Scikit-learn mentions it in TruncatedSVD, and I wonder why it hasn't been implemented in scikit-learn in the first place, @GaelVaroquaux?

Feature Description

Create the LSAEncoder, a simple pipeline chaining TfidfVectorizer and TruncatedSVD (or a PCA, both support sparse matrices).
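For illustration, the proposed chaining could look roughly like the sketch below. It uses only scikit-learn's `make_pipeline`, `TfidfVectorizer`, and `TruncatedSVD`; the helper name `make_lsa_encoder`, the toy documents, and the choice of `n_components=2` are assumptions made for this example, not part of skrub's API.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def make_lsa_encoder(n_components=2, random_state=0):
    """Hypothetical helper: TF-IDF followed by truncated SVD (i.e. LSA)."""
    return make_pipeline(
        # Turn raw strings into a sparse TF-IDF matrix.
        TfidfVectorizer(),
        # Project the sparse matrix onto a few dense components.
        TruncatedSVD(n_components=n_components, random_state=random_state),
    )


# Toy corpus, made up for the example.
docs = [
    "the cat sat on the mat",
    "a cat and a dog",
    "stock markets fell sharply",
    "markets rallied after the report",
]

encoder = make_lsa_encoder(n_components=2)
embeddings = encoder.fit_transform(docs)
print(embeddings.shape)  # (4, 2): one dense row per input string
```

`TruncatedSVD` is used here because it accepts the sparse TF-IDF output directly and does not center the data.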

Alternative Solutions

No response

Additional Context

No response

Vincent-Maladiere • Oct 21 '24 12:10

Great!!

We need to think about a name. I think LSA is a bit of a technical name that might not ring a bell with non-technical users.

We brainstormed a bit on names with @jeromedockes and @rcap107. The name StringEncoder came to mind. It would be close to TextEncoder (https://github.com/skrub-data/skrub/pull/1077), but we feel that the difference is somewhat understandable.

That said, maybe it would be an argument to move the name TextEncoder to SentenceEncoder, which would also be (maybe) a good name because it would be more explicit (link to "SentenceTransformer")

GaelVaroquaux • Oct 21 '24 13:10

Very interesting! One might wonder why we don't consider the GapEncoder as a string encoder, though. WDYT?

Vincent-Maladiere • Oct 21 '24 15:10

> One might wonder why we don't consider the GapEncoder as a string encoder, though. WDYT?

Yes, this was raised, and it is true. I guess the one distinction I make is that the GapEncoder assumes more latent structure (aka dirty-category structure) than open-ended strings have.

One argument for naming it the "StringEncoder" is that if you really have no prior information on the data or the use of the encoding, it's probably a good default for encoding a string. Of course, we'll need a good "see also" section and a good discussion in the docs.

GaelVaroquaux • Oct 21 '24 16:10

Okay, this sounds easy to explain in the doc!

Vincent-Maladiere • Oct 21 '24 16:10

> Scikit-learn mentions LSA in TruncatedSVD, and I wonder why it hasn't been implemented in scikit-learn in the first place

Any thoughts @GaelVaroquaux? I'm curious

Vincent-Maladiere • Oct 21 '24 16:10

> Scikit-learn mentions LSA in TruncatedSVD, and I wonder why it hasn't been implemented in scikit-learn in the first place
>
> Any thoughts @GaelVaroquaux? I'm curious

Probably because it's easy to implement with the tools in scikit-learn, and since scikit-learn is general (not focused on text or the like), it didn't feel like it should be there.

GaelVaroquaux • Oct 21 '24 16:10

Implemented in StringEncoder #1159

GaelVaroquaux • Feb 26 '25 14:02