skrub icon indicating copy to clipboard operation
skrub copied to clipboard

Add a "drop_similar" argument to the TableVectorizer

Open GaelVaroquaux opened this issue 10 months ago • 7 comments

It would be great to make it really easy drop the redundant columns in the TableVectorizer (Using the DropSimilar transformer inside the TableVectorizer), by a simple additional argument.

This would have improvements both is speed/memory and maybe in statistical performance.

I realize that it makes the TableVectorizer even more a swiss-army knife than it currently is, but honestly, it's sooo useful and we use it everywhere, even as an element in complex pipelines.

GaelVaroquaux avatar Feb 26 '25 14:02 GaelVaroquaux

@GaelVaroquaux is this up for grabs? If yes, I'd like to contribute! :) I have experience with skrub, but this will be my first time contributing to the project itself, so I may ping you in this issue thread itself if I hit any roadblocks. Thanks!

Neilblaze avatar Mar 22 '25 07:03 Neilblaze

@GaelVaroquaux does this change look good? Tests are passing although I can see UserWarning warnings.

Neilblaze avatar Mar 22 '25 08:03 Neilblaze

Hi @Neilblaze ,

This is still up for grabs, and the change that you suggest are totally going in the right direction.

However, unless I am wrong, we don't have the DropSimilar transformer yet. It is issue #1001 , the parent issue. That issue would need to be tackled first, and I suggest that we go through the whole process of tackling it and merging it before moving on with this one. Understanding the process will then make you more productive.

Thanks!!

GaelVaroquaux avatar Mar 22 '25 08:03 GaelVaroquaux

@GaelVaroquaux Makes sense and thanks for the update! I'll keep track of #1001 for now, and hopefully when it gets merged into the main, we can continue with this.

Neilblaze avatar Mar 22 '25 08:03 Neilblaze

There is no PR for #1001 for now. I'm not sure when we'll get it done. We'd love to, but we're shuffling so many things 😁

GaelVaroquaux avatar Mar 22 '25 08:03 GaelVaroquaux

Yep, I saw that. So for the meantime, I'll try to grab some other open issues. Will ping here/on discord if I get stuck anywhere, and thanks! 😃

Neilblaze avatar Mar 22 '25 08:03 Neilblaze

Now that skrub has the DropUninformative transformer, this argument should be implmented there, rather than as a parameter of the TableVectorizer.

rcap107 avatar Sep 23 '25 08:09 rcap107