Add a "drop_similar" argument to the TableVectorizer
It would be great to make it really easy to drop redundant columns in the TableVectorizer (using the DropSimilar transformer inside the TableVectorizer) through a simple additional argument.
This would bring improvements both in speed/memory and maybe in statistical performance.
I realize that it makes the TableVectorizer even more a swiss-army knife than it currently is, but honestly, it's sooo useful and we use it everywhere, even as an element in complex pipelines.
@GaelVaroquaux is this up for grabs? If yes, I'd like to contribute! :) I have experience with skrub, but this will be my first time contributing to the project itself, so I may ping you in this issue thread itself if I hit any roadblocks. Thanks!
@GaelVaroquaux does this change look good? Tests are passing, although I can see some UserWarning messages.
Hi @Neilblaze ,
This is still up for grabs, and the changes that you suggest are totally going in the right direction.
However, unless I am wrong, we don't have the DropSimilar transformer yet. It is issue #1001, the parent issue. That issue would need to be tackled first, and I suggest that we go through the whole process of tackling it and merging it before moving on with this one. Understanding the process will then make you more productive.
Thanks!!
@GaelVaroquaux Makes sense, and thanks for the update! I'll keep track of #1001 for now, and hopefully once it gets merged into main, we can continue with this.
There is no PR for #1001 for now. I'm not sure when we'll get it done. We'd love to, but we're shuffling so many things 😁
Yep, I saw that. So in the meantime, I'll try to grab some other open issues. Will ping here/on discord if I get stuck anywhere, and thanks! 😃
Now that skrub has the DropUninformative transformer, this argument should be implemented there, rather than as a parameter of the TableVectorizer.