Implement the SuperVectorizer and dirty_cat's encoders to the search space

Open LilianBoulard opened this issue 3 years ago • 4 comments

This PR adds dirty_cat's encoders (currently SimilarityEncoder, GapEncoder and MinHashEncoder) to GAMA's search space by way of the SuperVectorizer.

The point of adding the dirty_cat encoders is to let GAMA handle dirty categorical features in tabular data, i.e. string columns with typos, inconsistent spelling or many near-duplicate categories.
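
To illustrate why this helps, here is a minimal standard-library sketch of the general idea behind similarity-based encoding: strings are compared through their character n-grams, so a misspelled category ends up close to its clean counterpart instead of becoming an unrelated one-hot column. This is not dirty_cat's implementation; the function names, the choice of Jaccard similarity and the example values are all assumptions made for illustration.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lowercased string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its similarity to every prototype category."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical dirty column: same jobs, inconsistent spelling.
prototypes = ["police officer", "firefighter"]
dirty_column = ["Police Officer", "police oficer", "fire fighter"]

encoded = similarity_encode(dirty_column, prototypes)
for value, row in zip(dirty_column, encoded):
    print(value, [round(x, 2) for x in row])
```

Each row has one feature per prototype, and the misspelled entries still score highest against the category they were meant to be.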

The SuperVectorizer provides a simplified interface to sklearn's ColumnTransformer and makes it possible to mix and match different encoding techniques.
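
As a rough picture of what "mix and match" means here, the sketch below routes columns to different encoder groups based on their type and number of distinct values, which is the kind of bookkeeping one would otherwise write by hand for a ColumnTransformer. It is a plain-Python illustration, not the SuperVectorizer's code: `split_by_cardinality`, the threshold value and the example table are hypothetical, and which encoder is attached to each group is left open.

```python
def split_by_cardinality(table, cardinality_threshold=40):
    """Group column names by how they would need to be encoded.

    `table` maps column names to lists of values. Columns holding only
    numbers are "numeric"; other columns are split on their number of
    distinct values (strictly below the threshold counts as low).
    """
    groups = {"numeric": [], "low_cardinality": [], "high_cardinality": []}
    for name, values in table.items():
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if is_numeric:
            groups["numeric"].append(name)
        elif len(set(values)) < cardinality_threshold:
            groups["low_cardinality"].append(name)
        else:
            groups["high_cardinality"].append(name)
    return groups


# Hypothetical table; the threshold is set very low to suit the tiny example.
table = {
    "age": [31, 45, 27, 52],
    "department": ["HR", "IT", "HR", "IT"],
    "job_title": ["Police Officer", "police oficer", "Fire fighter", "firefighter"],
}

groups = split_by_cardinality(table, cardinality_threshold=3)
print(groups)
```

In this picture, the low-cardinality group could go to a one-hot style encoder and the high-cardinality group to one of the dirty_cat encoders, and that per-group choice is what the search space would tune.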

This PR requires features introduced in dirty_cat 0.3. However, at the time of writing (August 2022), that version has not been released yet.

TODO:

  • [x] wait for dirty_cat 0.3 to be out
  • [ ] fine-tune the preprocessing search space
  • [ ] benchmark GAMA to compare the performance before and after the introduction of the SuperVectorizer

LilianBoulard avatar Aug 25 '22 10:08 LilianBoulard

Please give me a ping here as soon as dirty_cat 0.3 is released :)

PGijsbers avatar Aug 25 '22 12:08 PGijsbers

Hi Pieter, dirty_cat 0.3 is out!

LilianBoulard avatar Sep 14 '22 09:09 LilianBoulard

I've allowed CI to run now, and I'll try to take a closer look over this week and the next. I will probably do the 22.0.0 release without this PR (I was planning to do that today or tomorrow, as the current PyPI package is broken due to updated dependencies), so ignore the message about adding things to the changelog; I'll do that later when preparing for 22.1.0.

PGijsbers avatar Sep 14 '22 09:09 PGijsbers

Ah, it looks like the unit tests which used pre-defined individuals are broken now (to be expected). I am not entirely sure how I want to fix that: it depends on whether we want to keep the old behavior available as an alternative, and that in turn depends on a small benchmark. So I don't think there's much you can do right now as far as improving the tests/code.

Running some additional experiments to define a sensible default search space, as noted in the OP, should be possible and is appreciated :)

PGijsbers avatar Sep 14 '22 09:09 PGijsbers