`GapEncoder` is slow
Following some experiments I did as part of my work on GAMA, I noticed the GapEncoder is very slow on medium-to-large datasets.
As discussed with @alexis-cvetkov, it's also something he noticed during his experiments.
A solution suggested by Gaël would be to early-stop the iterative process, which would make fitting faster at the cost of some accuracy.
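In the meantime, a rough way to approximate this would be to tighten the encoder's existing stopping parameters (a sketch, assuming the tol / min_iter / max_iter arguments behave as documented; the values below are arbitrary guesses, not tuned recommendations):

from dirty_cat import GapEncoder

# Sketch: stop the online EM loop earlier by loosening the tolerance
# and lowering the iteration caps.
gap = GapEncoder(
    n_components=30,
    tol=1e-2,    # looser convergence criterion than the default
    min_iter=1,  # allow stopping after the first pass
    max_iter=3,  # hard cap on the number of passes over the data
)

This is not the principled early stopping suggested above, just knob-turning to gauge the speed/accuracy trade-off.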
We would need examples of datasets on which it was too slow, to do some empirical work.
The issue might become apparent with traffic_violations. Code to reproduce:
from dirty_cat import SuperVectorizer
from dirty_cat.datasets import fetch_traffic_violations
ds = fetch_traffic_violations()
sv = SuperVectorizer()
sv.fit(ds.X) # This will take a while...
print(sv.transformers)
# This should print the columns associated with the `GapEncoder`.
Later on, to reproduce without the (slight) overhead of the SuperVectorizer, simply instantiate a GapEncoder and fit it on the columns listed by sv.transformers.
from dirty_cat import GapEncoder

# `columns` is the subset of ds.X containing the high-cardinality
# columns listed under "high_card_cat" in sv.transformers
gap = GapEncoder()
gap.fit(columns)
Can you point me to the column that is long to encode? Because the script is taking forever on my machine :-/
Yeah, that's the issue :sweat_smile: I've launched it on a server, I'll update you as soon as it's finished. Otherwise, you can try using a subset of the dataset's samples.
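Subsampling could look something like this (a sketch; I'm assuming ds.X is a pandas DataFrame as in the snippet above, and 50_000 rows is an arbitrary choice):

from dirty_cat import SuperVectorizer
from dirty_cat.datasets import fetch_traffic_violations

ds = fetch_traffic_violations()
# Work on a random subset of rows to keep the runtime manageable.
X_small = ds.X.sample(n=50_000, random_state=0)

sv = SuperVectorizer()
sv.fit(X_small)
print(sv.transformers)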
Here's the output of the above code! Sorry for the delay
[
("datetime", DatetimeEncoder(), ["date_of_stop", "time_of_stop"]),
("low_card_cat", OneHotEncoder(drop="if_binary"), ["agency", "subagency", "accident", "belts", "personal_injury", "property_damage", "fatal", "commercial_license", "hazmat", "commercial_vehicle", "alcohol", "work_zone", "search_conducted", "search_disposition", "search_outcome", "search_reason", "search_type", "search_arrest_reason", "vehicletype", "color", "article", "race", "gender", "arrest_type"]),
("high_card_cat", GapEncoder(n_components=30), ["seqid", "description", "location", "search_reason_for_stop", "state", "make", "model", "charge", "driver_city", "driver_state", "dl_state", "geolocation"])
]
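To help answer the question above about which column takes long to encode, a rough per-column timing loop could look like this (a sketch; high_card_cols simply copies the list from the output above, and the timings will of course depend on the machine):

import time
from dirty_cat import GapEncoder

high_card_cols = [
    "seqid", "description", "location", "search_reason_for_stop",
    "state", "make", "model", "charge", "driver_city",
    "driver_state", "dl_state", "geolocation",
]

for col in high_card_cols:
    gap = GapEncoder(n_components=30)
    start = time.perf_counter()
    gap.fit(ds.X[[col]])  # fit on a single column at a time
    print(f"{col}: {time.perf_counter() - start:.1f}s")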
Sorry, I also forgot to report the conclusion of my experiments. I did not find any major bottleneck in the encoder. From my experience, the GapEncoder is slow because it diverges, and therefore always reaches the maximum number of iterations. Also, the longer we let it run, the worse the downstream model performs afterwards. This should be checked on other datasets.
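One cheap way to check the "always hits max_iter" hypothesis without touching the encoder's internals would be to time fit for a few values of max_iter: if the fit time grows roughly linearly with max_iter, the tolerance-based stop is never triggering (a sketch, assuming X holds one of the high-cardinality columns):

import time
from dirty_cat import GapEncoder

for max_iter in (1, 2, 5, 10):
    gap = GapEncoder(n_components=30, max_iter=max_iter)
    start = time.perf_counter()
    gap.fit(X)
    print(f"max_iter={max_iter}: {time.perf_counter() - start:.1f}s")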
Hi, do you remember on which dataset you saw the GapEncoder diverge? I can't reproduce this on the traffic_violations dataset.
Fixed by #680