`GapEncoder` is slow
Following some experiments I did as part of my work on GAMA, I noticed the GapEncoder is very slow on medium-to-large datasets.
As discussed with @alexis-cvetkov, it's also something he noticed during his experiments.
A solution suggested by Gaël would be to early-stop the iterative process, which would make fitting faster at the cost of some accuracy.
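In the meantime, a rough way to approximate this would be to tighten the encoder's existing stopping parameters (a sketch, assuming the tol / min_iter / max_iter arguments behave as documented; the values below are arbitrary guesses, not tuned recommendations):

from dirty_cat import GapEncoder

# Sketch: stop the online EM loop earlier by loosening the tolerance
# and lowering the iteration caps.
gap = GapEncoder(
    n_components=30,
    tol=1e-2,    # looser convergence criterion than the default
    min_iter=1,  # allow stopping after the first pass
    max_iter=3,  # hard cap on the number of passes over the data
)

This is not the principled early stopping suggested above, just knob-turning to gauge the speed/accuracy trade-off.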
We would need examples of datasets on which it was too slow, to do some empirical work.
The issue might become apparent with traffic_violations. Code to reproduce:
from dirty_cat import SuperVectorizer
from dirty_cat.datasets import fetch_traffic_violations
ds = fetch_traffic_violations()
sv = SuperVectorizer()
sv.fit(ds.X) # This will take a while...
print(sv.transformers)
# This should print the columns associated with the `GapEncoder`.
Later on, to reproduce without the (slight) overhead of the SuperVectorizer, simply instantiate a GapEncoder and fit it on the columns listed by sv.transformers.
from dirty_cat import GapEncoder

# `columns` is the subset of ds.X containing the high-cardinality
# columns listed under "high_card_cat" in sv.transformers
gap = GapEncoder()
gap.fit(columns)
Can you point me to the column that is long to encode? Because the script is taking forever on my machine :-/
Yeah, that's the issue :sweat_smile: I've launched it on a server, I'll update you as soon as it's finished. Otherwise, you can try using a subset of the dataset's samples.
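Subsampling could look something like this (a sketch; I'm assuming ds.X is a pandas DataFrame as in the snippet above, and 50_000 rows is an arbitrary choice):

from dirty_cat import SuperVectorizer
from dirty_cat.datasets import fetch_traffic_violations

ds = fetch_traffic_violations()
# Work on a random subset of rows to keep the runtime manageable.
X_small = ds.X.sample(n=50_000, random_state=0)

sv = SuperVectorizer()
sv.fit(X_small)
print(sv.transformers)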
Here's the output of the above code! Sorry for the delay
[
("datetime", DatetimeEncoder(), ["date_of_stop", "time_of_stop"]),
("low_card_cat", OneHotEncoder(drop="if_binary"), ["agency", "subagency", "accident", "belts", "personal_injury", "property_damage", "fatal", "commercial_license", "hazmat", "commercial_vehicle", "alcohol", "work_zone", "search_conducted", "search_disposition", "search_outcome", "search_reason", "search_type", "search_arrest_reason", "vehicletype", "color", "article", "race", "gender", "arrest_type"]),
("high_card_cat", GapEncoder(n_components=30), ["seqid", "description", "location", "search_reason_for_stop", "state", "make", "model", "charge", "driver_city", "driver_state", "dl_state", "geolocation"])
]
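To help answer the question above about which column takes long to encode, a rough per-column timing loop could look like this (a sketch; high_card_cols simply copies the list from the output above, and the timings will of course depend on the machine):

import time
from dirty_cat import GapEncoder

high_card_cols = [
    "seqid", "description", "location", "search_reason_for_stop",
    "state", "make", "model", "charge", "driver_city",
    "driver_state", "dl_state", "geolocation",
]

for col in high_card_cols:
    gap = GapEncoder(n_components=30)
    start = time.perf_counter()
    gap.fit(ds.X[[col]])  # fit on a single column at a time
    print(f"{col}: {time.perf_counter() - start:.1f}s")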
Sorry, I also forgot to report the conclusion of my experiments. I did not find any major bottleneck in the encoder. From my experience, the GapEncoder is slow because it diverges, and therefore always reaches the maximum number of iterations. Also, the longer we let it run, the worse the downstream model performs afterwards. This should be checked on other datasets.
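One cheap way to check the "always hits max_iter" hypothesis without touching the encoder's internals would be to time fit for a few values of max_iter: if the fit time grows roughly linearly with max_iter, the tolerance-based stop is never triggering (a sketch, assuming X holds one of the high-cardinality columns):

import time
from dirty_cat import GapEncoder

for max_iter in (1, 2, 5, 10):
    gap = GapEncoder(n_components=30, max_iter=max_iter)
    start = time.perf_counter()
    gap.fit(X)
    print(f"max_iter={max_iter}: {time.perf_counter() - start:.1f}s")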
Hi, do you remember on which dataset you saw the GapEncoder diverge? I can't reproduce this on the traffic_violations dataset.
Fixed by #680