Jérôme Dockès
@MarcoGorelli in case you have the time, I'm sure you would have advice on making better use of the dataframe API in this one!
after all, we won't be using the dataframe API for this, so it will be easier to just start a new branch
if we don't have time to merge this before the release, maybe we should at least remove mentions of ndarrays from the `TableVectorizer` docstring
I was going to do it, but I thought it would just introduce conflicts with the current `TableVectorizer` PR
For `GapEncoder` and "department": `GapEncoder` converts the input to a numpy array, then finds and handles missing values by calling `sklearn.utils.fixes._object_dtype_isnan` https://github.com/skrub-data/skrub/blob/fade2006aa6a57255ac77e170b2516e2b41f48f2/skrub/_gap_encoder.py#L860 This in turn finds null values by comparing `X != X`....
For `GapEncoder` with `np.nan`: this one is actually not related to missing values; if you don't insert missing values you get the same error. The default n-gram range of the...
For `GapEncoder` "gender" and `None`: the behavior is actually the same as for the high-cardinality "department". What matters is whether the first (index `0`) value is `None` or not, because...
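To illustrate the `X != X` comparison mentioned above, here is a small sketch on an object-dtype array. The values are made up and this is not the skrub code, only the same elementwise self-comparison: `np.nan` is the only entry that is unequal to itself, so a `None` is not flagged by this check.

```python
import numpy as np

# Made-up object-dtype column containing one NaN and one None.
X = np.array(["finance", np.nan, None, "police"], dtype=object)

# Elementwise self-comparison: NaN != NaN is True, everything else is False.
mask = X != X
print(mask.tolist())  # [False, True, False, False]
```

So with this kind of check, `np.nan` and `None` are not treated the same way.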
For `deduplicate`: `deduplicate` performs no special handling of missing values, so the call to `np.unique` on the first line fails whenever there are any.
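A minimal reproduction of that kind of failure, assuming the input reaches `np.unique` as an object-dtype array (the values and the `return_counts=True` argument are my assumptions for the example, not copied from `deduplicate`): sorting has to compare a `str` with `None`, which raises a `TypeError`.

```python
import numpy as np

# Made-up input: strings mixed with a missing value, stored as objects.
values = np.array(["Paris", None, "Pariss"], dtype=object)

try:
    np.unique(values, return_counts=True)
    failed = False
except TypeError:
    # Sorting mixed str / None objects is not supported.
    failed = True

print(failed)
```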
> Actually my comment above does not apply to deduplicate

why not? couldn't we deduplicate the other non-missing strings and leave the missing values missing?
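A rough sketch of what that could look like, only for the first step of counting unique strings. `unique_non_missing` is a hypothetical helper, not something that exists in skrub, and the missing-value check (`None` or a value unequal to itself) is an assumption about what should count as missing.

```python
import numpy as np


def unique_non_missing(values):
    """Hypothetical helper: count unique strings, ignoring missing entries."""
    values = np.asarray(values, dtype=object)
    # Treat None and NaN (the only value unequal to itself) as missing.
    missing = np.array([v is None or v != v for v in values], dtype=bool)
    # Only the non-missing entries are passed to np.unique.
    unique_words, counts = np.unique(
        values[~missing].astype(str), return_counts=True
    )
    return unique_words, counts, missing


words, counts, missing = unique_non_missing(
    ["Paris", None, "Pariss", "Paris", np.nan]
)
print(words.tolist(), counts.tolist(), missing.tolist())
```

The `missing` mask could then be used to put the missing values back in their original positions after the non-missing strings have been deduplicated.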
I think I prefer your suggested alternative, as those lists tend to become outdated and it is a bit difficult & arbitrary to decide what should go in the list