Jérôme Dockès comments

Results: 396 comments of Jérôme Dockès

[WIP] Interpolationjoiner dataframe api

@MarcoGorelli in case you have the time I'm sure you would have advice for better use of the dataframe API in this one!

[WIP] Interpolationjoiner dataframe api

after all we won't be using the dataframe API for this so it will be easier to just start a new branch

[ENH] Drop numpy array input support

if we don't have time to merge this before the release maybe we should at least remove mentions of ndarrays from the tablevectorizer docstring

[ENH] Drop numpy array input support

I was going to do it but I thought it would just introduce conflicts with the current tablevectorizer PR

Missing values support is not consistent

For `GapEncoder` and "department": `GapEncoder` converts the input to a numpy array, then finds and handles missing values by calling `sklearn.utils.fixes._object_dtype_isnan` (https://github.com/skrub-data/skrub/blob/fade2006aa6a57255ac77e170b2516e2b41f48f2/skrub/_gap_encoder.py#L860). This in turn finds null values by comparing `X != X`....
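
To illustrate why a self-comparison treats `np.nan` and `None` differently, here is a minimal sketch in plain Python. It does not call numpy or scikit-learn: the list stands in for an object-dtype array, and `self_comparison_mask` is a hypothetical helper name used only for this illustration.

```python
def self_comparison_mask(values):
    """Flag entries that are not equal to themselves (illustrative helper)."""
    return [v != v for v in values]


# A float NaN is not equal to itself, whereas None compares equal to None,
# so only the NaN entry is flagged as missing.
values = ["sales", float("nan"), None, "hr"]
mask = self_comparison_mask(values)
print(mask)  # [False, True, False, False]
```

Under this check a `None` entry is not detected as missing, which would be consistent with the two kinds of missing value being handled differently.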

Missing values support is not consistent

For `GapEncoder` with `np.nan`: this one is actually not related to missing values; if you don't insert missing values you get the same error. The default n-gram range of the...

Missing values support is not consistent

For `GapEncoder` "gender" and `None`: the behavior is actually the same as for the high-cardinality "department", what matters is whether the first (index `0`) value is `None` or not, because...

Missing values support is not consistent

For `deduplicate`: `deduplicate` performs no special handling of missing values, so the call to `np.unique` on the first line fails whenever there are any.

Missing values support is not consistent

> Actually my comment above does not apply to deduplicate

why not? couldn't we deduplicate the other non-missing strings and leave the missing values missing?
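
A rough sketch of what that suggestion could look like, assuming the deduplication step can be passed in as a function that maps a list of strings to a list of corrected strings of the same length. The names `is_missing`, `deduplicate_keeping_missing` and `toy_dedup` are hypothetical and are not part of skrub; `toy_dedup` is only a stand-in for the real deduplication.

```python
def is_missing(value):
    # None, or a float NaN (a NaN is not equal to itself).
    return value is None or value != value


def deduplicate_keeping_missing(values, dedup_fn):
    """Apply dedup_fn to the non-missing entries only; leave the rest as-is."""
    present = [i for i, v in enumerate(values) if not is_missing(v)]
    cleaned = dedup_fn([values[i] for i in present])
    result = list(values)
    for i, new_value in zip(present, cleaned):
        result[i] = new_value
    return result


def toy_dedup(strings):
    # Stand-in for the real deduplication: normalize case and whitespace.
    return [s.strip().lower() for s in strings]


out = deduplicate_keeping_missing(
    ["Paris", None, " paris ", float("nan")], toy_dedup
)
print(out[:3])  # ['paris', None, 'paris']
```

The missing entries keep their original positions and values, so the output stays aligned with the input.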

Add a "related projects" section in the documentation

I think I prefer your suggested alternative, as those lists tend to become outdated and it is a bit difficult & arbitrary to decide what should go in the list