Missing values support is not consistent
GapEncoder and deduplicate raise different errors when dealing with None or np.nan values, unlike MinHashEncoder and SimilarityEncoder, which run successfully.
Interestingly, the errors differ depending on whether the column to encode is high-cardinality, like "department", or low-cardinality/binary, like "gender" (both from the employee salaries dataset).
In the table below, we replace values in the columns "department" and "gender" with either np.nan or None. For example, "department" with None corresponds to:
```python
from skrub import GapEncoder
from skrub.datasets import fetch_employee_salaries

df = fetch_employee_salaries().X
# Replace one category with None to introduce missing values
df["department"] = df["department"].replace({"POL": None})

GapEncoder().fit_transform(df[["department"]])
# AssertionError: Input data is not string.
```
| "department" with np.nan | "department" with None | "gender" with np.nan | "gender" with None | |
|---|---|---|---|---|
| GapEncoder | Success | AssertionError: Input data is not string | ValueError: empty vocabulary; perhaps the documents only contain stop words | TypeError: '<' not supported between instances of 'NoneType' and 'str' |
| deduplicate | # TypeError: '<' not supported between instances of 'NoneType' and 'NoneType' | # TypeError: '<' not supported between instances of 'NoneType' and 'NoneType' | TypeError: '<' not supported between instances of 'float' and 'str' | TypeError: '<' not supported between instances of 'NoneType' and 'str' |
For GapEncoder and "department":
GapEncoder converts the input to a numpy array, then detects and handles missing values by calling sklearn.utils.fixes._object_dtype_isnan:
https://github.com/skrub-data/skrub/blob/fade2006aa6a57255ac77e170b2516e2b41f48f2/skrub/_gap_encoder.py#L860
This in turn finds null values by comparing X != X.
np.nan != np.nan is True, but None != None is False, which is why this method does not detect None entries as missing values: they are not imputed (replaced with ""), and the later check asserting that the first value in the series is a string fails.
Note that calling _object_dtype_isnan before extracting the dataframe values into a numpy array, or simply using pd.isnull / pd.isna, would correctly detect the None entries.
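A quick illustration of the difference (the object array stands in for the extracted column values):

```python
import numpy as np
import pandas as pd

X = np.array(["POL", None, np.nan], dtype=object)

# X != X only flags np.nan, because None != None evaluates to False:
print(X != X)      # [False False  True]

# pd.isna detects both kinds of missing values:
print(pd.isna(X))  # [False  True  True]
```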
For GapEncoder with np.nan in "gender": this one is actually not related to missing values; you get the same error without inserting any. The default n-gram range of the CountVectorizer starts at 2, so documents of length 1 produce 0 tokens, and the column contains only "F" and "M".
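This can be reproduced with a bare CountVectorizer, assuming the char analyzer and (2, 4) n-gram range that GapEncoder uses by default:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Single-character documents produce no 2-grams, so the vocabulary is empty
CountVectorizer(analyzer="char", ngram_range=(2, 4)).fit(["F", "M"])
# ValueError: empty vocabulary; perhaps the documents only contain stop words
```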
IMHO, we should strive not to error by default on missing values.
For GapEncoder "gender" and None: the behavior is actually the same as for the high-cardinality "department", what matters is whether the first (index 0) value is None or not, because the check only looks at the first:
https://github.com/skrub-data/skrub/blob/fade2006aa6a57255ac77e170b2516e2b41f48f2/skrub/_gap_encoder.py#L304
If it is None, the assertion fails at this point.
If the None is elsewhere, the check passes, but later a call to np.unique fails in the CountVectorizer when it builds its vocabulary.
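The underlying failure is np.unique trying to sort a mix of None and strings:

```python
import numpy as np

np.unique(np.array(["F", None, "M"], dtype=object))
# TypeError: '<' not supported between instances of 'NoneType' and 'str'
```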
For deduplicate: deduplicate performs no special handling of missing values, so the call to np.unique on its first line fails whenever there are any.
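For example (a minimal made-up input; the exact type pairing in the message depends on which comparison the sort hits first):

```python
from skrub import deduplicate

# The initial np.unique call fails as soon as a missing value is present
deduplicate(["POL", "HHS", None])
# TypeError: '<' not supported between instances of 'NoneType' and 'str'
```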
Actually my comment above does not apply to deduplicate.
why not? couldn't we deduplicate the other non-missing strings and leave the missing values missing?
Actually, given that we are matching only on one column, it does make sense indeed. So agreed with your proposal.
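A rough sketch of what that could look like, as a hypothetical wrapper (deduplicate_keep_missing is not skrub API; this assumes deduplicate returns a same-length sequence of corrected strings, as its examples suggest):

```python
import pandas as pd
from skrub import deduplicate

def deduplicate_keep_missing(series: pd.Series) -> pd.Series:
    """Deduplicate the non-missing entries and leave missing ones untouched.

    Hypothetical helper illustrating the proposal, not skrub API.
    """
    missing = series.isna()
    result = series.copy()
    # Only pass actual strings to deduplicate
    result[~missing] = deduplicate(list(series[~missing]))
    return result
```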