Missing values support is not consistent
GapEncoder and deduplicate raise different errors when dealing with None or np.nan values, unlike MinHashEncoder and SimilarityEncoder, which run successfully.
Interestingly, the errors differ depending on whether the column to encode is high-cardinality, like "department", or low-cardinality/binary, like "gender" (both from the employee salaries dataset).
In the table below, we replace values in the columns "department" and "gender" with either np.nan or None. For example, "department" with None corresponds to:
```python
from skrub import GapEncoder
from skrub.datasets import fetch_employee_salaries

df = fetch_employee_salaries().X
# Replace one category with None to introduce missing values
df["department"] = df["department"].replace({"POL": None})

GapEncoder().fit_transform(df[["department"]])
# AssertionError: Input data is not string.
```
| "department" with np.nan | "department" with None | "gender" with np.nan | "gender" with None | |
|---|---|---|---|---|
| GapEncoder | Success | AssertionError: Input data is not string | ValueError: empty vocabulary; perhaps the documents only contain stop words | TypeError: '<' not supported between instances of 'NoneType' and 'str' |
| deduplicate | # TypeError: '<' not supported between instances of 'NoneType' and 'NoneType' | # TypeError: '<' not supported between instances of 'NoneType' and 'NoneType' | TypeError: '<' not supported between instances of 'float' and 'str' | TypeError: '<' not supported between instances of 'NoneType' and 'str' |
For GapEncoder and "department":
GapEncoder converts the input to a numpy array, then detects and handles missing values by calling sklearn.utils.fixes._object_dtype_isnan:
https://github.com/skrub-data/skrub/blob/fade2006aa6a57255ac77e170b2516e2b41f48f2/skrub/_gap_encoder.py#L860
This in turn finds null values by comparing X != X.
np.nan != np.nan is True, but None != None is False, which is why this method does not detect None entries as missing values: they are not imputed (replaced with ""), and the later check asserting that the first value in the series is a string fails.
Note that calling _object_dtype_isnan before extracting the dataframe values into a numpy array, or simply using pd.isnull / pd.isna, would correctly detect the None entries.
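A quick illustration of the difference (the object array stands in for the extracted column values):

```python
import numpy as np
import pandas as pd

X = np.array(["POL", None, np.nan], dtype=object)

# X != X only flags np.nan, because None != None evaluates to False:
print(X != X)      # [False False  True]

# pd.isna detects both kinds of missing values:
print(pd.isna(X))  # [False  True  True]
```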
For GapEncoder with np.nan in "gender": this one is actually not related to missing values; you get the same error without inserting any. The default n-gram range of the CountVectorizer starts at 2, so documents of length 1 produce 0 tokens, and the column contains only "F" and "M".
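This can be reproduced with a bare CountVectorizer, assuming the char analyzer and (2, 4) n-gram range that GapEncoder uses by default:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Single-character documents produce no 2-grams, so the vocabulary is empty
CountVectorizer(analyzer="char", ngram_range=(2, 4)).fit(["F", "M"])
# ValueError: empty vocabulary; perhaps the documents only contain stop words
```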
IMHO, we should strive not to error by default on missing values.
For GapEncoder "gender" and None: the behavior is actually the same as for the high-cardinality "department", what matters is whether the first (index 0) value is None or not, because the check only looks at the first:
https://github.com/skrub-data/skrub/blob/fade2006aa6a57255ac77e170b2516e2b41f48f2/skrub/_gap_encoder.py#L304
If it is None, the assertion fails at this point.
If the None is elsewhere, the check passes, but later a call to np.unique fails in the CountVectorizer when it builds its vocabulary.
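The underlying failure is np.unique trying to sort a mix of None and strings:

```python
import numpy as np

np.unique(np.array(["F", None, "M"], dtype=object))
# TypeError: '<' not supported between instances of 'NoneType' and 'str'
```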
For deduplicate: deduplicate performs no special handling of missing values, so the call to np.unique on its first line fails whenever there are any.
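For example (a minimal made-up input; the exact type pairing in the message depends on which comparison the sort hits first):

```python
from skrub import deduplicate

# The initial np.unique call fails as soon as a missing value is present
deduplicate(["POL", "HHS", None])
# TypeError: '<' not supported between instances of 'NoneType' and 'str'
```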
Actually my comment above does not apply to deduplicate.
why not? couldn't we deduplicate the other non-missing strings and leave the missing values missing?
Actually, given that we are matching only on one column, it does make sense indeed. So agreed with your proposal.
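A rough sketch of what that could look like, as a hypothetical wrapper (deduplicate_keep_missing is not skrub API; this assumes deduplicate returns a same-length sequence of corrected strings, as its examples suggest):

```python
import pandas as pd
from skrub import deduplicate

def deduplicate_keep_missing(series: pd.Series) -> pd.Series:
    """Deduplicate the non-missing entries and leave missing ones untouched.

    Hypothetical helper illustrating the proposal, not skrub API.
    """
    missing = series.isna()
    result = series.copy()
    # Only pass actual strings to deduplicate
    result[~missing] = deduplicate(list(series[~missing]))
    return result
```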