Deduplicate fails

Open MarieSacksick opened this issue 1 month ago • 1 comment

Describe the bug

I wanted to test the deduplicate feature on a very messy dataset for the MOOC, but the data turns out to be so messy that the feature fails.
I haven't dug into why this happens.

Steps/Code to Reproduce

# %%
import pandas as pd
df = pd.read_csv("data_farm.csv")
# %%
from skrub import Cleaner, deduplicate, TableReport
# %%
cleaned = Cleaner().fit_transform(df)
# %%
col_name = "22. If above is YES, Do you follow techniques to enhance water use efficiency? "
# col_name = "21. If above is YES, Do you irrigate based on crop water need or abruptly when water is available?"
# to investigate the content
TableReport(cleaned[col_name])
cleaned[col_name].value_counts()
# %%
dedup = deduplicate(cleaned[col_name])

Data used: data_farm.csv

Expected Results

No error is thrown.
The error occurs both with and without the Cleaner.

Actual Results

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /home/marie/Documents/mooc_skrub.py:4
      2 col_name = "22. If above is YES, Do you follow techniques to enhance water use efficiency? "
      3 # col_name = "21. If above is YES, Do you irrigate based on crop water need or abruptly when water is available?"
----> 4 dedup = deduplicate(cleaned[col_name])

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/skrub/_deduplicate.py:232, in deduplicate(X, n_clusters, ngram_range, analyzer, linkage_method, n_jobs)
    131 def deduplicate(
    132     X,
    133     *,
   (...)
    138     n_jobs=None,
    139 ):
    140     """Deduplicate categorical data by hierarchically clustering similar strings.
    141 
    142     This works best if there are a number of underlying categories that
   (...)
    230 'white', 'white', 'white', 'white', 'white']
    231     """
--> 232     unique_words, counts = np.unique(X, return_counts=True)
    233     distance_mat = compute_ngram_distance(
    234         unique_words, ngram_range=ngram_range, analyzer=analyzer
    235     )
    237     Z = linkage(distance_mat, method=linkage_method, optimal_ordering=True)

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/numpy/lib/_arraysetops_impl.py:286, in unique(ar, return_index, return_inverse, return_counts, axis, equal_nan)
    284 ar = np.asanyarray(ar)
    285 if axis is None:
--> 286     ret = _unique1d(ar, return_index, return_inverse, return_counts,
    287                     equal_nan=equal_nan, inverse_shape=ar.shape, axis=None)
    288     return _unpack_tuple(ret)
    290 # axis was specified and not None

File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/numpy/lib/_arraysetops_impl.py:353, in _unique1d(ar, return_index, return_inverse, return_counts, equal_nan, inverse_shape, axis)
    351     aux = ar[perm]
    352 else:
--> 353     ar.sort()
    354     aux = ar
    355 mask = np.empty(aux.shape, dtype=np.bool)

TypeError: '<' not supported between instances of 'str' and 'float'
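
The traceback ends in `ar.sort()` inside `np.unique`, so the failure comes from comparing a `str` with a `float` while sorting. One plausible explanation, not verified against data_farm.csv, is that the column mixes text answers with missing values stored as float NaN. The sketch below uses plain Python sorting as a stand-in for the NumPy sort; the `answers` list and the `try_sort` helper are hypothetical and only illustrate the mechanism.

```python
# Hypothetical stand-in for the survey column: free-text answers plus
# missing entries represented as float NaN (an assumption about the data).
answers = ["Yes", "yes ", float("nan"), "No", float("nan"), "Yes"]


def try_sort(values):
    """Return the sorted values, or the TypeError message if sorting fails."""
    try:
        return sorted(values)
    except TypeError as exc:
        return f"TypeError: {exc}"


# Sorting needs '<' between the items it compares; str and float have none.
mixed_result = try_sort(answers)

# Keeping only the string entries makes the values sortable again.
strings_only = [a for a in answers if isinstance(a, str)]
clean_result = try_sort(strings_only)

print(mixed_result)
print(clean_result)
```

If this is indeed the cause, filtering out the non-string entries before calling `deduplicate` might serve as a workaround, at the cost of losing the row alignment with the original column.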

Versions

System:
    python: 3.12.8 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 16:31:09) [GCC 11.2.0]
executable: /home/marie/anaconda3/envs/skore_test/bin/python
   machine: Linux-6.8.0-87-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.6.1
          pip: 24.2
   setuptools: 75.1.0
        numpy: 2.2.0
        scipy: 1.14.1
       Cython: None
       pandas: 2.2.3
   matplotlib: 3.9.3
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libscipy_openblas
       filepath: /home/marie/anaconda3/envs/skore_test/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libscipy_openblas
       filepath: /home/marie/anaconda3/envs/skore_test/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-c128ec02.so
        version: 0.3.27.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libgomp
       filepath: /home/marie/anaconda3/envs/skore_test/lib/python3.12/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
skrub: 0.6.2

Nov 10 '25 16:11 MarieSacksick

I suspect the issue may be caused by wrong parsing of the data 🤔
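
One way to check this suspicion would be to count the Python types present in the column. The following is a minimal sketch; `type_counts` and the `sample` list are hypothetical names and data, not part of skrub or of the reported dataset.

```python
from collections import Counter


def type_counts(values):
    """Count how many entries of each Python type an iterable contains."""
    return Counter(type(v).__name__ for v in values)


# Hypothetical sample standing in for the real column.
sample = ["Yes", "No", float("nan"), "yes", float("nan")]
counts = type_counts(sample)
print(counts)
```

On the real data, `type_counts(cleaned[col_name])` should work the same way, since iterating over a pandas Series yields its values. Any type other than `str` in the result would suggest that some entries were not parsed as text.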

Nov 13 '25 15:11 rcap107