skrub
Deduplicate fails
Describe the bug
I wanted to test the deduplicate feature on a very messy dataset for the MOOC, but the data is messy enough that deduplicate raises an error instead of running.
I have not yet dug into why this happens.
Steps/Code to Reproduce
# %%
import pandas as pd
df = pd.read_csv("data_farm.csv")
# %%
from skrub import Cleaner, deduplicate, TableReport
# %%
cleaned = Cleaner().fit_transform(df)
# %%
col_name = "22. If above is YES, Do you follow techniques to enhance water use efficiency? "
# col_name = "21. If above is YES, Do you irrigate based on crop water need or abruptly when water is available?"
# to investigate the content
TableReport(cleaned[col_name])
cleaned[col_name].value_counts()
# %%
dedup = deduplicate(cleaned[col_name])
Data used: data_farm.csv
Expected Results
No error is thrown.
The error occurs both with and without the Cleaner step.
Actual Results
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File /home/marie/Documents/mooc_skrub.py:4
2 col_name = "22. If above is YES, Do you follow techniques to enhance water use efficiency? "
3 # col_name = "21. If above is YES, Do you irrigate based on crop water need or abruptly when water is available?"
----> 4 dedup = deduplicate(cleaned[col_name])
File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/skrub/_deduplicate.py:232, in deduplicate(X, n_clusters, ngram_range, analyzer, linkage_method, n_jobs)
131 def deduplicate(
132 X,
133 *,
(...)
138 n_jobs=None,
139 ):
140 """Deduplicate categorical data by hierarchically clustering similar strings.
141
142 This works best if there are a number of underlying categories that
(...)
230 'white', 'white', 'white', 'white', 'white']
231 """
--> 232 unique_words, counts = np.unique(X, return_counts=True)
233 distance_mat = compute_ngram_distance(
234 unique_words, ngram_range=ngram_range, analyzer=analyzer
235 )
237 Z = linkage(distance_mat, method=linkage_method, optimal_ordering=True)
File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/numpy/lib/_arraysetops_impl.py:286, in unique(ar, return_index, return_inverse, return_counts, axis, equal_nan)
284 ar = np.asanyarray(ar)
285 if axis is None:
--> 286 ret = _unique1d(ar, return_index, return_inverse, return_counts,
287 equal_nan=equal_nan, inverse_shape=ar.shape, axis=None)
288 return _unpack_tuple(ret)
290 # axis was specified and not None
File ~/anaconda3/envs/skore_test/lib/python3.12/site-packages/numpy/lib/_arraysetops_impl.py:353, in _unique1d(ar, return_index, return_inverse, return_counts, equal_nan, inverse_shape, axis)
351 aux = ar[perm]
352 else:
--> 353 ar.sort()
354 aux = ar
355 mask = np.empty(aux.shape, dtype=np.bool)
TypeError: '<' not supported between instances of 'str' and 'float'
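The last frame suggests that `np.unique` is sorting an array that mixes strings and floats. Below is a minimal sketch that triggers the same kind of error without the CSV. It assumes the float values are missing entries (NaN), which I have not confirmed on `data_farm.csv`; the sample values are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for the survey column once converted to a NumPy
# object array: a few strings plus one missing value stored as a float NaN.
values = np.array(["Yes", "yes", np.nan, "No"], dtype=object)

try:
    # Same call as in skrub/_deduplicate.py line 232 of the traceback.
    np.unique(values, return_counts=True)
    error_message = None
except TypeError as exc:
    # Sorting has to compare a str with a float, which Python does not allow.
    error_message = str(exc)

print(error_message)
```

If this assumption holds, the failure happens before any string clustering starts, so it would not depend on how messy the text itself is.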
Versions
System:
python: 3.12.8 | packaged by Anaconda, Inc. | (main, Dec 11 2024, 16:31:09) [GCC 11.2.0]
executable: /home/marie/anaconda3/envs/skore_test/bin/python
machine: Linux-6.8.0-87-generic-x86_64-with-glibc2.35
Python dependencies:
sklearn: 1.6.1
pip: 24.2
setuptools: 75.1.0
numpy: 2.2.0
scipy: 1.14.1
Cython: None
pandas: 2.2.3
matplotlib: 3.9.3
joblib: 1.4.2
threadpoolctl: 3.5.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 12
prefix: libscipy_openblas
filepath: /home/marie/anaconda3/envs/skore_test/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-6bb31eeb.so
version: 0.3.28
threading_layer: pthreads
architecture: Haswell
user_api: blas
internal_api: openblas
num_threads: 12
prefix: libscipy_openblas
filepath: /home/marie/anaconda3/envs/skore_test/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-c128ec02.so
version: 0.3.27.dev
threading_layer: pthreads
architecture: Haswell
user_api: openmp
internal_api: openmp
num_threads: 12
prefix: libgomp
filepath: /home/marie/anaconda3/envs/skore_test/lib/python3.12/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None
skrub: 0.6.2
I suspect the issue may be caused by the data being parsed incorrectly 🤔
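One way to check this could be to count the Python types present in the column, then try the `np.unique` step on the non-missing values only. This is a sketch on a made-up stand-in column, assuming the non-string entries are NaN; on the real data, `col` would be `cleaned[col_name]`. The final `deduplicate` call is left commented out because I have not verified it on this dataset.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Hypothetical stand-in for cleaned[col_name]; the real column may differ.
col = pd.Series(["Yes", "yes", np.nan, "No", "Yes"], dtype=object)

# Count the Python types in the column. Anything other than 'str' would
# explain the str/float comparison error raised while sorting.
type_counts = Counter(type(value).__name__ for value in col)
print(type_counts)

# Drop missing entries so that only strings remain, then repeat the step
# that fails in the traceback.
non_missing = col.dropna()
unique_words, counts = np.unique(non_missing, return_counts=True)
print(unique_words, counts)

# Assumption: with only strings left, deduplicate may get past np.unique.
# dedup = deduplicate(non_missing)
```

Dropping the missing values changes the length of the column, so the deduplicated result would need to be aligned back on the original index before being put back into the table.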