[Data] Dataset.unique() raises error in case of any null values
What happened + What you expected to happen
I wanted to get the unique values in a given column of my dataset, but some of the values are null for unavoidable reasons. Calling Dataset.unique(colname) on such data raises a TypeError, with differing specifics depending on how the column dtype is specified. This behavior was surprising since the equivalent operation on a pandas.Series works just fine, as does getting unique values via Python built-ins.
Here are the two variants of the TypeError I got, both seemingly raised from the same line of code:
File ~/.pyenv/versions/3.9.18/envs/ev-detection/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
108 # Compute sorted indices of the samples. In np.lexsort last key is the
109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
111 # Sort each column by indices, and calculate q-ths quantile items.
112 # Ignore the 1st item as it's not required for the boundary
113 for k, v in sample_dict.items():
File <__array_function__ internals>:180, in lexsort(*args, **kwargs)
TypeError: '<' not supported between instances of 'NoneType' and 'int'
and
File ~/.pyenv/versions/3.9.18/envs/test-env/lib/python3.9/site-packages/ray/data/_internal/planner/exchange/sort_task_spec.py:110, in SortTaskSpec.sample_boundaries(blocks, sort_key, num_reducers)
107 sample_dict = BlockAccessor.for_block(samples).to_numpy(columns=columns)
108 # Compute sorted indices of the samples. In np.lexsort last key is the
109 # primary key hence have to reverse the order.
--> 110 indices = np.lexsort(list(reversed(list(sample_dict.values()))))
111 # Sort each column by indices, and calculate q-ths quantile items.
112 # Ignore the 1st item as it's not required for the boundary
113 for k, v in sample_dict.items():
File <__array_function__ internals>:180, in lexsort(*args, **kwargs)
File missing.pyx:419, in pandas._libs.missing.NAType.__bool__()
TypeError: boolean value of NA is ambiguous
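Both tracebacks end in a sort over the sampled values, so the failure looks like the usual problem of ordering nulls against non-null values. The sketch below reproduces that comparison failure with Python built-ins only. It does not call Ray or NumPy, and the helper names `plain_sort_fails` and `sort_nulls_last` are hypothetical, introduced just for illustration.

```python
from typing import Any, List, Sequence


def plain_sort_fails(values: Sequence[Any]) -> bool:
    """Return True if sorting the values directly raises a TypeError."""
    try:
        sorted(values)
    except TypeError:
        # None cannot be ordered against an int with '<'.
        return True
    return False


def sort_nulls_last(values: Sequence[Any]) -> List[Any]:
    """Sort values while placing None entries at the end.

    The key is a (is_null, value) tuple: the boolean decides the order
    whenever exactly one side is None, so None is never compared to an int.
    """
    return sorted(values, key=lambda v: (v is None, v))


items = [1, 2, 3, 2, 3, None]
fails = plain_sort_fails(items)
ordered = sort_nulls_last(items)
```

A null-aware sort key like this is one way a sort-based implementation could tolerate missing values. Whether that approach fits Ray's internals is an assumption on my part.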
Versions / Dependencies
macOS 14.1, Python 3.9, ray == 2.9.0, pandas == 2.1.0
Reproduction script
import pandas as pd
import ray.data
items = [1, 2, 3, 2, 3, None]
# set(items) works fine, as expected
ds1 = ray.data.from_items(items)
ds1.unique("item")
# raises TypeError: '<' not supported between instances of 'NoneType' and 'int'
df = pd.DataFrame({"col": [1, 2, 3, None]}, dtype="Int64")
# df["col"].unique() works fine, as expected
ds2 = ray.data.from_pandas(df)
ds2.unique("col")
# raises TypeError: boolean value of NA is ambiguous
Issue Severity
Medium: It is a significant difficulty but I can work around it.
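For reference, here is a minimal sketch of one possible workaround: collect the distinct values with a Python set instead of relying on a sort. The helper `unique_from_rows` is hypothetical and not part of Ray. It only assumes it is handed an iterable of dict-like rows, and plain dicts stand in for dataset rows here so the snippet runs without Ray installed.

```python
from typing import Dict, Hashable, Iterable, Set


def unique_from_rows(rows: Iterable[Dict[str, Hashable]], column: str) -> Set[Hashable]:
    """Collect the distinct values of one column, including None.

    Set membership relies on hashing and equality rather than ordering,
    so a null value does not trigger a '<' comparison.
    """
    return {row[column] for row in rows}


# Plain dicts standing in for dataset rows (an assumption for this sketch).
rows = [{"item": value} for value in [1, 2, 3, 2, 3, None]]
result = unique_from_rows(rows, "item")
```

With a real dataset, something like `unique_from_rows(ds1.iter_rows(), "item")` should behave the same way, assuming `Dataset.iter_rows()` yields dict-like rows. Note that this streams every row through the driver process, so it is only reasonable for datasets of modest size.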
Hello burton, I'd like to work on this issue! TIA.
hi @Akshi22, don't let me get in your way! though it looks like @ujjawal-khare-27 has already submitted a PR to fix this issue. maybe you can help there?
For what it's worth, I just ran into this issue again, only this time in the context of Dataset.groupby(col). It's the same error message, and presumably the same code under the hood. Just a bummer.
Hi, is this issue still open? If so, I'd like to get started contributing to Ray.io!