datasets
datasets copied to clipboard
Filter hangs
Describe the bug
When trying to filter my custom dataset, the process hangs, regardless of the lambda function used. It appears to be an issue with the way the Images are being handled. The dataset in question is a preprocessed version of https://huggingface.co/datasets/danaaubakirova/patfig where notably, I have converted the data to the Parquet format.
Steps to reproduce the bug
from datasets import load_dataset
ds = load_dataset('lcolonn/patfig', split='test')
ds_filtered = ds.filter(lambda row: row['cpc_class'] != 'Y')
Eventually I ctrl+C and I obtain this stack trace:
>>> ds_filtered = ds.filter(lambda row: row['cpc_class'] != 'Y')
Filter: 0%| | 0/998 [00:00<?, ? examples/s]Filter: 0%| | 0/998 [00:35<?, ? examples/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/fingerprint.py", line 482, in wrapper
out = func(dataset, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3714, in filter
indices = self.map(
^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3161, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3552, in _map_single
batch = apply_function_on_filtered_inputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3421, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 6478, in get_indices_from_mask_function
num_examples = len(batch[next(iter(batch.keys()))])
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 273, in __getitem__
value = self.format(key)
^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 376, in format
return self.formatter.format_column(self.pa_table.select([key]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 443, in format_column
column = self.python_features_decoder.decode_column(column, pa_table.column_names[0])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 219, in decode_column
return self.features.decode_column(column, column_name) if self.features else column
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/features.py", line 2008, in decode_column
[decode_nested_example(self[column_name], value) if value is not None else None for value in column]
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/features.py", line 2008, in <listcomp>
[decode_nested_example(self[column_name], value) if value is not None else None for value in column]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/features.py", line 1351, in decode_nested_example
return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/image.py", line 188, in decode_example
image.load() # to avoid "Too many open files" errors
^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/PIL/ImageFile.py", line 293, in load
n, err_code = decoder.decode(b)
^^^^^^^^^^^^^^^^^
KeyboardInterrupt
Warning! This can even seem to cause some computers to crash.
Expected behavior
Should return the filtered dataset
Environment info
-
datasetsversion: 2.20.0 - Platform: Linux-6.5.0-41-generic-x86_64-with-glibc2.35
- Python version: 3.11.9
-
huggingface_hubversion: 0.24.0 - PyArrow version: 17.0.0
- Pandas version: 2.2.2
-
fsspecversion: 2024.5.0