pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483648
## Describe the bug
Following the deduplication example in CodeParrot, I receive an array-size limitation error (`ArrowCapacityError`) when deduplicating larger datasets.
## Steps to reproduce the bug
```python
from datasets import load_dataset

dataset_name = "the_pile"
ds = load_dataset(dataset_name, split="train")
ds = ds.map(preprocess, num_proc=num_workers)  # preprocess and num_workers are defined in the gists below
uniques = set(ds.unique("hash"))
```
Gists for a minimal reproducible example:
- https://gist.github.com/conceptofmind/c5804428ea1bd89767815f9cd5f02d9a
- https://gist.github.com/conceptofmind/feafb07e236f28d79c2d4b28ffbdb6e2
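For completeness, here is a minimal, hypothetical sketch of what the `preprocess` step could look like, assuming it adds a `hash` column derived from the text content as in the CodeParrot deduplication example; the actual definitions of `preprocess` and `num_workers` are in the gists above.

```python
import hashlib
import multiprocessing

def preprocess(example):
    # Hypothetical: hash the text field so exact duplicates share the same "hash" value.
    return {"hash": hashlib.md5(example["text"].encode("utf-8")).hexdigest()}

num_workers = multiprocessing.cpu_count()
```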
## Expected results
Chunking and writing out a deduplicated dataset.
## Actual results
```
return dataset._data.column(column).unique().to_pylist()
  File "pyarrow/table.pxi", line 394, in pyarrow.lib.ChunkedArray.unique
  File "pyarrow/_compute.pyx", line 531, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 330, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 124, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483648
```
Thanks for reporting @conceptofmind.
Could you please give details about your environment?
## Environment info
<!-- You can run the command `datasets-cli env` and copy-and-paste its output below. -->
- `datasets` version:
- Platform:
- Python version:
- PyArrow version:
Hi @albertvillanova,
Here is the environment information:
- `datasets` version: 2.3.2
- Platform: Linux-5.4.0-122-generic-x86_64-with-glibc2.27
- Python version: 3.9.12
- PyArrow version: 7.0.0
- Pandas version: 1.4.2
Thanks,
Enrico
I think this issue is solved here: https://discuss.huggingface.co/t/minhash-deduplication/19992/12?u=loubnabnl. This only happens for very large datasets; we will update the CodeParrot code accordingly.
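For reference, a minimal sketch of one possible workaround in the spirit of the linked thread: build the set of hashes by slicing the dataset in batches instead of calling `Dataset.unique` on the whole column, which materializes a single Arrow array and can exceed the ~2 GiB per-array limit. The `batch_size` below is an illustrative choice, not the value used in CodeParrot.

```python
# Assumes `ds` already has a "hash" column added by the preprocessing step.
uniques = set()
batch_size = 100_000  # illustrative; tune to available memory
for start in range(0, len(ds), batch_size):
    # Slicing returns a dict of Python lists, so no single Arrow array
    # ever has to hold the entire column at once.
    uniques.update(ds[start : start + batch_size]["hash"])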
Hi @loubnabnl,
Yes, the issue is solved in the discussion thread.
I will close this issue.
Thank you again for all of your help.
Enrico
Thanks @loubnabnl for pointing out the solution to this issue.