pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483648
## Describe the bug
Following the deduplication example in CodeParrot, I receive an array-size limitation error (`ArrowCapacityError`) when deduplicating larger datasets.
## Steps to reproduce the bug
```python
from datasets import load_dataset

dataset_name = "the_pile"
ds = load_dataset(dataset_name, split="train")
ds = ds.map(preprocess, num_proc=num_workers)  # preprocess and num_workers are defined in the gists below
uniques = set(ds.unique("hash"))
```
Gists for a minimal reproducible example:
- https://gist.github.com/conceptofmind/c5804428ea1bd89767815f9cd5f02d9a
- https://gist.github.com/conceptofmind/feafb07e236f28d79c2d4b28ffbdb6e2
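For completeness, here is a minimal, hypothetical sketch of what the `preprocess` step could look like, assuming it adds a `hash` column derived from the text content as in the CodeParrot deduplication example; the actual definitions of `preprocess` and `num_workers` are in the gists above.

```python
import hashlib
import multiprocessing

def preprocess(example):
    # Hypothetical: hash the text field so exact duplicates share the same "hash" value.
    return {"hash": hashlib.md5(example["text"].encode("utf-8")).hexdigest()}

num_workers = multiprocessing.cpu_count()
```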
## Expected results
Chunking and writing out a deduplicated dataset.
## Actual results
```
return dataset._data.column(column).unique().to_pylist()
  File "pyarrow/table.pxi", line 394, in pyarrow.lib.ChunkedArray.unique
  File "pyarrow/_compute.pyx", line 531, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 330, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 124, in pyarrow.lib.check_status
pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2147483648
```
Thanks for reporting @conceptofmind.
Could you please give details about your environment?
## Environment info
<!-- You can run the command `datasets-cli env` and copy-and-paste its output below. -->
- `datasets` version:
- Platform:
- Python version:
- PyArrow version:
Hi @albertvillanova,
Here is the environment information:
- `datasets` version: 2.3.2
- Platform: Linux-5.4.0-122-generic-x86_64-with-glibc2.27
- Python version: 3.9.12
- PyArrow version: 7.0.0
- Pandas version: 1.4.2
Thanks,
Enrico
I think this issue is solved here: https://discuss.huggingface.co/t/minhash-deduplication/19992/12?u=loubnabnl. This only happens for very large datasets; we will update the CodeParrot code accordingly.
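For reference, a minimal sketch of one possible workaround in the spirit of the linked thread: build the set of hashes by slicing the dataset in batches instead of calling `Dataset.unique` on the whole column, which materializes a single Arrow array and can exceed the ~2 GiB per-array limit. The `batch_size` below is an illustrative choice, not the value used in CodeParrot.

```python
# Assumes `ds` already has a "hash" column added by the preprocessing step.
uniques = set()
batch_size = 100_000  # illustrative; tune to available memory
for start in range(0, len(ds), batch_size):
    # Slicing returns a dict of Python lists, so no single Arrow array
    # ever has to hold the entire column at once.
    uniques.update(ds[start : start + batch_size]["hash"])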
Hi @loubnabnl,
Yes, the issue is solved in the discussion thread.
I will close this issue.
Thank you again for all of your help.
Enrico
Thanks @loubnabnl for pointing out the solution to this issue.