datasets
Get an error "OverflowError: Python int too large to convert to C long" when loading a large dataset
Describe the bug
When loading a large dataset with the following code:
from datasets import load_dataset
dataset = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train')
We encountered the error "OverflowError: Python int too large to convert to C long". The traceback looks something like:
OverflowError: Python int too large to convert to C long
During handling of the above exception, another exception occurred:
OverflowError Traceback (most recent call last)
<ipython-input-7-0ed8700e662d> in <module>
----> 1 dataset = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train', cache_dir='/sfs/MNBVC/.cache/')
/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
1749 ignore_verifications=ignore_verifications,
1750 try_from_hf_gcs=try_from_hf_gcs,
-> 1751 use_auth_token=use_auth_token,
1752 )
1753
/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
703 if not downloaded_from_gcs:
704 self._download_and_prepare(
--> 705 dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
706 )
707 # Sync info
/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos)
1225
1226 def _download_and_prepare(self, dl_manager, verify_infos):
-> 1227 super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
1228
1229 def _get_examples_iterable_for_split(self, split_generator: SplitGenerator) -> ExamplesIterable:
/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
791 try:
792 # Prepare split will record examples associated to the split
--> 793 self._prepare_split(split_generator, **prepare_split_kwargs)
794 except OSError as e:
795 raise OSError(
/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/builder.py in _prepare_split(self, split_generator, check_duplicate_keys)
1219 writer.write(example, key)
1220 finally:
-> 1221 num_examples, num_bytes = writer.finalize()
1222
1223 split_generator.split_info.num_examples = num_examples
/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/arrow_writer.py in finalize(self, close_stream)
536 # Re-intializing to empty list for next batch
537 self.hkey_record = []
--> 538 self.write_examples_on_file()
539 if self.pa_writer is None:
540 if self.schema:
/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/arrow_writer.py in write_examples_on_file(self)
407 # Since current_examples contains (example, key) tuples
408 batch_examples[col] = [row[0][col] for row in self.current_examples]
--> 409 self.write_batch(batch_examples=batch_examples)
410 self.current_examples = []
411
/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/arrow_writer.py in write_batch(self, batch_examples, writer_batch_size)
506 col_try_type = try_features[col] if try_features is not None and col in try_features else None
507 typed_sequence = OptimizedTypedSequence(batch_examples[col], type=col_type, try_type=col_try_type, col=col)
--> 508 arrays.append(pa.array(typed_sequence))
509 inferred_features[col] = typed_sequence.get_inferred_type()
510 schema = inferred_features.arrow_schema if self.pa_writer is None else self.schema
/sfs/MNBVC/venv/lib64/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
/sfs/MNBVC/venv/lib64/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib._handle_arrow_array_protocol()
/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/arrow_writer.py in __arrow_array__(self, type)
180 else:
181 trying_cast_to_python_objects = True
--> 182 out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
183 # use smaller integer precisions if possible
184 if self.trying_int_optimization:
/sfs/MNBVC/venv/lib64/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
/sfs/MNBVC/venv/lib64/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
/sfs/MNBVC/venv/lib64/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
OverflowError: Python int too large to convert to C long
However, that dataset can be loaded in a streaming manner:
from datasets import load_dataset
dataset = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train', streaming=True)
for i in dataset:
    pass  # it works well
A similar issue was also reported on our dataset hub: https://huggingface.co/datasets/liwu/MNBVC/discussions/2
Steps to reproduce the bug
from datasets import load_dataset
dataset = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train')
Expected behavior
The dataset can be loaded safely.
Environment info
- datasets version: 2.4.0
- Platform: Linux-3.10.0-1160.an7.x86_64-x86_64-with-centos-7.9
- Python version: 3.6.8
- PyArrow version: 6.0.1
- Pandas version: 1.1.5
This error means that one of the int32 (Value("int32")) columns in the dataset has a value that is out of the valid (int32) range.
I'll open a PR to print the name of a problematic column to make debugging such errors easier.
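As a rough way to locate such a column yourself, you could stream the dataset (which skips the Arrow conversion, as noted further down in this thread) and compare integer fields against the int32 bounds. A minimal sketch, assuming the relevant values sit at the top level of each example (nested fields would need a recursive walk):

from datasets import load_dataset

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

# Streaming skips the Arrow writer, so oversized values can still be inspected.
ds = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train', streaming=True)
for example in ds:
    for column, value in example.items():
        if isinstance(value, int) and not (INT32_MIN <= value <= INT32_MAX):
            print(f"column {column!r} holds out-of-range value {value}")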
I am afraid int32 is not the reason for this error.
I have submitted a commit to use int64 for all ints in the dataset: https://huggingface.co/datasets/liwu/MNBVC/commit/857ac00d9eab96a6708ad6a82bd9001686042a9e
and I have updated my env to the latest datasets release:
- datasets version: 2.13.1
- Platform: macOS-13.2.1-arm64-arm-64bit
- Python version: 3.11.2
- Huggingface_hub version: 0.13.4
- PyArrow version: 11.0.0
- Pandas version: 1.5.3
But the error still exists:
Downloading and preparing dataset mnbvc/news_peoples_daily to /Users/silver/.cache/huggingface/datasets/liwu___mnbvc/news_peoples_daily/0.0.1/ee380f6309fe9b8b0d1fb14d77118f132444f22c8c4b28bf5c1645312688e051...
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 9070.40it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 2697.16it/s]
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1647, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1646 example = self.info.features.encode_example(record) if self.info.features is not None else record
-> 1647 writer.write(example, key)
1648 num_examples_progress_update += 1
File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:490, in ArrowWriter.write(self, example, key, writer_batch_size)
488 self.hkey_record = []
--> 490 self.write_examples_on_file()
File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:448, in ArrowWriter.write_examples_on_file(self)
444 batch_examples[col] = [
445 row[0][col].to_pylist()[0] if isinstance(row[0][col], (pa.Array, pa.ChunkedArray)) else row[0][col]
446 for row in self.current_examples
447 ]
--> 448 self.write_batch(batch_examples=batch_examples)
449 self.current_examples = []
File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:553, in ArrowWriter.write_batch(self, batch_examples, writer_batch_size)
552 typed_sequence = OptimizedTypedSequence(col_values, type=col_type, try_type=col_try_type, col=col)
--> 553 arrays.append(pa.array(typed_sequence))
554 inferred_features[col] = typed_sequence.get_inferred_type()
File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:236, in pyarrow.lib.array()
File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:110, in pyarrow.lib._handle_arrow_array_protocol()
File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:189, in TypedSequence.__arrow_array__(self, type)
188 trying_cast_to_python_objects = True
--> 189 out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
190 # use smaller integer precisions if possible
File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:320, in pyarrow.lib.array()
File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()
File ~/git/venv/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
OverflowError: Python int too large to convert to C long
During handling of the above exception, another exception occurred:
OverflowError Traceback (most recent call last)
File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1656, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1655 num_shards = shard_id + 1
-> 1656 num_examples, num_bytes = writer.finalize()
1657 writer.close()
File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:584, in ArrowWriter.finalize(self, close_stream)
583 self.hkey_record = []
--> 584 self.write_examples_on_file()
585 # If schema is known, infer features even if no examples were written
File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:448, in ArrowWriter.write_examples_on_file(self)
444 batch_examples[col] = [
445 row[0][col].to_pylist()[0] if isinstance(row[0][col], (pa.Array, pa.ChunkedArray)) else row[0][col]
446 for row in self.current_examples
447 ]
--> 448 self.write_batch(batch_examples=batch_examples)
449 self.current_examples = []
File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:553, in ArrowWriter.write_batch(self, batch_examples, writer_batch_size)
552 typed_sequence = OptimizedTypedSequence(col_values, type=col_type, try_type=col_try_type, col=col)
--> 553 arrays.append(pa.array(typed_sequence))
554 inferred_features[col] = typed_sequence.get_inferred_type()
File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:236, in pyarrow.lib.array()
File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:110, in pyarrow.lib._handle_arrow_array_protocol()
File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:189, in TypedSequence.__arrow_array__(self, type)
188 trying_cast_to_python_objects = True
--> 189 out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
190 # use smaller integer precisions if possible
File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:320, in pyarrow.lib.array()
File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()
File ~/git/venv/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
OverflowError: Python int too large to convert to C long
The above exception was the direct cause of the following exception:
DatasetGenerationError Traceback (most recent call last)
Cell In[2], line 1
----> 1 dataset = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train')
File ~/git/venv/lib/python3.11/site-packages/datasets/load.py:1809, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
1806 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
1808 # Download and prepare data
-> 1809 builder_instance.download_and_prepare(
1810 download_config=download_config,
1811 download_mode=download_mode,
1812 verification_mode=verification_mode,
1813 try_from_hf_gcs=try_from_hf_gcs,
1814 num_proc=num_proc,
1815 storage_options=storage_options,
1816 )
1818 # Build dataset for splits
1819 keep_in_memory = (
1820 keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
1821 )
File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:909, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
907 if num_proc is not None:
908 prepare_split_kwargs["num_proc"] = num_proc
--> 909 self._download_and_prepare(
910 dl_manager=dl_manager,
911 verification_mode=verification_mode,
912 **prepare_split_kwargs,
913 **download_and_prepare_kwargs,
914 )
915 # Sync info
916 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())
File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1670, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
1669 def _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs):
-> 1670 super()._download_and_prepare(
1671 dl_manager,
1672 verification_mode,
1673 check_duplicate_keys=verification_mode == VerificationMode.BASIC_CHECKS
1674 or verification_mode == VerificationMode.ALL_CHECKS,
1675 **prepare_splits_kwargs,
1676 )
File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1004, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
1000 split_dict.add(split_generator.split_info)
1002 try:
1003 # Prepare split will record examples associated to the split
-> 1004 self._prepare_split(split_generator, **prepare_split_kwargs)
1005 except OSError as e:
1006 raise OSError(
1007 "Cannot find data file. "
1008 + (self.manual_download_instructions or "")
1009 + "\nOriginal error:\n"
1010 + str(e)
1011 ) from None
File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1508, in GeneratorBasedBuilder._prepare_split(self, split_generator, check_duplicate_keys, file_format, num_proc, max_shard_size)
1506 job_id = 0
1507 with pbar:
-> 1508 for job_id, done, content in self._prepare_split_single(
1509 gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
1510 ):
1511 if done:
1512 result = content
File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1665, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
1663 if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
1664 e = e.__context__
-> 1665 raise DatasetGenerationError("An error occurred while generating the dataset") from e
1667 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)
DatasetGenerationError: An error occurred while generating the dataset
Besides, it works fine when I use the streamed dataset.
simhash is the problematic column - it has values such as 18329103420363166823 that are out of the int64 range. You can fix this by setting the feature type to Value("string") (it's advised to use this type for hash values in general).
Besides, it works fine when I use the streamed dataset.
Streaming yields Python dictionaries from the script without converting them to the Arrow representation, as this conversion step is not that cheap performance-wise.
I am using uint64 for simhash.
uint64 ranges up to 2^64 - 1, which is about 1.84E19.
18329103420363166823 is less than this value.
Moreover, our simhash algorithm uses 64 bits, so its values should fit in uint64.
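For reference, the bounds involved can be checked with plain Python arithmetic:

UINT64_MAX = 2**64 - 1  # 18446744073709551615
INT64_MAX = 2**63 - 1   # 9223372036854775807 (a signed 64-bit C long)
value = 18329103420363166823

print(value <= UINT64_MAX)  # True  -> fits in uint64
print(value <= INT64_MAX)   # False -> overflows a signed 64-bit integer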
You are right. I overlooked the feature type.
This is a reproducer:
import pyarrow as pa
from datasets import Value
from datasets.arrow_writer import TypedSequence

pa.array(TypedSequence([18329103420363166823], type=Value("uint64")))
pa.array([18329103420363166823]) also fails with the same error, so it seems PyArrow does not always infer the correct type as NumPy does (uint64 in this case).
I'll report this issue in the Arrow repo.
pa.array([18329103420363166823], pa.uint64()) works, so maybe we can implement a temporary fix (supporting complex input such as [{"image": pil_image, "num": uint64_value}] would be hard, though).
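A small sketch contrasting the two inference paths, based on the behavior described above (NumPy 1.x and the PyArrow versions from this thread; newer releases may infer differently):

import numpy as np
import pyarrow as pa

value = 18329103420363166823  # fits in uint64 but not in int64

# NumPy picks an unsigned 64-bit dtype for this value
print(np.array([value]).dtype)  # uint64

# PyArrow is fine once the type is given explicitly
print(pa.array([value], type=pa.uint64()).type)  # uint64

# Without an explicit type, PyArrow tries int64 and overflows
try:
    pa.array([value])
except OverflowError as exc:
    print("type inference failed:", exc)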
In the meantime, you should be able to bypass this error by returning the simhash values as NumPy scalars in the script:
import numpy as np

def _generate_examples(self, ...):
    ...
    yield {..., "simhash": np.uint64(simhash), ...}
Thank you for checking this issue in detail.
However, it seems that using np.uint64(simhash) does not work. The same issue still exists.
https://huggingface.co/datasets/liwu/MNBVC/commit/1e44f1e400b7e61052647d44c99cdae3bae9c830
Anyway, we decided to use the string type for these simhash values. Hopefully PyArrow can fix this bug soon.
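For reference, a minimal sketch of that string-based approach (the names below are illustrative, not the actual MNBVC script): declare the hash column as Value("string") in the features and yield the hash as text.

from datasets import Features, Value

# Storing the hash as text means the Arrow writer never has to fit it
# into a signed 64-bit integer. In a real loading script this would go
# into DatasetInfo(features=...) returned by _info().
features = Features({
    "text": Value("string"),
    "simhash": Value("string"),
})

def generate_examples(rows):
    # `rows` stands in for whatever the real loading script iterates over.
    for idx, (text, simhash) in enumerate(rows):
        yield idx, {"text": text, "simhash": str(simhash)}

print(dict(generate_examples([("some text", 18329103420363166823)])))
# {0: {'text': 'some text', 'simhash': '18329103420363166823'}}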
Arrow issue: https://github.com/apache/arrow/issues/36520
Maybe something is reading your training data line by line, and your training data is just one very long line, which is why it is so large. Just a guess.