Get an error "OverflowError: Python int too large to convert to C long" when loading a large dataset

silverriver opened this issue 2 years ago • 8 comments

Describe the bug

When loading a large dataset with the following code:

from datasets import load_dataset
dataset = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train')

We encountered the error "OverflowError: Python int too large to convert to C long". The full traceback looks like this:

OverflowError: Python int too large to convert to C long

During handling of the above exception, another exception occurred:

OverflowError                             Traceback (most recent call last)
<ipython-input-7-0ed8700e662d> in <module>
----> 1 dataset = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train', cache_dir='/sfs/MNBVC/.cache/')

/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1749         ignore_verifications=ignore_verifications,
   1750         try_from_hf_gcs=try_from_hf_gcs,
-> 1751         use_auth_token=use_auth_token,
   1752     )
   1753 

/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, **download_and_prepare_kwargs)
    703                     if not downloaded_from_gcs:
    704                         self._download_and_prepare(
--> 705                             dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    706                         )
    707                     # Sync info

/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos)
   1225 
   1226     def _download_and_prepare(self, dl_manager, verify_infos):
-> 1227         super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
   1228 
   1229     def _get_examples_iterable_for_split(self, split_generator: SplitGenerator) -> ExamplesIterable:

/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    791             try:
    792                 # Prepare split will record examples associated to the split
--> 793                 self._prepare_split(split_generator, **prepare_split_kwargs)
    794             except OSError as e:
    795                 raise OSError(

/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/builder.py in _prepare_split(self, split_generator, check_duplicate_keys)
   1219                     writer.write(example, key)
   1220             finally:
-> 1221                 num_examples, num_bytes = writer.finalize()
   1222 
   1223         split_generator.split_info.num_examples = num_examples

/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/arrow_writer.py in finalize(self, close_stream)
    536             # Re-intializing to empty list for next batch
    537             self.hkey_record = []
--> 538         self.write_examples_on_file()
    539         if self.pa_writer is None:
    540             if self.schema:

/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/arrow_writer.py in write_examples_on_file(self)
    407             # Since current_examples contains (example, key) tuples
    408             batch_examples[col] = [row[0][col] for row in self.current_examples]
--> 409         self.write_batch(batch_examples=batch_examples)
    410         self.current_examples = []
    411 

/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/arrow_writer.py in write_batch(self, batch_examples, writer_batch_size)
    506             col_try_type = try_features[col] if try_features is not None and col in try_features else None
    507             typed_sequence = OptimizedTypedSequence(batch_examples[col], type=col_type, try_type=col_try_type, col=col)
--> 508             arrays.append(pa.array(typed_sequence))
    509             inferred_features[col] = typed_sequence.get_inferred_type()
    510         schema = inferred_features.arrow_schema if self.pa_writer is None else self.schema

/sfs/MNBVC/venv/lib64/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/sfs/MNBVC/venv/lib64/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib._handle_arrow_array_protocol()

/sfs/MNBVC/venv/lib64/python3.6/site-packages/datasets/arrow_writer.py in __arrow_array__(self, type)
    180             else:
    181                 trying_cast_to_python_objects = True
--> 182                 out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
    183             # use smaller integer precisions if possible
    184             if self.trying_int_optimization:

/sfs/MNBVC/venv/lib64/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/sfs/MNBVC/venv/lib64/python3.6/site-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

/sfs/MNBVC/venv/lib64/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

OverflowError: Python int too large to convert to C long

However, that dataset can be loaded in a streaming manner:

from datasets import load_dataset
dataset = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train', streaming=True)

for i in dataset:
    pass  # it works well

A related issue was also reported on our dataset hub: https://huggingface.co/datasets/liwu/MNBVC/discussions/2

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train')

Expected behavior

The dataset can be loaded safely.

Environment info

  • datasets version: 2.4.0
  • Platform: Linux-3.10.0-1160.an7.x86_64-x86_64-with-centos-7.9
  • Python version: 3.6.8
  • PyArrow version: 6.0.1
  • Pandas version: 1.1.5

silverriver avatar Jul 05 '23 15:07 silverriver

This error means that one of the int32 (Value("int32")) columns in the dataset has a value that is out of the valid (int32) range.

I'll open a PR to print the name of a problematic column to make debugging such errors easier.
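
For illustration, here is a minimal sketch (with hypothetical values) of how a value outside the declared integer type's range surfaces this kind of error during the Arrow conversion:

import pyarrow as pa

# A value inside the int32 range converts fine...
pa.array([2_000_000_000], type=pa.int32())

# ...but a value outside it fails during conversion (the exact exception
# class may vary across PyArrow versions).
try:
    pa.array([2**40], type=pa.int32())
except (OverflowError, pa.ArrowInvalid) as e:
    print(type(e).__name__, e)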

mariosasko avatar Jul 05 '23 19:07 mariosasko

I am afraid int32 is not the reason for this error.

I have submitted a commit to use int64 for all ints in the dataset: https://huggingface.co/datasets/liwu/MNBVC/commit/857ac00d9eab96a6708ad6a82bd9001686042a9e

and I have updated my environment to the latest datasets release:

  • datasets version: 2.13.1
  • Platform: macOS-13.2.1-arm64-arm-64bit
  • Python version: 3.11.2
  • Huggingface_hub version: 0.13.4
  • PyArrow version: 11.0.0
  • Pandas version: 1.5.3

But the error still exists:

Downloading and preparing dataset mnbvc/news_peoples_daily to /Users/silver/.cache/huggingface/datasets/liwu___mnbvc/news_peoples_daily/0.0.1/ee380f6309fe9b8b0d1fb14d77118f132444f22c8c4b28bf5c1645312688e051...
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 9070.40it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 2697.16it/s]
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1647, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1646 example = self.info.features.encode_example(record) if self.info.features is not None else record
-> 1647 writer.write(example, key)
   1648 num_examples_progress_update += 1

File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:490, in ArrowWriter.write(self, example, key, writer_batch_size)
    488     self.hkey_record = []
--> 490 self.write_examples_on_file()

File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:448, in ArrowWriter.write_examples_on_file(self)
    444         batch_examples[col] = [
    445             row[0][col].to_pylist()[0] if isinstance(row[0][col], (pa.Array, pa.ChunkedArray)) else row[0][col]
    446             for row in self.current_examples
    447         ]
--> 448 self.write_batch(batch_examples=batch_examples)
    449 self.current_examples = []

File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:553, in ArrowWriter.write_batch(self, batch_examples, writer_batch_size)
    552 typed_sequence = OptimizedTypedSequence(col_values, type=col_type, try_type=col_try_type, col=col)
--> 553 arrays.append(pa.array(typed_sequence))
    554 inferred_features[col] = typed_sequence.get_inferred_type()

File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:236, in pyarrow.lib.array()

File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:110, in pyarrow.lib._handle_arrow_array_protocol()

File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:189, in TypedSequence.__arrow_array__(self, type)
    188     trying_cast_to_python_objects = True
--> 189     out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
    190 # use smaller integer precisions if possible

File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:320, in pyarrow.lib.array()

File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()

File ~/git/venv/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

OverflowError: Python int too large to convert to C long

During handling of the above exception, another exception occurred:

OverflowError                             Traceback (most recent call last)
File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1656, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1655 num_shards = shard_id + 1
-> 1656 num_examples, num_bytes = writer.finalize()
   1657 writer.close()

File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:584, in ArrowWriter.finalize(self, close_stream)
    583     self.hkey_record = []
--> 584 self.write_examples_on_file()
    585 # If schema is known, infer features even if no examples were written

File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:448, in ArrowWriter.write_examples_on_file(self)
    444         batch_examples[col] = [
    445             row[0][col].to_pylist()[0] if isinstance(row[0][col], (pa.Array, pa.ChunkedArray)) else row[0][col]
    446             for row in self.current_examples
    447         ]
--> 448 self.write_batch(batch_examples=batch_examples)
    449 self.current_examples = []

File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:553, in ArrowWriter.write_batch(self, batch_examples, writer_batch_size)
    552 typed_sequence = OptimizedTypedSequence(col_values, type=col_type, try_type=col_try_type, col=col)
--> 553 arrays.append(pa.array(typed_sequence))
    554 inferred_features[col] = typed_sequence.get_inferred_type()

File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:236, in pyarrow.lib.array()

File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:110, in pyarrow.lib._handle_arrow_array_protocol()

File ~/git/venv/lib/python3.11/site-packages/datasets/arrow_writer.py:189, in TypedSequence.__arrow_array__(self, type)
    188     trying_cast_to_python_objects = True
--> 189     out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
    190 # use smaller integer precisions if possible

File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:320, in pyarrow.lib.array()

File ~/git/venv/lib/python3.11/site-packages/pyarrow/array.pxi:39, in pyarrow.lib._sequence_to_array()

File ~/git/venv/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

OverflowError: Python int too large to convert to C long

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
Cell In[2], line 1
----> 1 dataset = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train')

File ~/git/venv/lib/python3.11/site-packages/datasets/load.py:1809, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   1806 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   1808 # Download and prepare data
-> 1809 builder_instance.download_and_prepare(
   1810     download_config=download_config,
   1811     download_mode=download_mode,
   1812     verification_mode=verification_mode,
   1813     try_from_hf_gcs=try_from_hf_gcs,
   1814     num_proc=num_proc,
   1815     storage_options=storage_options,
   1816 )
   1818 # Build dataset for splits
   1819 keep_in_memory = (
   1820     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1821 )

File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:909, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    907     if num_proc is not None:
    908         prepare_split_kwargs["num_proc"] = num_proc
--> 909     self._download_and_prepare(
    910         dl_manager=dl_manager,
    911         verification_mode=verification_mode,
    912         **prepare_split_kwargs,
    913         **download_and_prepare_kwargs,
    914     )
    915 # Sync info
    916 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1670, in GeneratorBasedBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
   1669 def _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs):
-> 1670     super()._download_and_prepare(
   1671         dl_manager,
   1672         verification_mode,
   1673         check_duplicate_keys=verification_mode == VerificationMode.BASIC_CHECKS
   1674         or verification_mode == VerificationMode.ALL_CHECKS,
   1675         **prepare_splits_kwargs,
   1676     )

File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1004, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1000 split_dict.add(split_generator.split_info)
   1002 try:
   1003     # Prepare split will record examples associated to the split
-> 1004     self._prepare_split(split_generator, **prepare_split_kwargs)
   1005 except OSError as e:
   1006     raise OSError(
   1007         "Cannot find data file. "
   1008         + (self.manual_download_instructions or "")
   1009         + "\nOriginal error:\n"
   1010         + str(e)
   1011     ) from None

File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1508, in GeneratorBasedBuilder._prepare_split(self, split_generator, check_duplicate_keys, file_format, num_proc, max_shard_size)
   1506 job_id = 0
   1507 with pbar:
-> 1508     for job_id, done, content in self._prepare_split_single(
   1509         gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1510     ):
   1511         if done:
   1512             result = content

File ~/git/venv/lib/python3.11/site-packages/datasets/builder.py:1665, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1663     if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1664         e = e.__context__
-> 1665     raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1667 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

Besides, it works fine when I am using a streamed dataset.

silverriver avatar Jul 06 '23 13:07 silverriver

simhash is the problematic column: it has values such as 18329103420363166823 that are out of the int64 range. You can fix this by setting the feature type to Value("string") (it's advised to use this type for hash values in general).
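
As a rough sketch of that fix (the text column and example values here are hypothetical), a string-typed hash column passes through Arrow without any integer range check:

from datasets import Features, Value
import pyarrow as pa

# Declare the hash column as a string in the dataset's features.
features = Features({"text": Value("string"), "simhash": Value("string")})
batch = {"text": ["example"], "simhash": [str(18329103420363166823)]}

# Converting to Arrow with this schema works, since no integer bounds apply.
print(pa.Table.from_pydict(batch, schema=features.arrow_schema))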

Besides, it works fine when I am using a streamed dataset.

Streaming yields Python dictionaries from the script without converting them to the Arrow representation, as this conversion step is not that cheap performance-wise.
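
A quick way to see this (using the same call as above): streamed examples stay plain dicts, so no Arrow conversion, and hence no integer range check, happens while iterating.

from datasets import load_dataset

# Streaming returns an IterableDataset whose examples are plain Python dicts.
ds = load_dataset("liwu/MNBVC", 'news_peoples_daily', split='train', streaming=True)
print(type(next(iter(ds))))  # <class 'dict'>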

mariosasko avatar Jul 06 '23 14:07 mariosasko

I am using uint64 for simhash.

uint64 ranges up to 2**64 - 1, which is about 1.84E19.

18329103420363166823 is less than this value.

Moreover, our simhash algorithm uses 64 bits, so it should fit in uint64.
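
A quick check of the arithmetic:

UINT64_MAX = 2**64 - 1          # 18446744073709551615, about 1.84E19
value = 18329103420363166823
print(value <= UINT64_MAX)      # True: the hash fits in uint64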

silverriver avatar Jul 06 '23 14:07 silverriver

You are right. I overlooked the feature type.

This is a reproducer:

import pyarrow as pa
from datasets import Value
from datasets.arrow_writer import TypedSequence

pa.array(TypedSequence([18329103420363166823], type=Value("uint64")))

pa.array([18329103420363166823]) also fails with the same error, so it seems PyArrow does not always infer the correct type the way NumPy does (uint64 in this case).

I'll report this issue in the Arrow repo.

pa.array([18329103420363166823], pa.uint64()) works, so maybe we can implement a temporary fix (supporting complex input such as [{"image": pil_image, "num": uint64_value}] would be hard, though).
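
For reference, a small sketch contrasting the failing inference with the explicit type and the NumPy route, based on the behavior described in this thread:

import numpy as np
import pyarrow as pa

value = 18329103420363166823

# On the PyArrow versions in this thread, inference picks int64 and overflows...
try:
    pa.array([value])
except (OverflowError, pa.ArrowInvalid) as e:
    print("inference failed:", type(e).__name__)

# ...while an explicit uint64 type, or a NumPy uint64 array, converts fine.
print(pa.array([value], type=pa.uint64()))
print(pa.array(np.array([value], dtype=np.uint64)))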

In the meantime, you should be able to bypass this error by returning the simhash values as NumPy scalars in the script:

import numpy as np

def _generate_examples(self, ...):
    ...
    yield key, {..., "simhash": np.uint64(simhash), ...}

mariosasko avatar Jul 06 '23 16:07 mariosasko

Thank you for checking this issue in detail.

However, it seems that using np.uint64(simhash) does not work. The same issue still exists.

https://huggingface.co/datasets/liwu/MNBVC/commit/1e44f1e400b7e61052647d44c99cdae3bae9c830

Anyway, we decided to use the string type for these simhash values. I hope PyArrow can fix this bug soon.
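
For downstream use, the string representation round-trips losslessly:

simhash = 18329103420363166823
stored = str(simhash)            # what goes into the Value("string") column
assert int(stored) == simhash    # lossless round trip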

silverriver avatar Jul 07 '23 10:07 silverriver

Arrow issue: https://github.com/apache/arrow/issues/36520

mariosasko avatar Jul 10 '23 19:07 mariosasko

Maybe something is reading your training data line by line, and your training data is just one line, which is very large. That is my guess.

kalspzzz avatar Feb 07 '24 22:02 kalspzzz