pyarrow.lib.ArrowInvalid: Unable to merge: Field <field> has incompatible types

Open cyanic-selkie opened this issue 1 year ago • 6 comments

Describe the bug

When loading the wikianc-en dataset, which I created using this code, I get the following error:

Traceback (most recent call last):
  File "/home/sven/code/rector/answer-detection/train.py", line 106, in <module>
    (dataset, weights) = get_dataset(args.dataset, tokenizer, labels, args.padding)
  File "/home/sven/code/rector/answer-detection/dataset.py", line 106, in get_dataset
    dataset = load_dataset("cyanic-selkie/wikianc-en")
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/load.py", line 1794, in load_dataset
    ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1106, in as_dataset
    datasets = map_nested(
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 443, in map_nested
    mapped = [
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 444, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 346, in _single_map_nested
    return function(data_struct)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1136, in _build_single_dataset
    ds = self._as_dataset(
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1207, in _as_dataset
    dataset_kwargs = ArrowReader(cache_dir, self.info).read(
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 239, in read
    return self.read_files(files=files, original_instructions=instructions, in_memory=in_memory)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 260, in read_files
    pa_table = self._read_files(files, in_memory=in_memory)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 203, in _read_files
    pa_table = concat_tables(pa_tables) if len(pa_tables) != 1 else pa_tables[0]
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1808, in concat_tables
    return ConcatenationTable.from_tables(tables, axis=axis)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1514, in from_tables
    return cls.from_blocks(blocks)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1427, in from_blocks
    table = cls._concat_blocks(blocks, axis=0)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1373, in _concat_blocks
    return pa.concat_tables(pa_tables, promote=True)
  File "pyarrow/table.pxi", line 5224, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unable to merge: Field paragraph_anchors has incompatible types: list<: struct<start: uint32 not null, end: uint32 not null, qid: uint32, pageid: uint32, title: string not null> not null> vs list<item: struct<start: uint32, end: uint32, qid: uint32, pageid: uint32, title: string>>

This only happens when I load the train split, indicating that the size of the dataset is the deciding factor.

Steps to reproduce the bug

from datasets import load_dataset

dataset = load_dataset("cyanic-selkie/wikianc-en", split="train")

Expected behavior

The dataset should load normally without any errors.

Environment info

  • datasets version: 2.10.1
  • Platform: Linux-6.2.8-arch1-1-x86_64-with-glibc2.37
  • Python version: 3.10.10
  • PyArrow version: 11.0.0
  • Pandas version: 1.5.3

cyanic-selkie avatar Mar 31 '23 18:03 cyanic-selkie

Hi! The link pointing to the code that generated the dataset is broken. Can you please fix it to make debugging easier?

mariosasko avatar Apr 04 '23 14:04 mariosasko

> Hi! The link pointing to the code that generated the dataset is broken. Can you please fix it to make debugging easier?

Sorry about that, it's fixed now.

cyanic-selkie avatar Apr 04 '23 14:04 cyanic-selkie

@cyanic-selkie Could you explain how you fixed it? I ran into the same error when loading other datasets; is it caused by the library versions in the environment?

MingsYang avatar Sep 07 '23 10:09 MingsYang

@MingsYang I never fixed it. If you're referring to my comment above, I only meant I fixed the link to my code.

Anyway, I managed to work around the issue by using streaming when loading the dataset.

cyanic-selkie avatar Sep 07 '23 10:09 cyanic-selkie

@cyanic-selkie Hmm, I see. I just tried a newer Python environment, and it shows no errors anymore.

MingsYang avatar Sep 07 '23 11:09 MingsYang

Upgrading pyarrow to the latest version solved this problem in my case.

ThyrixYang avatar Jan 14 '24 07:01 ThyrixYang