pyarrow.lib.ArrowInvalid: Unable to merge: Field <field> has incompatible types

Open cyanic-selkie opened this issue 1 year ago • 6 comments

Describe the bug

When loading the wikianc-en dataset, which I created using this code, I get the following error:

Traceback (most recent call last):
  File "/home/sven/code/rector/answer-detection/train.py", line 106, in <module>
    (dataset, weights) = get_dataset(args.dataset, tokenizer, labels, args.padding)
  File "/home/sven/code/rector/answer-detection/dataset.py", line 106, in get_dataset
    dataset = load_dataset("cyanic-selkie/wikianc-en")
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/load.py", line 1794, in load_dataset
    ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1106, in as_dataset
    datasets = map_nested(
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 443, in map_nested
    mapped = [
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 444, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 346, in _single_map_nested
    return function(data_struct)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1136, in _build_single_dataset
    ds = self._as_dataset(
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1207, in _as_dataset
    dataset_kwargs = ArrowReader(cache_dir, self.info).read(
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 239, in read
    return self.read_files(files=files, original_instructions=instructions, in_memory=in_memory)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 260, in read_files
    pa_table = self._read_files(files, in_memory=in_memory)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 203, in _read_files
    pa_table = concat_tables(pa_tables) if len(pa_tables) != 1 else pa_tables[0]
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1808, in concat_tables
    return ConcatenationTable.from_tables(tables, axis=axis)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1514, in from_tables
    return cls.from_blocks(blocks)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1427, in from_blocks
    table = cls._concat_blocks(blocks, axis=0)
  File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1373, in _concat_blocks
    return pa.concat_tables(pa_tables, promote=True)
  File "pyarrow/table.pxi", line 5224, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unable to merge: Field paragraph_anchors has incompatible types: list<: struct<start: uint32 not null, end: uint32 not null, qid: uint32, pageid: uint32, title: string not null> not null> vs list<item: struct<start: uint32, end: uint32, qid: uint32, pageid: uint32, title: string>>

This only happens when I load the train split, indicating that the size of the dataset is the deciding factor.

Steps to reproduce the bug

from datasets import load_dataset

dataset = load_dataset("cyanic-selkie/wikianc-en", split="train")

Expected behavior

The dataset should load normally without any errors.

Environment info

  • datasets version: 2.10.1
  • Platform: Linux-6.2.8-arch1-1-x86_64-with-glibc2.37
  • Python version: 3.10.10
  • PyArrow version: 11.0.0
  • Pandas version: 1.5.3

cyanic-selkie avatar Mar 31 '23 18:03 cyanic-selkie

Hi! The link pointing to the code that generated the dataset is broken. Can you please fix it to make debugging easier?

mariosasko avatar Apr 04 '23 14:04 mariosasko

> Hi! The link pointing to the code that generated the dataset is broken. Can you please fix it to make debugging easier?

Sorry about that, it's fixed now.

cyanic-selkie avatar Apr 04 '23 14:04 cyanic-selkie

@cyanic-selkie Could you explain how you fixed it? I ran into the same error when loading other datasets; is it caused by the library versions in the environment?

MingsYang avatar Sep 07 '23 10:09 MingsYang

@MingsYang I never fixed it. If you're referring to my comment above, I only meant I fixed the link to my code.

Anyway, I managed to work around the issue by using streaming when loading the dataset.

cyanic-selkie avatar Sep 07 '23 10:09 cyanic-selkie

@cyanic-selkie Hmm, I see. I just tried a newer Python environment, and it shows no errors anymore.

MingsYang avatar Sep 07 '23 11:09 MingsYang

Upgrading pyarrow to the latest version solved this problem in my case.

ThyrixYang avatar Jan 14 '24 07:01 ThyrixYang