datasets
datasets copied to clipboard
pyarrow.lib.ArrowInvalid: Unable to merge: Field <field> has incompatible types
Describe the bug
When loading the dataset wikianc-en which I created using this code, I get the following error:
Traceback (most recent call last):
File "/home/sven/code/rector/answer-detection/train.py", line 106, in <module>
(dataset, weights) = get_dataset(args.dataset, tokenizer, labels, args.padding)
File "/home/sven/code/rector/answer-detection/dataset.py", line 106, in get_dataset
dataset = load_dataset("cyanic-selkie/wikianc-en")
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/load.py", line 1794, in load_dataset
ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1106, in as_dataset
datasets = map_nested(
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 443, in map_nested
mapped = [
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 444, in <listcomp>
_single_map_nested((function, obj, types, None, True, None))
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 346, in _single_map_nested
return function(data_struct)
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1136, in _build_single_dataset
ds = self._as_dataset(
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/builder.py", line 1207, in _as_dataset
dataset_kwargs = ArrowReader(cache_dir, self.info).read(
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 239, in read
return self.read_files(files=files, original_instructions=instructions, in_memory=in_memory)
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 260, in read_files
pa_table = self._read_files(files, in_memory=in_memory)
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 203, in _read_files
pa_table = concat_tables(pa_tables) if len(pa_tables) != 1 else pa_tables[0]
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1808, in concat_tables
return ConcatenationTable.from_tables(tables, axis=axis)
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1514, in from_tables
return cls.from_blocks(blocks)
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1427, in from_blocks
table = cls._concat_blocks(blocks, axis=0)
File "/home/sven/.cache/pypoetry/virtualenvs/rector-Z2mdKRnn-py3.10/lib/python3.10/site-packages/datasets/table.py", line 1373, in _concat_blocks
return pa.concat_tables(pa_tables, promote=True)
File "pyarrow/table.pxi", line 5224, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unable to merge: Field paragraph_anchors has incompatible types: list<: struct<start: uint32 not null, end: uint32 not null, qid: uint32, pageid: uint32, title: string not null> not null> vs list<item: struct<start: uint32, end: uint32, qid: uint32, pageid: uint32, title: string>>
This only happens when I load the train
split, indicating that the size of the dataset is the deciding factor.
Steps to reproduce the bug
from datasets import load_dataset
dataset = load_dataset("cyanic-selkie/wikianc-en", split="train")
Expected behavior
The dataset should load normally without any errors.
Environment info
-
datasets
version: 2.10.1 - Platform: Linux-6.2.8-arch1-1-x86_64-with-glibc2.37
- Python version: 3.10.10
- PyArrow version: 11.0.0
- Pandas version: 1.5.3
Hi! The link pointing to the code that generated the dataset is broken. Can you please fix it to make debugging easier?
Hi! The link pointing to the code that generated the dataset is broken. Can you please fix it to make debugging easier?
Sorry about that, it's fixed now.
@cyanic-selkie could you explain how you fixed it? I met the same error in loading other datasets, is it due to the version of the library enviroment?
@MingsYang I never fixed it. If you're referring to my comment above, I only meant I fixed the link to my code.
Anyway, I managed to work around the issue by using streaming
when loading the dataset.
@cyanic-selkie Emm, I get it. I just tried to use a new version python enviroment, and it show no errors anymore.
Upgrade pyarrow to the latest version solves this problem in my case.