
Cannot import datasets - ValueError: pyarrow.lib.IpcWriteOptions size changed, may indicate binary incompatibility

Open ehuangc opened this issue 1 year ago • 23 comments

Describe the bug

When trying to import datasets, I get a pyarrow ValueError:

Traceback (most recent call last):
  File "/Users/edward/test/test.py", line 1, in <module>
    import datasets
  File "/Users/edward/opt/anaconda3/envs/cs235/lib/python3.9/site-packages/datasets/__init__.py", line 43, in <module>
    from .arrow_dataset import Dataset
  File "/Users/edward/opt/anaconda3/envs/cs235/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 65, in <module>
    from .arrow_reader import ArrowReader
  File "/Users/edward/opt/anaconda3/envs/cs235/lib/python3.9/site-packages/datasets/arrow_reader.py", line 28, in <module>
    import pyarrow.parquet as pq
  File "/Users/edward/opt/anaconda3/envs/cs235/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 20, in <module>
    from .core import *
  File "/Users/edward/opt/anaconda3/envs/cs235/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 45, in <module>
    from pyarrow.fs import (LocalFileSystem, FileSystem, FileType,
  File "/Users/edward/opt/anaconda3/envs/cs235/lib/python3.9/site-packages/pyarrow/fs.py", line 49, in <module>
    from pyarrow._gcsfs import GcsFileSystem  # noqa
  File "pyarrow/_gcsfs.pyx", line 1, in init pyarrow._gcsfs
ValueError: pyarrow.lib.IpcWriteOptions size changed, may indicate binary incompatibility. Expected 88 from C header, got 72 from PyObject

Steps to reproduce the bug

import datasets

Expected behavior

Successful import

Environment info

Conda environment, macOS, Python 3.9.12, datasets 2.12.0

ehuangc avatar Jun 02 '23 04:06 ehuangc

Based on https://github.com/rapidsai/cudf/issues/10187, this probably means your pyarrow installation is not compatible with datasets.

Can you please execute the following commands in the terminal and paste the output here?

conda list | grep arrow
python -c "import pyarrow; print(pyarrow.__file__)"
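
A minimal sketch of the same check from inside Python (it relies only on the standard pyarrow.__version__ and pyarrow.__file__ attributes):

import pyarrow

print(pyarrow.__version__)   # version of the Python wheel that gets imported
print(pyarrow.__file__)      # location: pip site-packages vs. the conda environment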

mariosasko avatar Jun 02 '23 16:06 mariosasko

Based on rapidsai/cudf#10187, this probably means your pyarrow installation is not compatible with datasets.

Can you please execute the following commands in the terminal and paste the output here?

conda list | grep arrow
python -c "import pyarrow; print(pyarrow.__file__)"

Here is the output of the first command:

arrow-cpp                 11.0.0           py39h7f74497_0  
pyarrow                   12.0.0                   pypi_0    pypi

and the second:

/Users/edward/opt/anaconda3/envs/cs235/lib/python3.9/site-packages/pyarrow/__init__.py

Thanks!

ehuangc avatar Jun 02 '23 19:06 ehuangc

After installing pytesseract 0.3.10, I got the above error. FYI

Joheun-Kang avatar Jun 03 '23 04:06 Joheun-Kang

RuntimeError: Failed to import transformers.trainer because of the following error (look up to see its traceback): pyarrow.lib.IpcWriteOptions size changed, may indicate binary incompatibility. Expected 88 from C header, got 72 from PyObject

Joheun-Kang avatar Jun 03 '23 04:06 Joheun-Kang

I got the same error. pyarrow 12.0.0, released in May 2023 (https://pypi.org/project/pyarrow/), is not compatible; running pip install pyarrow==11.0.0 to force-install the previous version solved the problem.

Do we need to update dependencies?
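
In the meantime, a small guard like the following (just a sketch; the 11.x/12.x numbers are the ones reported in this thread, and the packaging module is assumed to be available) fails with a clear message instead of the cryptic ValueError:

import pyarrow
from packaging.version import Version  # assumed available; installed alongside pip/setuptools in most environments

# This environment's arrow-cpp build is 11.0.0, so refuse to run with a newer
# pyarrow wheel that would trigger the ABI mismatch above.
if Version(pyarrow.__version__) >= Version("12.0.0"):
    raise RuntimeError(
        f"pyarrow {pyarrow.__version__} installed, but this environment needs 11.x; "
        "run `pip install pyarrow==11.0.0` before importing datasets"
    )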

ssydyc avatar Jun 04 '23 23:06 ssydyc

Please note that our CI properly passes all tests with pyarrow-12.0.0, for Python 3.7 and Python 3.10, for Ubuntu and Windows: see for example https://github.com/huggingface/datasets/actions/runs/5157324334/jobs/9289582291

albertvillanova avatar Jun 05 '23 05:06 albertvillanova

For conda with Python 3.8.16, this solved my problem! Thanks!

I got the same error. pyarrow 12.0.0, released in May 2023 (https://pypi.org/project/pyarrow/), is not compatible; running pip install pyarrow==11.0.0 to force-install the previous version solved the problem.

Do we need to update dependencies? I can work on that if no one else is working on it.

Joheun-Kang avatar Jun 05 '23 07:06 Joheun-Kang

Thanks for replying. I am not sure about those environments but it seems like pyarrow-12.0.0 does not work for conda with python 3.8.16.

Please note that our CI properly passes all tests with pyarrow-12.0.0, for Python 3.7 and Python 3.10, for Ubuntu and Windows: see for example https://github.com/huggingface/datasets/actions/runs/5157324334/jobs/9289582291

Joheun-Kang avatar Jun 05 '23 07:06 Joheun-Kang

Got the same error with:

arrow-cpp                 11.0.0          py310h7516544_0  
pyarrow                   12.0.0                   pypi_0    pypi

python                    3.10.11              h7a1cb2a_2  

datasets                  2.13.0             pyhd8ed1ab_0    conda-forge

lorelupo avatar Jun 19 '23 10:06 lorelupo

I got the same error. pyarrow 12.0.0, released in May 2023 (https://pypi.org/project/pyarrow/), is not compatible; running pip install pyarrow==11.0.0 to force-install the previous version solved the problem.

Do we need to update dependencies?

This solved the issue for me as well.

lorelupo avatar Jun 19 '23 10:06 lorelupo

I got the same error. pyarrow 12.0.0, released in May 2023 (https://pypi.org/project/pyarrow/), is not compatible; running pip install pyarrow==11.0.0 to force-install the previous version solved the problem.

Do we need to update dependencies?

Solved it for me also

imarquart avatar Jun 26 '23 17:06 imarquart

Based on rapidsai/cudf#10187, this probably means your pyarrow installation is not compatible with datasets.

Can you please execute the following commands in the terminal and paste the output here?

conda list | grep arrow
python -c "import pyarrow; print(pyarrow.__file__)"

arrow-cpp 11.0.0 py310h7516544_0
pyarrow 12.0.1 pypi_0 pypi

/root/miniconda3/lib/python3.10/site-packages/pyarrow/__init__.py

YY0649 avatar Jul 11 '23 11:07 YY0649

Got the same problem with

arrow-cpp 11.0.0 py310h1fc3239_0
pyarrow 12.0.1 pypi_0 pypi

miniforge3/envs/mlp/lib/python3.10/site-packages/pyarrow/__init__.py

Reverting back to pyarrow 11 solved the problem.

kimjansheden avatar Jul 23 '23 20:07 kimjansheden

Solved with pip install pyarrow==11.0.0

B8ni avatar Aug 07 '23 08:08 B8ni

I got a different fix. Solved with pip install pyarrow==12.0.1 and pip install cchardet

env: Python 3.9.16 transformers 4.32.1

wolf-li avatar Aug 31 '23 02:08 wolf-li

I got the same error. pyarrow 12.0.0, released in May 2023 (https://pypi.org/project/pyarrow/), is not compatible; running pip install pyarrow==11.0.0 to force-install the previous version solved the problem.

Do we need to update dependencies?

This works for me as well

5uryansh avatar Sep 03 '23 04:09 5uryansh

I got a different fix. Solved with pip install pyarrow==12.0.1 and pip install cchardet

env: Python 3.9.16 transformers 4.32.1

I guess it also depends on the Python version. I got Python 3.11.5 and pyarrow==12.0.0. It works!

williamLyh avatar Nov 08 '23 20:11 williamLyh

Hi, if this helps anyone, pip install pyarrow==11.0.0 did not work for me (I'm using Colab) but this worked: !pip install --extra-index-url=https://pypi.nvidia.com cudf-cu11

thierrydecae avatar Dec 13 '23 15:12 thierrydecae

Hi, if this helps anyone, pip install pyarrow==11.0.0 did not work for me (I'm using Colab) but this worked: !pip install --extra-index-url=https://pypi.nvidia.com cudf-cu11

thanks! I met the same problem and your suggestion solved it.

JerryRen471 avatar Feb 07 '24 17:02 JerryRen471

(I was doing a quiet install so I didn't notice it initially.) I've been loading the same dataset for months on Colab, and just now I got this error as well. I think Colab has changed their image recently (I had some errors regarding CUDA previously as well). Beware of this and restart the runtime if you're doing quiet pip installs. Moreover, installing the stable version of datasets from PyPI gives this:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ibis-framework 7.1.0 requires pyarrow<15,>=2, but you have pyarrow 15.0.0 which is incompatible.
Successfully installed datasets-2.17.0 dill-0.3.8 multiprocess-0.70.16 pyarrow-15.0.0
WARNING: The following packages were previously imported in this runtime:
  [pyarrow]
You must restart the runtime in order to use newly installed versions.
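
A quick way to tell, before restarting, whether a stale module is the problem (a sketch using only the standard library):

import sys

# If pyarrow was imported before the upgrade, the old binary stays loaded in this
# process; only a runtime restart picks up the newly installed version.
if "pyarrow" in sys.modules:
    print("pyarrow already loaded from:", sys.modules["pyarrow"].__file__)
    print("Restart the runtime so the new installation takes effect.")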

merveenoyan avatar Feb 10 '24 22:02 merveenoyan

For Colab: pip install pyarrow==11.0.0

rasith1998 avatar Feb 15 '24 01:02 rasith1998

The above methods didn't help me, so I installed an older version with !pip install datasets==2.16.1, and import datasets worked!

PennlaineChu avatar Feb 15 '24 06:02 PennlaineChu

@rasith1998 @PennlaineChu You can avoid this issue by restarting the session after the datasets installation (see https://github.com/huggingface/datasets/issues/6661 for more info)

Also, we've contacted Google Colab folks to update the default PyArrow installation, so the issue should soon be "officially" resolved on their side.

mariosasko avatar Feb 15 '24 17:02 mariosasko

Also, we've contacted Google Colab folks to update the default PyArrow installation, so the issue should soon be "officially" resolved on their side.

This has been done! Google Colab now pre-installs PyArrow 14.0.2, which makes this issue unlikely to happen, so I'm closing it.

mariosasko avatar Feb 25 '24 16:02 mariosasko

I am facing this issue outside of Colab, in a normal Python (3.10.14) environment:

pyarrow==11.0.0
datasets==2.20.0
transformers==4.41.2

What can I do to solve it?

I am somewhat bound to pyarrow==11.0.0. Is there a version of datasets that supports this?
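
For reference, one way to see which pyarrow range a given datasets release declares (a sketch using only importlib.metadata from the standard library):

from importlib.metadata import requires

# Print the pyarrow-related requirements declared by the installed datasets distribution.
for req in requires("datasets") or []:
    if req.startswith("pyarrow"):
        print(req)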

MinuraPunchihewa avatar Jun 27 '24 09:06 MinuraPunchihewa