datasets
Version mismatch with multiprocess and dill on Python 3.10
Describe the bug
Grabbing the latest version of datasets and apache-beam with poetry using Python 3.10 gives a crash at runtime. The crash is
File "/Users/adpauls/sc/git/DSI-transformers/data/NQ/create_NQ_train_vali.py", line 1, in <module>
import datasets
File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/__init__.py", line 43, in <module>
from .arrow_dataset import Dataset
File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 65, in <module>
from .arrow_reader import ArrowReader
File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 30, in <module>
from .download.download_config import DownloadConfig
File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/download/__init__.py", line 9, in <module>
from .download_manager import DownloadManager, DownloadMode
File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/download/download_manager.py", line 35, in <module>
from ..utils.py_utils import NestedDataStructure, map_nested, size_str
File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 40, in <module>
import multiprocess.pool
File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/multiprocess/pool.py", line 609, in <module>
class ThreadPool(Pool):
File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/multiprocess/pool.py", line 611, in ThreadPool
from .dummy import Process
File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/multiprocess/dummy/__init__.py", line 87, in <module>
class Condition(threading._Condition):
AttributeError: module 'threading' has no attribute '_Condition'. Did you mean: 'Condition'?
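To narrow down a traceback like the one above, it can help to print the interpreter version, whether `threading` exposes the private `_Condition` name that the failing import looks up, and the installed versions of the packages involved. The following is a minimal diagnostic sketch and not part of the original report; the helper names `installed_version` and `describe_environment` are made up for illustration.

```python
import sys
import threading
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def describe_environment() -> dict:
    """Collect the facts relevant to the AttributeError in the traceback."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        # Python 3 exposes threading.Condition; the private
        # threading._Condition name used by the failing import is absent.
        "has_threading__Condition": hasattr(threading, "_Condition"),
        "has_threading_Condition": hasattr(threading, "Condition"),
        "dill": installed_version("dill"),
        "multiprocess": installed_version("multiprocess"),
        "apache-beam": installed_version("apache-beam"),
        "datasets": installed_version("datasets"),
    }


if __name__ == "__main__":
    for name, value in describe_environment().items():
        print(f"{name}: {value}")
```

Running it inside the poetry virtualenv (for example with `poetry run python`) shows which `multiprocess` and `dill` versions the resolver actually picked.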
I think this is a bad interaction between the versions of dill, multiprocess, and apache-beam and the threading module from the Python 3.10 standard library. Upgrading multiprocess to a version that does not crash like this is not possible because apache-beam pins dill to an old version:
Because multiprocess (0.70.10) depends on dill (>=0.3.2)
and apache-beam (2.45.0) depends on dill (>=0.3.1.1,<0.3.2), multiprocess (0.70.10) is incompatible with apache-beam (2.45.0).
And because no versions of apache-beam match >2.45.0,<3.0.0, multiprocess (0.70.10) is incompatible with apache-beam (>=2.45.0,<3.0.0).
So, because yyy depends on both apache-beam (^2.45.0) and multiprocess (0.70.10), version solving failed.
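The solver output above comes down to two dill ranges that do not overlap. The sketch below reproduces that check with plain tuple comparison. It is a simplification for illustration only: it handles purely numeric versions, it is not how poetry resolves dependencies, and the candidate list is a small sample rather than a full list of dill releases.

```python
def parse(version: str, width: int = 4) -> tuple:
    """Turn '0.3.2' into (0, 3, 2, 0): numeric parts, zero-padded to `width`."""
    parts = tuple(int(part) for part in version.split("."))
    return parts + (0,) * (width - len(parts))


def beam_allows(dill_version: str) -> bool:
    """apache-beam 2.45.0: dill >=0.3.1.1,<0.3.2 (from the solver output)."""
    return parse("0.3.1.1") <= parse(dill_version) < parse("0.3.2")


def multiprocess_allows(dill_version: str) -> bool:
    """multiprocess 0.70.10: dill >=0.3.2 (from the solver output)."""
    return parse(dill_version) >= parse("0.3.2")


# Illustrative candidates only, not a complete list of dill releases.
CANDIDATES = ["0.3.1.1", "0.3.2", "0.3.3", "0.3.6"]

compatible = [v for v in CANDIDATES if beam_allows(v) and multiprocess_allows(v)]
print(compatible)  # prints [] because the two ranges share no version
```

Since apache-beam requires dill below 0.3.2 and multiprocess 0.70.10 requires dill at 0.3.2 or above, no single dill version can satisfy both.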
Perhaps it is not right to file a bug here, but I'm not totally sure whose fault it is. And in any case, this is an immediate blocker to using datasets out of the box.
Possibly related to https://github.com/huggingface/datasets/issues/5232.
Steps to reproduce the bug
Steps to reproduce:
- Make a poetry project with this configuration
[tool.poetry]
name = "yyy"
version = "0.1.0"
description = ""
authors = ["Adam Pauls <[email protected]>"]
readme = "README.md"
packages = [{ include = "xxx" }]

[tool.poetry.dependencies]
python = ">=3.10,<3.11"
datasets = "^2.10.1"
apache-beam = "^2.45.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
- poetry install.
- poetry run python -c "import datasets".
Expected behavior
Script runs.
Environment info
Python 3.10. Here are the versions installed by poetry:
• Installing frozenlist (1.3.3)
• Installing idna (3.4)
• Installing multidict (6.0.4)
• Installing aiosignal (1.3.1)
• Installing async-timeout (4.0.2)
• Installing attrs (22.2.0)
• Installing certifi (2022.12.7)
• Installing charset-normalizer (3.1.0)
• Installing six (1.16.0)
• Installing urllib3 (1.26.14)
• Installing yarl (1.8.2)
• Installing aiohttp (3.8.4)
• Installing dill (0.3.1.1)
• Installing docopt (0.6.2)
• Installing filelock (3.9.0)
• Installing numpy (1.22.4)
• Installing pyparsing (3.0.9)
• Installing protobuf (3.19.4)
• Installing packaging (23.0)
• Installing python-dateutil (2.8.2)
• Installing pytz (2022.7.1)
• Installing pyyaml (6.0)
• Installing requests (2.28.2)
• Installing tqdm (4.65.0)
• Installing typing-extensions (4.5.0)
• Installing cloudpickle (2.2.1)
• Installing crcmod (1.7)
• Installing fastavro (1.7.2)
• Installing fasteners (0.18)
• Installing fsspec (2023.3.0)
• Installing grpcio (1.51.3)
• Installing hdfs (2.7.0)
• Installing httplib2 (0.20.4)
• Installing huggingface-hub (0.12.1)
• Installing multiprocess (0.70.9)
• Installing objsize (0.6.1)
• Installing orjson (3.8.7)
• Installing pandas (1.5.3)
• Installing proto-plus (1.22.2)
• Installing pyarrow (9.0.0)
• Installing pydot (1.4.2)
• Installing pymongo (3.13.0)
• Installing regex (2022.10.31)
• Installing responses (0.18.0)
• Installing xxhash (3.2.0)
• Installing zstandard (0.20.0)
• Installing apache-beam (2.45.0)
• Installing datasets (2.10.1)
Sorry, I just found https://github.com/apache/beam/issues/24458. It seems this issue is being worked on.
Reopening, since I think the docs should inform the user of this problem. For example, this page says
Datasets is tested on Python 3.7+.
but it should probably say that Beam Datasets do not work with Python 3.10 (or link to a known issues page).
Same problem on Colab with a vanilla setup: Python 3.10.11, apache-beam 2.47.0, datasets 2.12.0.
Same problem: Python 3.10.11, apache-beam==2.47.0, datasets==2.12.0.
I have made a workaround by forcing an install of multiprocess 0.70.15 (after installing datasets and apache-beam). I can confirm that (on Python 3.10 in this colab notebook) datasets can download pre-processed Wikipedia dumps and can download non-pre-processed dumps using beam_runner="DirectRunner". I don't know if/how other beam_runners can be made compatible.
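The workaround described above could be scripted roughly as follows. This is only a sketch: the comment does not give the exact command that was used, so the plain `pip install multiprocess==0.70.15` invocation and the helper names `build_pip_command` and `force_multiprocess` are assumptions. It also does nothing about apache-beam's dill pin, so pip may still report a dependency conflict.

```python
import subprocess
import sys

# Version reported to work in the comment above.
MULTIPROCESS_VERSION = "0.70.15"


def build_pip_command(version: str = MULTIPROCESS_VERSION) -> list:
    """Build a pip invocation that pins multiprocess to one exact version."""
    # Using sys.executable targets the environment of the running interpreter.
    return [sys.executable, "-m", "pip", "install", f"multiprocess=={version}"]


def force_multiprocess(version: str = MULTIPROCESS_VERSION) -> None:
    """Run pip; raises subprocess.CalledProcessError if pip exits non-zero."""
    subprocess.run(build_pip_command(version), check=True)
```

Call `force_multiprocess()` only after datasets and apache-beam have been installed, as in the comment, and then retry `import datasets` in a fresh interpreter.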
Same problem.
python = "^3.10"
apache-beam = { extras = ["gcp"], version = "2.54.0" }
datasets = "^2.18.0"