
Version mismatch with multiprocess and dill on Python 3.10

Open · adampauls opened this issue Mar 06 '23 · 6 comments

Describe the bug

Installing the latest versions of datasets and apache-beam with Poetry on Python 3.10 gives a crash at import time. The traceback is:

File "/Users/adpauls/sc/git/DSI-transformers/data/NQ/create_NQ_train_vali.py", line 1, in <module>
    import datasets
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/__init__.py", line 43, in <module>
    from .arrow_dataset import Dataset
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 65, in <module>
    from .arrow_reader import ArrowReader
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/arrow_reader.py", line 30, in <module>
    from .download.download_config import DownloadConfig
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/download/__init__.py", line 9, in <module>
    from .download_manager import DownloadManager, DownloadMode
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/download/download_manager.py", line 35, in <module>
    from ..utils.py_utils import NestedDataStructure, map_nested, size_str
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 40, in <module>
    import multiprocess.pool
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/multiprocess/pool.py", line 609, in <module>
    class ThreadPool(Pool):
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/multiprocess/pool.py", line 611, in ThreadPool
    from .dummy import Process
  File "/Users/adpauls/Library/Caches/pypoetry/virtualenvs/yyy-oPbZ7mKM-py3.10/lib/python3.10/site-packages/multiprocess/dummy/__init__.py", line 87, in <module>
    class Condition(threading._Condition):
AttributeError: module 'threading' has no attribute '_Condition'. Did you mean: 'Condition'?
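
For what it's worth, the AttributeError can be reproduced on the same Python 3.10 interpreter without importing datasets at all; this is just a minimal check of the private attribute the old multiprocess release relies on:

# Minimal check of the attribute referenced in the traceback above.
import threading

print(hasattr(threading, "Condition"))   # True: the public class exists
print(hasattr(threading, "_Condition"))  # False on Python 3.10, hence the crash inside multiprocess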

I think this is a bad interaction between the versions of dill, multiprocess, and apache-beam and the threading module in the Python 3.10 standard library. Upgrading multiprocess to a version that does not crash like this is not possible because apache-beam pins dill to an old version:

Because multiprocess (0.70.10) depends on dill (>=0.3.2)
 and apache-beam (2.45.0) depends on dill (>=0.3.1.1,<0.3.2), multiprocess (0.70.10) is incompatible with apache-beam (2.45.0).
And because no versions of apache-beam match >2.45.0,<3.0.0, multiprocess (0.70.10) is incompatible with apache-beam (>=2.45.0,<3.0.0).
So, because yyy depends on both apache-beam (^2.45.0) and multiprocess (0.70.10), version solving failed.
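
To double-check which versions Poetry actually resolved, a small snippet like the following (using importlib.metadata from the standard library) prints the packages involved in the conflict:

# Print the resolved versions of the packages involved in the conflict.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("datasets", "apache-beam", "dill", "multiprocess"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")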

Perhaps this is not the right place to file a bug, since I'm not totally sure whose fault it is, but in any case this is an immediate blocker to using datasets out of the box.

Possibly related to https://github.com/huggingface/datasets/issues/5232.

Steps to reproduce the bug

Steps to reproduce:

  1. Make a poetry project with this configuration
    [tool.poetry]
    name = "yyy"
    version = "0.1.0"
    description = ""
    authors = ["Adam Pauls <[email protected]>"]
    readme = "README.md" 
    packages = [{ include = "xxx" }]
    
    [tool.poetry.dependencies]   
    python = ">=3.10,<3.11"
    datasets = "^2.10.1"
    apache-beam = "^2.45.0"
    
    [build-system]
    requires = ["poetry-core"]  
    build-backend = "poetry.core.masonry.api"
    
  2. poetry install.
  3. poetry run python -c "import datasets".

Expected behavior

Script runs.

Environment info

Python 3.10. Here are the versions installed by poetry:

  • Installing frozenlist (1.3.3)
  • Installing idna (3.4)
  • Installing multidict (6.0.4)
  • Installing aiosignal (1.3.1)
  • Installing async-timeout (4.0.2)
  • Installing attrs (22.2.0)
  • Installing certifi (2022.12.7)
  • Installing charset-normalizer (3.1.0)
  • Installing six (1.16.0)
  • Installing urllib3 (1.26.14)
  • Installing yarl (1.8.2)
  • Installing aiohttp (3.8.4)
  • Installing dill (0.3.1.1)
  • Installing docopt (0.6.2)
  • Installing filelock (3.9.0)
  • Installing numpy (1.22.4)
  • Installing pyparsing (3.0.9)
  • Installing protobuf (3.19.4)
  • Installing packaging (23.0)
  • Installing python-dateutil (2.8.2)
  • Installing pytz (2022.7.1)
  • Installing pyyaml (6.0)
  • Installing requests (2.28.2)
  • Installing tqdm (4.65.0)
  • Installing typing-extensions (4.5.0)
  • Installing cloudpickle (2.2.1)
  • Installing crcmod (1.7)
  • Installing fastavro (1.7.2)
  • Installing fasteners (0.18)
  • Installing fsspec (2023.3.0)
  • Installing grpcio (1.51.3)
  • Installing hdfs (2.7.0)
  • Installing httplib2 (0.20.4)
  • Installing huggingface-hub (0.12.1)
  • Installing multiprocess (0.70.9)
  • Installing objsize (0.6.1)
  • Installing orjson (3.8.7)
  • Installing pandas (1.5.3)
  • Installing proto-plus (1.22.2)
  • Installing pyarrow (9.0.0)
  • Installing pydot (1.4.2)
  • Installing pymongo (3.13.0)
  • Installing regex (2022.10.31)
  • Installing responses (0.18.0)
  • Installing xxhash (3.2.0)
  • Installing zstandard (0.20.0)
  • Installing apache-beam (2.45.0)
  • Installing datasets (2.10.1)

adampauls commented Mar 06 '23

Sorry, I just found https://github.com/apache/beam/issues/24458. It seems this issue is being worked on.

adampauls commented Mar 06 '23

Reopening, since I think the docs should inform the user of this problem. For example, this page says

Datasets is tested on Python 3.7+.

but it should probably say that Beam Datasets do not work with Python 3.10 (or link to a known issues page).

adampauls commented Mar 07 '23

Same problem on Colab using a vanilla setup running: Python 3.10.11, apache-beam 2.47.0, datasets 2.12.0

jeromemassot commented May 11 '23

Same problem: Python 3.10.11, apache-beam==2.47.0, datasets==2.12.0

sergesteban commented May 28 '23

I have made a workaround by forcing an install of multiprocess version 0.70.15 (after installing datasets and apache-beam). I can confirm that (on Python 3.10 in this colab notebook) datasets can download pre-processed Wikipedia dumps, and can download non-pre-processed dumps using beam_runner="DirectRunner". I don't know if/how other beam_runners can be made compatible.
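
For reference, a rough sketch of that workaround (the install order and the multiprocess pin are from this comment; the "wikipedia" builder config shown here is only an example, not taken from the notebook):

# Sketch of the workaround: install datasets and apache-beam first,
# then force a newer multiprocess, e.g.:
#   pip install datasets apache-beam
#   pip install --upgrade multiprocess==0.70.15
import datasets

# DirectRunner processes a non-pre-processed dump locally, as reported above.
wiki = datasets.load_dataset("wikipedia", "20220301.en", beam_runner="DirectRunner")
print(wiki)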

boyleconnor commented Sep 01 '23

Same problem.

python = "^3.10"
apache-beam = { extras = ["gcp"], version = "2.54.0" }
datasets = "^2.18.0"

axelmagn commented Apr 05 '24