Widening the range of allowed grpcio versions.
See https://github.com/ray-project/ray/issues/27299 for context.
Testing with the following code (script.py):

```python
#!/usr/bin/env python3
import ray

ray.init()


@ray.remote
def task(argument):
    # Imported inside the task so the worker process reports its own versions.
    import grpc
    import platform

    print(argument, grpc.__version__, platform.python_version())


ray.get(task.remote('Hello world'))
```
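On success the task prints its argument followed by the worker's grpcio and Python versions, e.g. (illustrative output for one of the combinations in the matrix below):

```text
Hello world 1.48.1 3.10.4
```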
python/grpcio versions tested with the above script (✓ = tested):

| grpcio | 3.6.13 | 3.7.13 | 3.8.13 | 3.9.13 | 3.10.4 |
|--------|--------|--------|--------|--------|--------|
| 1.43.0 | ✓ | ✓ | ✓ | ✓ | ✓ |
| 1.44.0 | ✓ | ✓ | ✓ | ✓ | ✓ |
| 1.45.0 | ✓ | ✓ | ✓ | ✓ | ✓ |
| 1.46.0 | ✓ | ✓ | ✓ | ✓ | ✓ |
| 1.47.0 | ✓ | ✓ | ✓ | ✓ | ✓ |
| 1.48.1 | ✓ | ✓ | ✓ | ✓ | ✓ |
Test code:

```bash
#!/usr/bin/env bash
set -e
source /home/ray/anaconda3/etc/profile.d/conda.sh

pr_commit="0193a19226c29c9988760114d67f6ea9af99f9e7"
ray_wheels="$(aws s3 ls s3://ray-ci-artifact-pr-public/$pr_commit/tmp/artifacts/.whl/ | grep -v 'cpp' | awk '{print $4}')"
grpcio_versions="1.43 1.44 1.45 1.46 1.47 1.48.1"

for ray_wheel in $ray_wheels; do
    # Derive "conda create -n temp python=X.Y" from the wheel's cpXY tag,
    # e.g. ...-cp310-... -> python=3.10.
    conda_create_cmd=$(echo $ray_wheel | sed 's/-/ /g' | awk '{print $3}' | sed 's/cp//g' | sed 's/3/3\./g' | sed 's/^/conda create -n temp python=/g')
    $conda_create_cmd --yes
    conda activate temp
    for grpcio_version in $grpcio_versions; do
        printf "Uninstalling\n"
        pip uninstall grpcio -y
        pip uninstall ray -y
        printf "Installing grpcio_version $grpcio_version\n"
        pip install grpcio==$grpcio_version
        printf "Installing Ray wheel $ray_wheel\n"
        pip install "https://ray-ci-artifact-pr-public.s3.us-west-2.amazonaws.com/$pr_commit/tmp/artifacts/.whl/$ray_wheel"
        printf "Running script on grpcio_version $grpcio_version\n"
        ./script.py
    done
done
```
@jjyao what is the best way to test the setup.py installation process? I need to test python<3.10 and python>=3.10; are conda envs the best way?

> what is the best way to test the setup.py installation process? I need to test python<3.10 and python>=3.10, is conda envs the best way?
Yea, I would use conda for that.
Could you pull the master head? Want to make sure failed tests are unrelated.
Notes on Windows tests:
- The test `test_scheduling_class_depth[ray_start_regular0]` failed on Windows three times, so it's unlikely to be a flaky timeout.
- tests:test_scheduling_asan looks like it's taking longer: on the third attempt of Windows 5/6 it failed due to timeout, and on the second attempt it took longer than expected.
hmm, we should still set the upper bound? For core dependencies, we should only approve a version once it's proven innocent.
cc @jjyao

> hmm, we should still set the upper bound? for core dependencies, we should only approve a version once it's proven innocent.
@scv119 This is not the strategy we are following, due to the nature of Ray as a library (if you look at setup.py, we don't have upper bounds for most of our (core) dependencies). Ray shouldn't prevent users from using the latest versions of dependencies that come out after Ray's release; they may want a newer version for some reason, like a bug fix.
hmm, this (setting an upper bound) has been the practical strategy we have been following for grpcio, right (and for other core dependencies)? Empirically, for grpcio, not setting an upper bound has caused serious problems (i.e., Ray hangs), compared to the issues where users are not able to upgrade to the latest grpcio.
grpcio doesn't have an upper bound by default either. I think what we did in the past was force an upper bound when we discovered an issue, and lift it after the issue was fixed (this is what we normally do for other core dependencies as well).
The strategy I mentioned is a general strategy we apply to all of our dependencies. Whether we want to make an exception for grpcio is a separate story but I don't think it should apply to all dependencies.
Personally I still feel we shouldn't put an upper bound (we may put an upper bound on the major version number, but not the minor version number, if they are following semantic versioning).
@richardliaw may have more insights on this.
- Not all Ray dependencies have the same impact radius. grpcio is a critical dependency for Ray and Ray Core, i.e., if it breaks, Ray won't work AT ALL.
- grpcio has a track record of breaking things; releases have been yanked: https://pypi.org/project/grpcio/#history
- So we are really deciding between the risk that the next grpcio release breaks Ray, versus some users failing to use Ray with the latest grpcio.

That said, I'm not comfortable following the existing practice of not setting a grpcio upper bound without some other protection mechanism.
- The baseline is setting an upper bound for grpcio and only allowing a new version once it's proven innocent.
- We might brainstorm other protection mechanisms, such as working closely with the grpcio team, or even vendoring our own grpcio.

i think the safest way is to set the upper bound to the version we know is working, and figure out a less strict protection mechanism.
Given we haven't reached an agreement on whether to put a cap or not, let's add a cap to merge this PR, since it's still strictly better than Ray 2.0, and we can discuss a long-term solution later.
Sounds good.
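For concreteness, here is a minimal sketch of what such a cap could look like in setup.py's install_requires. The bounds are illustrative (taken from the version matrix above), not necessarily the exact specifiers merged in this PR:

```python
# Hypothetical setup.py excerpt -- the bounds are illustrative, not the
# exact specifiers merged in this PR.
from setuptools import setup

setup(
    name="ray",
    install_requires=[
        # Lower bound: oldest grpcio exercised in the test matrix above.
        # "!=" exclusion: 1.48.0 was yanked after it caused Ray to hang.
        # Upper bound: newest version proven innocent so far; lifting it
        # later is a one-line change.
        "grpcio >= 1.43.0, != 1.48.0, <= 1.48.1",
    ],
)
```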
There are more non-flaky Windows tests failing; taking note here to see if they repeat:
Windows [1/6] (first try):
* //python/ray/tests:test_multiprocessing (timeout)
Windows [1/6] (second try):
* //python/ray/tests:test_multiprocessing (timeout)
* //python/ray/tests:test_asyncio (timeout)
Windows [5/6] (first try):
* //python/ray/tests:test_queue (timeout)
Windows [5/6] (second try):
* //python/ray/tests:test_queue (timeout)
* //python/ray/tests:test_runtime_env_working_dir_3 (failed waiting 20 seconds for something to be GC'd)
Running things manually on a Windows machine:
* //python/ray/tests:test_multiprocessing test_task_to_actor_assignment fails on Windows
* //python/ray/tests:test_multiprocessing test_callbacks might hang
* //python/ray/tests:test_queue passes
//python/ray/tests:test_multiprocessing test_callbacks hangs for both grpcio==1.43.0 and 1.48.1, so I assume it has to do with my environment? Is there a way I can replicate the BK environment and iterate faster? It is taking forever to wait for BK.
Same error message for both grpc versions:

```text
(base) C:\Users\Administrator\Downloads\ray-master\ray-master>pytest python\ray\tests\test_multiprocessing.py -k "test_callbacks" -s -v
================================================================= test session starts ==================================================================
platform win32 -- Python 3.7.6, pytest-7.1.3, pluggy-0.13.1 -- c:\programdata\anaconda3\python.exe
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('C:\\Users\\Administrator\\Downloads\\ray-master\\ray-master\\.hypothesis\\examples')
rootdir: C:\Users\Administrator\Downloads\ray-master\ray-master\python
plugins: anyio-3.6.1, hypothesis-5.4.1, arraydiff-0.3, astropy-header-0.1.2, asyncio-0.19.0, doctestplus-0.5.0, openfiles-0.4.0, remotedata-0.3.2
asyncio: mode=strict
collected 20 items / 19 deselected / 1 selected

python\ray\tests\test_multiprocessing.py::test_callbacks starting 4 processes using ray pool
Usage stats collection is enabled. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2022-09-22 19:17:11,055 INFO worker.py:1515 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
callback_queue.get
(PoolActor pid=7892) 2022-09-22 19:17:20,118 ERROR serialization.py:354 -- No module named 'test_multiprocessing'
(PoolActor pid=7892) Traceback (most recent call last):
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 352, in deserialize_objects
(PoolActor pid=7892)     obj = self._deserialize_object(data, metadata, object_ref)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 241, in _deserialize_object
(PoolActor pid=7892)     return self._deserialize_msgpack_data(data, metadata_fields)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 196, in _deserialize_msgpack_data
(PoolActor pid=7892)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 186, in _deserialize_pickle5_data
(PoolActor pid=7892)     obj = pickle.loads(in_band)
(PoolActor pid=7892) ModuleNotFoundError: No module named 'test_multiprocessing'
```
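A note on that traceback (my reading, not something verified in this PR): Ray pickles the callback by reference to its defining module, so the PoolActor worker must be able to import test_multiprocessing when it deserializes the task. A sketch of one way to make the module importable on the workers follows; the path is copied from the log and is an assumption about this machine:

```python
# Hypothetical workaround sketch, not a verified fix: ship the directory
# containing test_multiprocessing.py to workers so pickle can re-import
# the module during deserialization. The path is taken from the log above.
import ray

ray.init(
    runtime_env={
        "working_dir": r"C:\Users\Administrator\Downloads\ray-master"
                       r"\ray-master\python\ray\tests",
    }
)
```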
Thanks for working on this! Just to understand: besides the missing timeout in the tests, there were no other changes necessary to support grpc 1.48?

> Thanks for working on this! Just to understand - besides the missing timeout in the tests, there were no other changes necessary to support grpc 1.48?
Yep! I believe we prohibited this version because it was causing the hang; since 1.48.0 was yanked and 1.48.1 fixes the issue, we can simply allow it.
I'm increasing the timeouts on all of the tests that I've seen become flaky on Windows because of this change (a sketch of what such a bump looks like follows the list):
* test_reference_counting
* test_multiprocessing
* test_asyncio
* test_queue
* test_runtime_env_working_dir_3
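For context, a sketch of what a timeout bump typically looks like in a Bazel BUILD file. The attributes below are illustrative; Ray's actual BUILD files declare these tests through their own macros:

```python
# Hypothetical BUILD excerpt (Starlark, Python-like syntax) -- illustrative
# only, not this PR's actual diff. Bazel derives a test's default timeout
# from its `size`: "medium" allows 300s, "large" allows 900s.
py_test(
    name = "test_multiprocessing",
    srcs = ["test_multiprocessing.py"],
    size = "large",  # bumped from "medium" to give the test more headroom
)
```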
I have a lint fix but am waiting for the Windows builds to succeed before I push.
Windows tests passed.