ray Widening range of grpcio versions allowed.

See https://github.com/ray-project/ray/issues/27299 for context.

script.py

Testing with the following code:

#!/usr/bin/env python3
import ray
ray.init()


@ray.remote
def task(argument):
    import grpc
    import platform
    print(argument, grpc.__version__, platform.python_version())

ray.get(task.remote('Hello world'))

python/grpcio versions tested on above script

grpcio==1.43.0 python==3.10.4
grpcio==1.44.0 python==3.10.4
grpcio==1.45.0 python==3.10.4
grpcio==1.46.0 python==3.10.4
grpcio==1.47.0 python==3.10.4
grpcio==1.48.1 python==3.10.4
grpcio==1.43.0 python==3.6.13
grpcio==1.44.0 python==3.6.13
grpcio==1.45.0 python==3.6.13
grpcio==1.46.0 python==3.6.13
grpcio==1.47.0 python==3.6.13
grpcio==1.48.1 python==3.6.13
grpcio==1.43.0 python==3.7.13
grpcio==1.44.0 python==3.7.13
grpcio==1.45.0 python==3.7.13
grpcio==1.46.0 python==3.7.13
grpcio==1.47.0 python==3.7.13
grpcio==1.48.1 python==3.7.13
grpcio==1.43.0 python==3.8.13
grpcio==1.44.0 python==3.8.13
grpcio==1.45.0 python==3.8.13
grpcio==1.46.0 python==3.8.13
grpcio==1.47.0 python==3.8.13
grpcio==1.48.1 python==3.8.13
grpcio==1.43.0 python==3.9.13
grpcio==1.44.0 python==3.9.13
grpcio==1.45.0 python==3.9.13
grpcio==1.46.0 python==3.9.13
grpcio==1.47.0 python==3.9.13
grpcio==1.48.1 python==3.9.13

test code

#!/usr/bin/env bash

set -e

source /home/ray/anaconda3/etc/profile.d/conda.sh

pr_commit="0193a19226c29c9988760114d67f6ea9af99f9e7"

ray_wheels="$(aws s3 ls s3://ray-ci-artifact-pr-public/$pr_commit/tmp/artifacts/.whl/ | grep -v 'cpp' | awk '{print $4}')"
grpcio_versions="1.43 1.44 1.45 1.46 1.47 1.48.1"

for ray_wheel in $ray_wheels; do

    conda_create_cmd=$(echo $ray_wheel | sed 's/-/ /'g | awk '{print $3}' | sed 's/cp//g' | sed 's/3/3\./g' | sed 's/^/conda create -n temp python=/g')
    $conda_create_cmd --yes
    conda activate temp
    
    for grpcio_version in $grpcio_versions; do
        printf "Uninstalling\n"
        pip uninstall grpcio -y
        pip uninstall ray -y

        printf "Installing grpcio_version $grpcio_version\n"
        pip install grpcio==$grpcio_version

        printf "Installing Ray wheel $ray_wheel"
        pip install "https://ray-ci-artifact-pr-public.s3.us-west-2.amazonaws.com/$pr_commit/tmp/artifacts/.whl/$ray_wheel"

        printf "Running script on grpcio_version $grpcio_version\n"
        ./script.py
    done
done

Sep 19 '22 21:09 cadedaniel

@jjyao what is the best way to test the setup.py installation process? I need to test python<3.10 and python>=3.10, is conda envs the best way?

Sep 19 '22 21:09 cadedaniel

what is the best way to test the setup.py installation process? I need to test python<3.10 and python>=3.10, is conda envs the best way?

Yea, I would use conda for that.

Sep 19 '22 22:09 jjyao

Could you pull the master head? Want to make sure failed tests are unrelated.

Sep 20 '22 03:09 jjyao

Notes on windows tests:

The test FAILED ::test_scheduling_class_depth[ray_start_regular0] failed on windows three times, so unlikely to be a flaky timeout.
tests:test_scheduling_asan looks like it's taking longer. On the third attempt of Window 5/6, it fails due to timeout, the second attempt it takes longer than expected.

Sep 21 '22 03:09 cadedaniel

hmm, we should still set the upper bound? for core dependencies, we should only approve a version until it's proven innocent.

cc @jjyao

Sep 21 '22 17:09 cadedaniel

hmm, we should still set the upper bound? for core dependencies, we should only approve a version until it's proven innocent.

@scv119 This is not the strategy we are following due to the nature of Ray as a library (if you look at setup.py, we don't have upper bounds for most of (core) dependencies). Ray shouldn't prevent users from using the latest versions of dependencies (they may want to use that for some reason like bug fix) that might come out after Ray's release.

Sep 21 '22 17:09 jjyao

hmm this (setting upper bound) has been the practical strategy we have been following for grpcio right (and other core dependencies)? Empirically, for grpcio, not setting upper bound has caused serious problems (i.e. ray hangs), comparing to the issues where users are not able to upgrade to the latest grpcio.

Sep 21 '22 17:09 scv119

grpcio doesn't have upper bound by default as well. I think what we did in the past was forcing an upper bound when we discover an issue and lift the upper bound afterwards after the issue is fixed. (this is what we normally do for other core dependencies as well).

The strategy I mentioned is a general strategy we apply to all of our dependencies. Whether we want to make an exception for grpcio is a separate story but I don't think it should apply to all dependencies.

Personally I still feel we shouldn't put an upper bound (we may put an upper bound on the major version number not but the minor version number if they are following semantic versioning).

@richardliaw may have more insights on this.

Sep 21 '22 18:09 jjyao

not all ray dependency has the same impact radius. grpcio is a critical dependency for ray and ray core. i.e. if it breaks ray won't work, AT ALL.
grpcio has a track of record breaking things, released and yanked https://pypi.org/project/grpcio/#history
so we are really deciding taking the risk next grpcio release breaks ray, versus some user failed to use ray with latest grpcio.

that's said, i'm not comfortable follow the existing practices where we don't set grpcio upperbound, without other protection mechanism.

The baseline is setting upperbound for grpcio, and only allow new version until proved innocent.
We might brainstorm other protection mechanism, such as work with grpcio team closely, or even vender our own grpcio.

i think the safest way is to set the upperbound to the version we know that's working, and figure out a less strict protection mechansim.

Sep 21 '22 18:09 scv119

Given we haven't reached an agreement on whether to put a cap or not. Let's add a cap to merge this PR since it's still strictly better than Ray 2.0 and we can discuss long term solution later.

Sep 22 '22 15:09 jjyao

Sounds good.

There are more non-flaky Windows tests failing, taking note here so see if they repeat:

Windows [1/6] (first try):
* //python/ray/tests:test_multiprocessing (timeout)
Windows [1/6] (second try):
* //python/ray/tests:test_multiprocessing  (timeout)
* //python/ray/tests:test_asyncio  (timeout)

Windows [5/6] (first try):
* //python/ray/tests:test_queue (timeout)

Windows [5/6] (second try):
* //python/ray/tests:test_queue (timeout)
* //python/ray/tests:test_runtime_env_working_dir_3 (failure for something to be GC'd after 20 seconds)

Running things manually on a windows machine:

//python/ray/tests:test_multiprocessing test_task_to_actor_assignment fails on Windows
//python/ray/tests:test_multiprocessing test_callbacks might hang
//python/ray/tests:test_queue passes

Sep 22 '22 16:09 cadedaniel

//python/ray/tests:test_multiprocessing test_callbacks hangs for both grpcio==1.43.0 and 1.48.1 so I assume it has to do with my environment?

Is there a way I can replicate the BK environment and iterate faster? It is taking forever to wait for BK

same error message for both grpc versions

(base) C:\Users\Administrator\Downloads\ray-master\ray-master>pytest python\ray\tests\test_multiprocessing.py -k "test_callbacks" -s -v
================================================================= test session starts ==================================================================
platform win32 -- Python 3.7.6, pytest-7.1.3, pluggy-0.13.1 -- c:\programdata\anaconda3\python.exe
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('C:\\Users\\Administrator\\Downloads\\ray-master\\ray-master\\.hypothesis\\examples')
rootdir: C:\Users\Administrator\Downloads\ray-master\ray-master\python
plugins: anyio-3.6.1, hypothesis-5.4.1, arraydiff-0.3, astropy-header-0.1.2, asyncio-0.19.0, doctestplus-0.5.0, openfiles-0.4.0, remotedata-0.3.2
asyncio: mode=strict
collected 20 items / 19 deselected / 1 selected

python\ray\tests\test_multiprocessing.py::test_callbacks starting 4 processes using ray pool
Usage stats collection is enabled. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2022-09-22 19:17:11,055 INFO worker.py:1515 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
callback_queue.get
(PoolActor pid=7892) 2022-09-22 19:17:20,118    ERROR serialization.py:354 -- No module named 'test_multiprocessing'
(PoolActor pid=7892) Traceback (most recent call last):
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 352, in deserialize_objects
(PoolActor pid=7892)     obj = self._deserialize_object(data, metadata, object_ref)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 241, in _deserialize_object
(PoolActor pid=7892)     return self._deserialize_msgpack_data(data, metadata_fields)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 196, in _deserialize_msgpack_data
(PoolActor pid=7892)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 186, in _deserialize_pickle5_data
(PoolActor pid=7892)     obj = pickle.loads(in_band)
(PoolActor pid=7892) ModuleNotFoundError: No module named 'test_multiprocessing'

Sep 22 '22 19:09 cadedaniel

Thanks for working on this! Just to understand - besides the missing timeout in the tests, there were no other changes necessary to support grpc 1.48?

Sep 23 '22 14:09 h-vetinari

Thanks for working on this! Just to understand - besides the missing timeout in the tests, there were no other changes necessary to support grpc 1.48?

Yep! I believe we prohibited this version because it was causing the hang; since 1.48.0 was yanked and 1.48.1 fixes the issue, we can simply allow it.

Sep 23 '22 17:09 cadedaniel

I'm increasing the timeouts on all of the tests that I've seen become flaky on Windows because of this change.

test_reference_counting
test_multiprocessing
test_asyncio
test_queue
test_runtime_env_working_dir_3

Sep 23 '22 21:09 cadedaniel

I have a lint fix but waiting for windows builds to succeed before I push

Sep 23 '22 22:09 cadedaniel

windows tests passed

Sep 25 '22 02:09 jjyao

ray ray copied to clipboard

Widening range of grpcio versions allowed.

script.py

python/grpcio versions tested on above script

test code

ray
ray copied to clipboard