ray icon indicating copy to clipboard operation
ray copied to clipboard

Widening range of grpcio versions allowed.

Open cadedaniel opened this issue 2 years ago • 9 comments

See https://github.com/ray-project/ray/issues/27299 for context.

script.py

Testing with the following code:

#!/usr/bin/env python3
import ray
ray.init()


@ray.remote
def task(argument):
    import grpc
    import platform
    print(argument, grpc.__version__, platform.python_version())

ray.get(task.remote('Hello world'))

python/grpcio versions tested on above script

grpcio==1.43.0 python==3.10.4
grpcio==1.44.0 python==3.10.4
grpcio==1.45.0 python==3.10.4
grpcio==1.46.0 python==3.10.4
grpcio==1.47.0 python==3.10.4
grpcio==1.48.1 python==3.10.4
grpcio==1.43.0 python==3.6.13
grpcio==1.44.0 python==3.6.13
grpcio==1.45.0 python==3.6.13
grpcio==1.46.0 python==3.6.13
grpcio==1.47.0 python==3.6.13
grpcio==1.48.1 python==3.6.13
grpcio==1.43.0 python==3.7.13
grpcio==1.44.0 python==3.7.13
grpcio==1.45.0 python==3.7.13
grpcio==1.46.0 python==3.7.13
grpcio==1.47.0 python==3.7.13
grpcio==1.48.1 python==3.7.13
grpcio==1.43.0 python==3.8.13
grpcio==1.44.0 python==3.8.13
grpcio==1.45.0 python==3.8.13
grpcio==1.46.0 python==3.8.13
grpcio==1.47.0 python==3.8.13
grpcio==1.48.1 python==3.8.13
grpcio==1.43.0 python==3.9.13
grpcio==1.44.0 python==3.9.13
grpcio==1.45.0 python==3.9.13
grpcio==1.46.0 python==3.9.13
grpcio==1.47.0 python==3.9.13
grpcio==1.48.1 python==3.9.13

test code

#!/usr/bin/env bash

set -e

source /home/ray/anaconda3/etc/profile.d/conda.sh

pr_commit="0193a19226c29c9988760114d67f6ea9af99f9e7"

ray_wheels="$(aws s3 ls s3://ray-ci-artifact-pr-public/$pr_commit/tmp/artifacts/.whl/ | grep -v 'cpp' | awk '{print $4}')"
grpcio_versions="1.43 1.44 1.45 1.46 1.47 1.48.1"

for ray_wheel in $ray_wheels; do

    conda_create_cmd=$(echo $ray_wheel | sed 's/-/ /'g | awk '{print $3}' | sed 's/cp//g' | sed 's/3/3\./g' | sed 's/^/conda create -n temp python=/g')
    $conda_create_cmd --yes
    conda activate temp
    
    for grpcio_version in $grpcio_versions; do
        printf "Uninstalling\n"
        pip uninstall grpcio -y
        pip uninstall ray -y

        printf "Installing grpcio_version $grpcio_version\n"
        pip install grpcio==$grpcio_version

        printf "Installing Ray wheel $ray_wheel"
        pip install "https://ray-ci-artifact-pr-public.s3.us-west-2.amazonaws.com/$pr_commit/tmp/artifacts/.whl/$ray_wheel"

        printf "Running script on grpcio_version $grpcio_version\n"
        ./script.py
    done
done

cadedaniel avatar Sep 19 '22 21:09 cadedaniel

@jjyao what is the best way to test the setup.py installation process? I need to test python<3.10 and python>=3.10, is conda envs the best way?

cadedaniel avatar Sep 19 '22 21:09 cadedaniel

what is the best way to test the setup.py installation process? I need to test python<3.10 and python>=3.10, is conda envs the best way?

Yea, I would use conda for that.

jjyao avatar Sep 19 '22 22:09 jjyao

Could you pull the master head? Want to make sure failed tests are unrelated.

jjyao avatar Sep 20 '22 03:09 jjyao

Notes on windows tests:

  • The test FAILED ::test_scheduling_class_depth[ray_start_regular0] failed on windows three times, so unlikely to be a flaky timeout.
  • tests:test_scheduling_asan looks like it's taking longer. On the third attempt of Window 5/6, it fails due to timeout, the second attempt it takes longer than expected.

cadedaniel avatar Sep 21 '22 03:09 cadedaniel

hmm, we should still set the upper bound? for core dependencies, we should only approve a version until it's proven innocent.

cc @jjyao

cadedaniel avatar Sep 21 '22 17:09 cadedaniel

hmm, we should still set the upper bound? for core dependencies, we should only approve a version until it's proven innocent.

@scv119 This is not the strategy we are following due to the nature of Ray as a library (if you look at setup.py, we don't have upper bounds for most of (core) dependencies). Ray shouldn't prevent users from using the latest versions of dependencies (they may want to use that for some reason like bug fix) that might come out after Ray's release.

jjyao avatar Sep 21 '22 17:09 jjyao

hmm this (setting upper bound) has been the practical strategy we have been following for grpcio right (and other core dependencies)? Empirically, for grpcio, not setting upper bound has caused serious problems (i.e. ray hangs), comparing to the issues where users are not able to upgrade to the latest grpcio.

scv119 avatar Sep 21 '22 17:09 scv119

grpcio doesn't have upper bound by default as well. I think what we did in the past was forcing an upper bound when we discover an issue and lift the upper bound afterwards after the issue is fixed. (this is what we normally do for other core dependencies as well).

The strategy I mentioned is a general strategy we apply to all of our dependencies. Whether we want to make an exception for grpcio is a separate story but I don't think it should apply to all dependencies.

Personally I still feel we shouldn't put an upper bound (we may put an upper bound on the major version number not but the minor version number if they are following semantic versioning).

@richardliaw may have more insights on this.

jjyao avatar Sep 21 '22 18:09 jjyao

  • not all ray dependency has the same impact radius. grpcio is a critical dependency for ray and ray core. i.e. if it breaks ray won't work, AT ALL.
  • grpcio has a track of record breaking things, released and yanked https://pypi.org/project/grpcio/#history
  • so we are really deciding taking the risk next grpcio release breaks ray, versus some user failed to use ray with latest grpcio.

that's said, i'm not comfortable follow the existing practices where we don't set grpcio upperbound, without other protection mechanism.

  • The baseline is setting upperbound for grpcio, and only allow new version until proved innocent.
  • We might brainstorm other protection mechanism, such as work with grpcio team closely, or even vender our own grpcio.

i think the safest way is to set the upperbound to the version we know that's working, and figure out a less strict protection mechansim.

scv119 avatar Sep 21 '22 18:09 scv119

Given we haven't reached an agreement on whether to put a cap or not. Let's add a cap to merge this PR since it's still strictly better than Ray 2.0 and we can discuss long term solution later.

jjyao avatar Sep 22 '22 15:09 jjyao

Sounds good.

There are more non-flaky Windows tests failing, taking note here so see if they repeat:

Windows [1/6] (first try):
* //python/ray/tests:test_multiprocessing (timeout)
Windows [1/6] (second try):
* //python/ray/tests:test_multiprocessing  (timeout)
* //python/ray/tests:test_asyncio  (timeout)

Windows [5/6] (first try):
* //python/ray/tests:test_queue (timeout)

Windows [5/6] (second try):
* //python/ray/tests:test_queue (timeout)
* //python/ray/tests:test_runtime_env_working_dir_3 (failure for something to be GC'd after 20 seconds)

Running things manually on a windows machine:

//python/ray/tests:test_multiprocessing test_task_to_actor_assignment fails on Windows
//python/ray/tests:test_multiprocessing test_callbacks might hang
//python/ray/tests:test_queue passes

cadedaniel avatar Sep 22 '22 16:09 cadedaniel

//python/ray/tests:test_multiprocessing test_callbacks hangs for both grpcio==1.43.0 and 1.48.1 so I assume it has to do with my environment?

Is there a way I can replicate the BK environment and iterate faster? It is taking forever to wait for BK

same error message for both grpc versions

(base) C:\Users\Administrator\Downloads\ray-master\ray-master>pytest python\ray\tests\test_multiprocessing.py -k "test_callbacks" -s -v
================================================================= test session starts ==================================================================
platform win32 -- Python 3.7.6, pytest-7.1.3, pluggy-0.13.1 -- c:\programdata\anaconda3\python.exe
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('C:\\Users\\Administrator\\Downloads\\ray-master\\ray-master\\.hypothesis\\examples')
rootdir: C:\Users\Administrator\Downloads\ray-master\ray-master\python
plugins: anyio-3.6.1, hypothesis-5.4.1, arraydiff-0.3, astropy-header-0.1.2, asyncio-0.19.0, doctestplus-0.5.0, openfiles-0.4.0, remotedata-0.3.2
asyncio: mode=strict
collected 20 items / 19 deselected / 1 selected

python\ray\tests\test_multiprocessing.py::test_callbacks starting 4 processes using ray pool
Usage stats collection is enabled. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2022-09-22 19:17:11,055 INFO worker.py:1515 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
callback_queue.get
(PoolActor pid=7892) 2022-09-22 19:17:20,118    ERROR serialization.py:354 -- No module named 'test_multiprocessing'
(PoolActor pid=7892) Traceback (most recent call last):
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 352, in deserialize_objects
(PoolActor pid=7892)     obj = self._deserialize_object(data, metadata, object_ref)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 241, in _deserialize_object
(PoolActor pid=7892)     return self._deserialize_msgpack_data(data, metadata_fields)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 196, in _deserialize_msgpack_data
(PoolActor pid=7892)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(PoolActor pid=7892)   File "c:\programdata\anaconda3\lib\site-packages\ray\_private\serialization.py", line 186, in _deserialize_pickle5_data
(PoolActor pid=7892)     obj = pickle.loads(in_band)
(PoolActor pid=7892) ModuleNotFoundError: No module named 'test_multiprocessing'

cadedaniel avatar Sep 22 '22 19:09 cadedaniel

Thanks for working on this! Just to understand - besides the missing timeout in the tests, there were no other changes necessary to support grpc 1.48?

h-vetinari avatar Sep 23 '22 14:09 h-vetinari

Thanks for working on this! Just to understand - besides the missing timeout in the tests, there were no other changes necessary to support grpc 1.48?

Yep! I believe we prohibited this version because it was causing the hang; since 1.48.0 was yanked and 1.48.1 fixes the issue, we can simply allow it.

cadedaniel avatar Sep 23 '22 17:09 cadedaniel

I'm increasing the timeouts on all of the tests that I've seen become flaky on Windows because of this change.

test_reference_counting
test_multiprocessing
test_asyncio
test_queue
test_runtime_env_working_dir_3

cadedaniel avatar Sep 23 '22 21:09 cadedaniel

I have a lint fix but waiting for windows builds to succeed before I push

cadedaniel avatar Sep 23 '22 22:09 cadedaniel

windows tests passed

jjyao avatar Sep 25 '22 02:09 jjyao