modin TEST: CI: flaky ray windows failure: "System error: Unknown error"

It happened in the first two CI runs at a particular commit of #4881, as here: https://github.com/modin-project/modin/runs/8115897696?check_suite_focus=true

Aug 31 '22 15:08 mvashishtha

In both failures I see

Failed to create runtime environment {"envVars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}} because the Ray agent couldn't be started due to the port conflict. See `dashboard_agent.log` for more details. To solve the problem, start Ray with a hard-coded agent port. `ray start --dashboard-agent-grpc-port [port]` and make sure the port is not used by other processes.

I wonder whether #4562 is related; there we also wondered whether there was a conflict between Ray clusters.

Aug 31 '22 15:08 mvashishtha

I got another one today in test-windows (3.8, ray, modin/pandas/test/test_series.py): https://github.com/modin-project/modin/actions/runs/3141086969/jobs/5104676150#step:7:220404

Again, "failed to create runtime environment."

Partial stack trace (the whole thing seems to be extremely long because many cases failed):

================================== FAILURES ===================================
___________________________ test_to_frame[int_data] ___________________________

data = {'col1': array([94, 30, 39, 47, 50, 42, 97, 54, 87, 48, 89, 79, 56, 76, 14, 26, 67,
       79, 63, 93, 29, 35, 66, 85,...2, 67, 73, 24, 94, 66,  1,
       88, 48, 69, 25, 71, 98, 26, 88, 17, 53,  0, 60,  2, 67, 40, 36, 50,
       14]), ...}

    @pytest.mark.parametrize("data", test_data_values, ids=test_data_keys)
    def test_to_frame(data):
        modin_series, pandas_series = create_test_series(data)
>       df_equals(modin_series.to_frame(name="miao"), pandas_series.to_frame(name="miao"))

modin\pandas\test\test_series.py:213: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
modin\pandas\test\utils.py:578: in df_equals
    df1 = to_pandas(df1)
modin\utils.py:451: in to_pandas
    return modin_obj._to_pandas()
modin\logging\logger_decorator.py:128: in run_and_log
    return obj(*args, **kwargs)
modin\pandas\dataframe.py:2822: in _to_pandas
    return self._query_compiler.to_pandas()
modin\logging\logger_decorator.py:128: in run_and_log
    return obj(*args, **kwargs)
modin\core\storage_formats\pandas\query_compiler.py:277: in to_pandas
    return self._modin_frame.to_pandas()
modin\logging\logger_decorator.py:128: in run_and_log
    return obj(*args, **kwargs)
modin\core\dataframe\pandas\dataframe\dataframe.py:124: in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
modin\core\dataframe\pandas\dataframe\dataframe.py:3054: in to_pandas
    df = self._partition_mgr_cls.to_pandas(self._partitions)
modin\logging\logger_decorator.py:128: in run_and_log
    return obj(*args, **kwargs)
modin\core\dataframe\pandas\partitioning\partition_manager.py:644: in to_pandas
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
modin\core\dataframe\pandas\partitioning\partition_manager.py:644: in <listcomp>
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
modin\core\dataframe\pandas\partitioning\partition_manager.py:644: in <listcomp>
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
modin\core\dataframe\pandas\partitioning\partition.py:145: in to_pandas
    dataframe = self.get()
modin\core\execution\ray\implementations\pandas_on_ray\partitioning\partition.py:81: in get
    result = RayWrapper.materialize(self._data)
modin\core\execution\ray\common\engine_wrapper.py:92: in materialize
    return ray.get(obj_id)
C:\Miniconda3\envs\modin\lib\site-packages\ray\_private\client_mode_hook.py:105: in wrapper
    return func(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

object_refs = [ObjectRef(c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000)]

    @PublicAPI
    @client_mode_hook(auto_init=True)
    def get(
        object_refs: Union[ray.ObjectRef, Sequence[ray.ObjectRef]],
        *,
        timeout: Optional[float] = None,
    ) -> Union[Any, List[Any]]:
        """Get a remote object or a list of remote objects from the object store.
    
        This method blocks until the object corresponding to the object ref is
        available in the local object store. If this object is not in the local
        object store, it will be shipped from an object store that has it (once the
        object has been created). If object_refs is a list, then the objects
        corresponding to each object in the list will be returned.
    
        Ordering for an input list of object refs is preserved for each object
        returned. That is, if an object ref to A precedes an object ref to B in the
        input list, then A will precede B in the returned list.
    
        This method will issue a warning if it's running inside async context,
        you can use ``await object_ref`` instead of ``ray.get(object_ref)``. For
        a list of object refs, you can use ``await asyncio.gather(*object_refs)``.
    
        Args:
            object_refs: Object ref of the object to get or a list of object refs
                to get.
            timeout (Optional[float]): The maximum amount of time in seconds to
                wait before returning.
    
        Returns:
            A Python object or a list of Python objects.
    
        Raises:
            GetTimeoutError: A GetTimeoutError is raised if a timeout is set and
                the get takes longer than timeout to return.
            Exception: An exception is raised if the task that created the object
                or that created one of the objects raised an exception.
        """
        worker = global_worker
        worker.check_connected()
    
        if hasattr(worker, "core_worker") and worker.core_worker.current_actor_is_asyncio():
            global blocking_get_inside_async_warned
            if not blocking_get_inside_async_warned:
                logger.warning(
                    "Using blocking ray.get inside async actor. "
                    "This blocks the event loop. Please use `await` "
                    "on object ref with asyncio.gather if you want to "
                    "yield execution to the event loop instead."
                )
                blocking_get_inside_async_warned = True
    
        with profiling.profile("ray.get"):
            is_individual_id = isinstance(object_refs, ray.ObjectRef)
            if is_individual_id:
                object_refs = [object_refs]
    
            if not isinstance(object_refs, list):
                raise ValueError(
                    "'object_refs' must either be an object ref "
                    "or a list of object refs."
                )
    
            # TODO(ujvl): Consider how to allow user to retrieve the ready objects.
            values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
            for i, value in enumerate(values):
                if isinstance(value, RayError):
                    if isinstance(value, ray.exceptions.ObjectLostError):
                        worker.core_worker.dump_object_store_memory_usage()
                    if isinstance(value, RayTaskError):
                        raise value.as_instanceof_cause()
                    else:
>                       raise value
E                       ray.exceptions.RuntimeEnvSetupError: Failed to setup runtime environment.
E                       Failed to create runtime environment {"env_vars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}} because the Ray agent couldn't be started due to the port conflict. See `dashboard_agent.log` for more details. To solve the problem, start Ray with a hard-coded agent port. `ray start --dashboard-agent-grpc-port [port]` and make sure the port is not used by other processes.

C:\Miniconda3\envs\modin\lib\site-packages\ray\_private\worker.py:2277: RuntimeEnvSetupError
---------------------------- Captured stderr call -----------------------------
2022-09-28 08:12:28,751	INFO worker.py:1518 -- Started a local Ray instance.
(pid=) E0928 08:12:30.011000000  6400 src/core/ext/transport/chttp2/server/insecure/server_chttp2.cc:48] {"created":"@1664352750.010000000","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":873,"referenced_errors":[{"created":"@1664352750.010000000","description":"Failed to add port to server","file":"src/core/lib/iomgr/tcp_server_windows.cc","file_line":509,"referenced_errors":[{"created":"@1664352750.010000000","description":"OS Error","file":"src/core/lib/iomgr/tcp_server_windows.cc","file_line":206,"os_error":"Only one usage of each socket address (protocol/network address/port) is normally permitted.\r\n","syscall":"bind","wsa_error":10048}]}]}
(pid=) [2022-09-28 08:12:32,113 E 3772 6504] (raylet.exe) agent_manager.cc:185: Failed to create runtime environment {"env_vars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}} because the Ray agent couldn't be started due to the port conflict. See `dashboard_agent.log` for more details. To solve the problem, start Ray with a hard-coded agent port. `ray start --dashboard-agent-grpc-port [port]` and make sure the port is not used by other processes.
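
For reference, the `bind` failure in the stderr above ("Only one usage of each socket address ... is normally permitted", `wsa_error` 10048) is the ordinary address-in-use condition. A minimal standard-library sketch that reproduces it, with no Ray involved, is below. The helper name `second_bind_errno` is made up for illustration, and this only shows the OS-level symptom, not what Ray's agent does internally.

```python
import errno
import socket


def second_bind_errno(host="127.0.0.1"):
    """Bind two TCP sockets to the same address and return the errno of the
    second bind failure, or None if the second bind unexpectedly succeeds.

    Hypothetical helper for illustration; not part of Modin or Ray.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port; the first socket then owns it.
        first.bind((host, 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Binding a second socket to the same address and port should fail.
            second.bind((host, port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print("address in use:", code == errno.EADDRINUSE)
```

If this is what the agent is hitting, something on the runner already holds the port the agent wants at the moment it starts.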

Sep 28 '22 12:09 mvashishtha

@simon-mo @modin-project/modin-ray @mattip can you tell what's going wrong? Should we add some arbitrary dashboard agent port, e.g. with --dashboard-agent-grpc-port 999, to out

The complete error log from the run I posted about here is attached: 11_test-windows (3.8, ray, modinpandastesttest_series.py).txt
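
A rough sketch of the fixed-port idea, assuming we would rather ask the OS for a free port than hard-code a number. Only the `--dashboard-agent-grpc-port` flag comes from the error message; `find_free_port` and `ray_start_command` are hypothetical helpers, and how this would be wired into the CI job is not shown. There is also a small race: the port could be taken by another process between the probe and the moment Ray binds it.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port and return its number.

    Hypothetical helper; the port is released again before returning, so
    another process could still grab it afterwards.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 means "pick any free port"
        return sock.getsockname()[1]


def ray_start_command(port):
    """Build (but do not run) a `ray start` command line pinning the agent port."""
    return [
        "ray",
        "start",
        "--head",
        "--dashboard-agent-grpc-port",  # flag named in the Ray error message
        str(port),
    ]


if __name__ == "__main__":
    print(" ".join(ray_start_command(find_free_port())))
```

The command is only printed here; passing the list to `subprocess.run` would start a head node, which is not necessarily how the tests initialize Ray.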

Oct 07 '22 20:10 mvashishtha

We don't need the dashboard at all. I'll see whether --include-dashboard=False works.

Oct 07 '22 20:10 mvashishtha

dashboard-agent is not a subcomponent of the dashboard. Sorry about the confusing naming; the Ray core team is working on updating the terminology.

The agent currently cannot be disabled. This only happens when you are trying to run multiple Ray clusters on the same machine, right? I believe @iycheng has looked into this before. Yi, do you have any suggestions for properly running multiple Ray clusters for testing on one machine?

Oct 07 '22 21:10 simon-mo

@simon-mo we shouldn't be starting multiple ray clusters here. AFAICT this action should run on a fresh VM, and the only time we should initialize ray is when we run the test command.

Each run should be on a fresh VM, according to the documentation here:

If you use a GitHub-hosted runner, each job runs in a fresh instance of a runner image specified by runs-on

Oct 07 '22 21:10 mvashishtha

I am lowering the priority since the problem has not been observed for a long time.

Nov 08 '23 23:11 anmyachev