TEST: CI: flaky ray windows failure: "System error: Unknown error"
It happened in the first two CI runs at a particular commit of #4881, as here: https://github.com/modin-project/modin/runs/8115897696?check_suite_focus=true
In both failures I see
Failed to create runtime environment {"envVars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}} because the Ray agent couldn't be started due to the port conflict. See `dashboard_agent.log` for more details. To solve the problem, start Ray with a hard-coded agent port. `ray start --dashboard-agent-grpc-port [port]` and make sure the port is not used by other processes.
I wonder whether #4562 is related; there we also wondered whether there was a conflict between Ray clusters.
I got another one today in test-windows (3.8, ray, modin/pandas/test/test_series.py): https://github.com/modin-project/modin/actions/runs/3141086969/jobs/5104676150#step:7:220404
again "failed to create runtime environment."
partial stack trace (whole thing seems to be extremely long because many cases failed)
================================== FAILURES ===================================
___________________________ test_to_frame[int_data] ___________________________
data = {'col1': array([94, 30, 39, 47, 50, 42, 97, 54, 87, 48, 89, 79, 56, 76, 14, 26, 67,
79, 63, 93, 29, 35, 66, 85,...2, 67, 73, 24, 94, 66, 1,
88, 48, 69, 25, 71, 98, 26, 88, 17, 53, 0, 60, 2, 67, 40, 36, 50,
14]), ...}
@pytest.mark.parametrize("data", test_data_values, ids=test_data_keys)
def test_to_frame(data):
modin_series, pandas_series = create_test_series(data)
> df_equals(modin_series.to_frame(name="miao"), pandas_series.to_frame(name="miao"))
modin\pandas\test\test_series.py:213:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
modin\pandas\test\utils.py:578: in df_equals
df1 = to_pandas(df1)
modin\utils.py:451: in to_pandas
return modin_obj._to_pandas()
modin\logging\logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin\pandas\dataframe.py:2822: in _to_pandas
return self._query_compiler.to_pandas()
modin\logging\logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin\core\storage_formats\pandas\query_compiler.py:277: in to_pandas
return self._modin_frame.to_pandas()
modin\logging\logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin\core\dataframe\pandas\dataframe\dataframe.py:124: in run_f_on_minimally_updated_metadata
result = f(self, *args, **kwargs)
modin\core\dataframe\pandas\dataframe\dataframe.py:3054: in to_pandas
df = self._partition_mgr_cls.to_pandas(self._partitions)
modin\logging\logger_decorator.py:128: in run_and_log
return obj(*args, **kwargs)
modin\core\dataframe\pandas\partitioning\partition_manager.py:644: in to_pandas
retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
modin\core\dataframe\pandas\partitioning\partition_manager.py:644: in <listcomp>
retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
modin\core\dataframe\pandas\partitioning\partition_manager.py:644: in <listcomp>
retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
modin\core\dataframe\pandas\partitioning\partition.py:145: in to_pandas
dataframe = self.get()
modin\core\execution\ray\implementations\pandas_on_ray\partitioning\partition.py:81: in get
result = RayWrapper.materialize(self._data)
modin\core\execution\ray\common\engine_wrapper.py:92: in materialize
return ray.get(obj_id)
C:\Miniconda3\envs\modin\lib\site-packages\ray\_private\client_mode_hook.py:105: in wrapper
return func(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
object_refs = [ObjectRef(c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000)]
@PublicAPI
@client_mode_hook(auto_init=True)
def get(
object_refs: Union[ray.ObjectRef, Sequence[ray.ObjectRef]],
*,
timeout: Optional[float] = None,
) -> Union[Any, List[Any]]:
"""Get a remote object or a list of remote objects from the object store.
This method blocks until the object corresponding to the object ref is
available in the local object store. If this object is not in the local
object store, it will be shipped from an object store that has it (once the
object has been created). If object_refs is a list, then the objects
corresponding to each object in the list will be returned.
Ordering for an input list of object refs is preserved for each object
returned. That is, if an object ref to A precedes an object ref to B in the
input list, then A will precede B in the returned list.
This method will issue a warning if it's running inside async context,
you can use ``await object_ref`` instead of ``ray.get(object_ref)``. For
a list of object refs, you can use ``await asyncio.gather(*object_refs)``.
Args:
object_refs: Object ref of the object to get or a list of object refs
to get.
timeout (Optional[float]): The maximum amount of time in seconds to
wait before returning.
Returns:
A Python object or a list of Python objects.
Raises:
GetTimeoutError: A GetTimeoutError is raised if a timeout is set and
the get takes longer than timeout to return.
Exception: An exception is raised if the task that created the object
or that created one of the objects raised an exception.
"""
worker = global_worker
worker.check_connected()
if hasattr(worker, "core_worker") and worker.core_worker.current_actor_is_asyncio():
global blocking_get_inside_async_warned
if not blocking_get_inside_async_warned:
logger.warning(
"Using blocking ray.get inside async actor. "
"This blocks the event loop. Please use `await` "
"on object ref with asyncio.gather if you want to "
"yield execution to the event loop instead."
)
blocking_get_inside_async_warned = True
with profiling.profile("ray.get"):
is_individual_id = isinstance(object_refs, ray.ObjectRef)
if is_individual_id:
object_refs = [object_refs]
if not isinstance(object_refs, list):
raise ValueError(
"'object_refs' must either be an object ref "
"or a list of object refs."
)
# TODO(ujvl): Consider how to allow user to retrieve the ready objects.
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
for i, value in enumerate(values):
if isinstance(value, RayError):
if isinstance(value, ray.exceptions.ObjectLostError):
worker.core_worker.dump_object_store_memory_usage()
if isinstance(value, RayTaskError):
raise value.as_instanceof_cause()
else:
> raise value
E ray.exceptions.RuntimeEnvSetupError: Failed to setup runtime environment.
E Failed to create runtime environment {"env_vars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}} because the Ray agent couldn't be started due to the port conflict. See `dashboard_agent.log` for more details. To solve the problem, start Ray with a hard-coded agent port. `ray start --dashboard-agent-grpc-port [port]` and make sure the port is not used by other processes.
C:\Miniconda3\envs\modin\lib\site-packages\ray\_private\worker.py:2277: RuntimeEnvSetupError
---------------------------- Captured stderr call -----------------------------
2022-09-28 08:12:28,751 INFO worker.py:1518 -- Started a local Ray instance.
(pid=) E0928 08:12:30.011000000 6400 src/core/ext/transport/chttp2/server/insecure/server_chttp2.cc:48] {"created":"@1664352750.010000000","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":873,"referenced_errors":[{"created":"@1664352750.010000000","description":"Failed to add port to server","file":"src/core/lib/iomgr/tcp_server_windows.cc","file_line":509,"referenced_errors":[{"created":"@1664352750.010000000","description":"OS Error","file":"src/core/lib/iomgr/tcp_server_windows.cc","file_line":206,"os_error":"Only one usage of each socket address (protocol/network address/port) is normally permitted.\r\n","syscall":"bind","wsa_error":10048}]}]}
(pid=) [2022-09-28 08:12:32,113 E 3772 6504] (raylet.exe) agent_manager.cc:185: Failed to create runtime environment {"env_vars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}} because the Ray agent couldn't be started due to the port conflict. See `dashboard_agent.log` for more details. To solve the problem, start Ray with a hard-coded agent port. `ray start --dashboard-agent-grpc-port [port]` and make sure the port is not used by other processes.
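For reference, the `bind` failure in the captured stderr (`wsa_error` 10048, "Only one usage of each socket address ... is normally permitted") is the generic "address already in use" condition. The sketch below reproduces it with plain sockets and no Ray involved. The helper name `try_double_bind` is made up for illustration, and this only shows the class of OS error, not what the Ray agent actually does internally.

```python
import errno
import socket


def try_double_bind(host="127.0.0.1"):
    """Bind two TCP sockets to the same address.

    Returns the errno raised by the second bind, or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0 asks the OS for any free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same host and port as the first socket, so this should conflict.
            second.bind((host, port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    print(try_double_bind() == errno.EADDRINUSE)
```

If something else on the runner already holds the port the agent wants, the agent's `bind` would fail in the same way.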
@simon-mo @modin-project/modin-ray @mattip can you tell what's going wrong? Should we add some arbitrary dashboard agent port, e.g. with --dashboard-agent-grpc-port 999, to out
complete error log from the run I posted about here is attached. 11_test-windows (3.8, ray, modinpandastesttest_series.py).txt
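A minimal sketch of the "hard-coded agent port" idea from the error message, assuming we would rather ask the OS for a currently free port than guess a fixed number. The helper names `pick_free_port` and `build_ray_start_command` are hypothetical, `--head` is my assumption for a local single-node start, and only `--dashboard-agent-grpc-port` comes from the error text above. There is an unavoidable race here: another process could grab the port between the check and the moment Ray binds it.

```python
import socket


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for a TCP port that is free right now (it may not stay free)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


def build_ray_start_command(port):
    """Build the command line only; nothing is executed here."""
    return ["ray", "start", "--head", "--dashboard-agent-grpc-port", str(port)]


if __name__ == "__main__":
    print(" ".join(build_ray_start_command(pick_free_port())))
```

This only prints the command, so it can be checked without Ray installed. Whether the tests would then connect to that cluster instead of starting their own is a separate question.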
We don't need the dashboard at all. I'll see whether `--include-dashboard=False` works.
The dashboard agent is not a subcomponent of the dashboard. Sorry about the confusion in naming; Ray core is working on updating the terminology.
The agent currently cannot be disabled. This only happens when you are trying to run multiple Ray clusters on the same machine, right? I believe @iycheng has looked into this before. Yi, do you have a suggestion for properly running multiple Ray clusters for testing on the same machine?
@simon-mo we shouldn't be starting multiple Ray clusters here. AFAICT this action should run on a fresh VM, and the only time we should initialize Ray is when we run the test command.
Each run should be on a fresh VM, according to the documentation here:
If you use a GitHub-hosted runner, each job runs in a fresh instance of a runner image specified by runs-on
I am lowering the priority since the problem has not been observed for a long time.