ray
ray copied to clipboard
[Core] ray.init() stuck at "Started a local Ray instance."
What happened + What you expected to happen
When I am trying to do a ray.init(logging_level='debug')
on a windows server it gets stuck at Started a local Ray instance
Versions / Dependencies
Ray version[RLLIB]: 2.5.1
OS: Windows 10
Python: 3.8.5
Reproduction script
I am just doing
import ray
ray.init(logging_level='debug')
Debug message I am getting is
2023-07-12 17:07:30,736 DEBUG worker.py:1412 -- Could not import resource module (on Windows)
2023-07-12 17:07:37,350 DEBUG node.py:1149 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\3\ray\session_2023-07-12_17-07-37_304650_56384\logs.
2023-07-12 17:07:37,473 DEBUG node.py:605 -- Failed to send request to gcs, reconnecting. Error failed to connect to all addresses
2023-07-12 17:07:41,530 DEBUG node.py:1187 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\3\ray\session_2023-07-12_17-07-37_304650_56384\logs.
2023-07-12 17:07:41,820 DEBUG services.py:1894 -- Determine to start the Plasma object store with 72.38 GB memory using C:\Users\paula\AppData\Local\Temp\3.
2023-07-12 17:07:42,004 INFO worker.py:1636 -- Started a local Ray instance.
Issue Severity
High: It blocks me from completing my task.
cc @mattip can you follow up and triage?
What CPU and how many cores does it have? How much memory? Does running in regular mode, without debug
work?
It is a server with Intel Xenon Gold 6230 CPU @ 2.1 Ghz (80 cores as seen on Task manager, maybe because of multi threading it is showing more) with 384 GB of ram. I tried doing a ray.init() and getting the following
Traceback (most recent call last):
File "F:\........\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\node.py", line 293, in __init__
ray._private.services.wait_for_node(
File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\services.py", line 459, in wait_for_node
raise TimeoutError(
TimeoutError: Timed out after 60 seconds while waiting for node to startup. Did not find socket name tcp://127.0.0.1:58449 in the list of object store socket names.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\worker.py", line 1534, in init
_global_node = ray._private.node.Node(
File "F:\.......\venvs\G2OP_RLLIB\lib\site-packages\ray\_private\node.py", line 298, in __init__
raise Exception(
Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.
80 cores as seen on Task manager
Maybe connected to the large number of cores taking too long to spin up all the processes on windows. In order to test that hypothesis, could you try ray.init(num_cpus=N)
and figure out how large N
can be before the function fails? I do not have access to such a large machine.
Okay so I did a ray.init(num_cpus=2)
and got the following message and the terminal froze. The grpcio version I have is 1.56.0
2023-07-26 10:48:42,392 INFO worker.py:1636 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.8.5', ray_version='2.5.1', ray_commit='a03efd9931128d387649dd48b0e4864b43d3bfb4', address_info={'node
_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': None, 'object_store_address': 'tcp://127.0.0.1:65389', 'raylet_socket_
name': 'tcp://127.0.0.1:65442', 'webui_url': '', 'session_dir': 'C:\\Users\\......\\AppData\\Local\\Temp\\6\\ray\\session_2023-07-26_10-48-36_478614_
41080', 'metrics_export_port': 65526, 'gcs_address': '127.0.0.1:65386', 'address': '127.0.0.1:65386', 'dashboard_agent_listen_port': 52365, 'node_id
': 'd1b0c0601ebf26445176b6bb12a84bf4bc1a49f1c3414177d2b0e103'})
>>> (raylet) [2023-07-26 10:48:45,736 E 51140 31988] (raylet.exe) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed.
The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio versio
n `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/dashboard_agent.lo
g`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
(raylet) *** SIGTERM received at time=1690382925 ***
(raylet) @ 00007FF61E2B0F6C (unknown) (unknown)
(raylet) @ 00007FF61E2B03DE (unknown) (unknown)
(raylet) @ 00007FF8E27B268A (unknown) o_exp
(raylet) @ 00007FF8E5FA7AC4 (unknown) BaseThreadInitThunk
(raylet) @ 00007FF8E627A351 (unknown) RtlUserThreadStart
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361: *** SIGTERM received at time=1690382925 ***
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361: @ 00007FF61E2B0F6C (unknown) (unknown)
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361: @ 00007FF61E2B03DE (unknown) (unknown)
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361: @ 00007FF8E27B268A (unknown) o_exp
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361: @ 00007FF8E5FA7AC4 (unknown) BaseThreadInitThunk
(raylet) [2023-07-26 10:48:45,751 E 51140 31988] (raylet.exe) logging.cc:361: @ 00007FF8E627A351 (unknown) RtlUserThreadStart
2023-07-26 10:48:59,384 WARNING worker.py:2019 -- The node with node id: d1b0c0601ebf26445176b6bb12a84bf4bc1a49f1c3414177d2b0e103 and address: 127.0
.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet c
rashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
@mattip Could you pls provide an update on this issue? Thanks.
If ray 2.7 moves us to a world without python grpcio, we can revisit the issues that have
The version of
grpcio
doesn't follow Ray's requirements
in the logs
There are quite a few similar errors, I think they all using a non-compliant version of grpcio
@AvisP does this still happen on ray 2.8?
@mattip For the windows server with lots of cores I got this when I did a ray.init(num_cpus=2)
on python 3.10.11
2023-11-22 16:47:09,371 INFO worker.py:1673 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.11', ray_version='2.8.0', ray_commit='105355bd253d6538ed34d331f6a4bdf0e38ace3a', protocol_version=None)
>>> [33m(raylet)[0m [2023-11-22 16:47:12,284 E 68664 94064] (raylet.exe) agent_manager.cc:70: The raylet exited immediately because one Ray agent failed, agent_name = runtime_env_agent.
[33m(raylet)[0m The raylet fate shares with the agent. This can happen because
[33m(raylet)[0m - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
[33m(raylet)[0m - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
[33m(raylet)[0m - The agent is killed by the OS (e.g., out of memory).
[33m(raylet)[0m *** SIGTERM received at time=1700689632 ***
[33m(raylet)[0m @ 00007FF6A41785C6 (unknown) (unknown)
[33m(raylet)[0m @ 00007FF6A418EE86 (unknown) (unknown)
[33m(raylet)[0m @ 00007FF6A418E5BE (unknown) (unknown)
[33m(raylet)[0m @ 00007FF8C242268A (unknown) o_exp
[33m(raylet)[0m @ 00007FF8C2797AC4 (unknown) BaseThreadInitThunk
[33m(raylet)[0m @ 00007FF8C551A351 (unknown) RtlUserThreadStart
[33m(raylet)[0m [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: *** SIGTERM received at time=1700689632 ***
[33m(raylet)[0m [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF6A41785C6 (unknown) (unknown)
[33m(raylet)[0m [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF6A418EE86 (unknown) (unknown)
[33m(raylet)[0m [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF6A418E5BE (unknown) (unknown)
[33m(raylet)[0m [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF8C242268A (unknown) o_exp
[33m(raylet)[0m [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF8C2797AC4 (unknown) BaseThreadInitThunk
[33m(raylet)[0m [2023-11-22 16:47:12,303 E 68664 94064] (raylet.exe) logging.cc:361: @ 00007FF8C551A351 (unknown) RtlUserThreadStart
2023-11-22 16:47:26,277 WARNING worker.py:2074 -- The node with node id: 4fa93e7451f6a84361e9c2a042e02e8cfe57ae10e06bc9cf428a4ddf and address: 127.0.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
Then I interrupted using keyboard and did a ray.shutdown()
and tried initializing again and then I got
2023-11-22 16:48:22,439 INFO worker.py:1673 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.11', ray_version='2.8.0', ray_commit='105355bd253d6538ed34d331f6a4bdf0e38ace3a', protocol_version=None)
Although it works fine on windows laptop with smaller number of cores.
For python 3.11
I can't even install ray on a windows based environment due to this issue https://github.com/ray-project/ray/issues/39727 https://github.com/ray-project/ray/issues/38300
I am intrigued by the The version of
grpcio doesn't follow Ray's requirement
in the log. How exactly are you installing the software stack? What is pulling in a non-compliant version of grpcio
? Perhaps you are seeing some stall due to some other package, like NumPy, that tries to open worker threads on every core?
@AvisP can you provide more details to @mattip Q above?
I think I just did pip install ray
, I wanted to confirm by testing on the server with multiple cores running Windows but I can get access end of next week and provide more info on what I see during installation.
Any update of how you installed the software?
Sorry for late response. So I created a new virtual environment with python 3.10.11
and did a pip install ray
which installed version 2.9.2
. Then I did a import ray
and then a ray.init(num_cpus=4)
and it gave the following error message
>>> ray.init(num_cpus=4)
Traceback (most recent call last):
File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\node.py", line 312, in __init__
ray._private.services.wait_for_node(
File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\services.py", line 463, in wait_for_node
raise TimeoutError(
TimeoutError: Timed out after 60 seconds while waiting for node to startup. Did not find socket name tcp://127.0.0.1:63251 in the list of object store socket names.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\worker.py", line 1618, in init
_global_node = ray._private.node.Node(
File "F:\AvisPaul\venvs\ray_3_10_test\lib\site-packages\ray\_private\node.py", line 317, in __init__
raise Exception(
Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.
I exited the python environment and tried it again and it seemed to work with the following message
2024-02-21 15:29:34,099 INFO worker.py:1724 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.11', ray_version='2.9.2', ray_commit='fce7a361807580953364e2da964f9498f3123bf9', protocol_version=None)
However if I do only ray.init()
was leaving it at 2024-02-21 15:33:16,903 INFO worker.py:1724 -- Started a local Ray instance.
indefinitely. So I did a ray.init(logging_level='debug')
and this is what I got
2024-02-21 15:33:11,105 DEBUG worker.py:1483 -- Could not import resource module (on Windows)
2024-02-21 15:33:13,029 DEBUG node.py:1303 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\ray\session_2024-02-21_15-33-12_966592_144860\logs.
2024-02-21 15:33:16,404 DEBUG node.py:1332 -- Process STDOUT and STDERR is being redirected to C:\Users\paula\AppData\Local\Temp\ray\session_2024-02-21_15-33-12_966592_144860\logs.
2024-02-21 15:33:16,410 DEBUG npu.py:58 -- Could not import AscendCL: No module named 'acl'
2024-02-21 15:33:16,411 DEBUG tpu.py:119 -- Failed to detect number of TPUs: [WinError 3] The system cannot find the path specified: '/dev/vfio'
2024-02-21 15:33:16,624 DEBUG services.py:2064 -- Determine to start the Plasma object store with 92.53 GB memory using C:\Users\paula\AppData\Local\Temp.
2024-02-21 15:33:16,903 INFO worker.py:1724 -- Started a local Ray instance.
Where did you get python, ~how are you installing ray~? We have seen some problems with the python from the windows app store.
I downloaded the installer for 3.10.11
from here. If there is another version that you know has no issues I can try that
No, that build should work well, but the first time you run it you will have to allow firewall exceptions via a giu pop-up box. Did you see that?
The last log snippet on your last attempt looks like it is all working: Started a local Ray instance.
. Could you summarize for me where this all stands? I am confused as to what works and what doesn't.
I got a similar problem:
Python 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.24.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import ray
In [2]: ray.init(num_cpus=2, logging_level="debug")
2024-05-15 16:15:24,958 DEBUG node.py:293 -- Setting node ID to 2e3620292cacbc0db97e26bbb3f3fbd3e3733955eb3b7f3aa12b71fb
2024-05-15 16:15:24,959 DEBUG node.py:1344 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-24_957954_661582/logs.
2024-05-15 16:15:26,363 DEBUG node.py:1373 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-24_957954_661582/logs.
2024-05-15 16:15:29,428 DEBUG npu.py:60 -- Could not import AscendCL: No module named 'acl'
2024-05-15 16:15:29,430 DEBUG services.py:2077 -- Determine to start the Plasma object store with 78.2 GB memory using /dev/shm.
2024-05-15 16:15:30,588 INFO worker.py:1749 -- Started a local Ray instance.
Out[2]: RayContext(dashboard_url='', python_version='3.11.9', ray_version='2.22.0', ray_commit='a8ab7b87a59b833242fe3564c61d7b39265cd838')
In [3]: ray.shutdown()
2 cores works
In [4]: ray.init(num_cpus=5, logging_level="debug")
2024-05-15 16:15:50,060 DEBUG node.py:293 -- Setting node ID to 97a9a7a0d6b57dbb9c182dd042fa56b7aae8c9f04dd61fa1ac1b0a7c
2024-05-15 16:15:50,061 DEBUG node.py:1344 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-50_060016_661582/logs.
2024-05-15 16:15:51,436 DEBUG node.py:1373 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2024-05-15_16-15-50_060016_661582/logs.
2024-05-15 16:15:54,531 DEBUG npu.py:60 -- Could not import AscendCL: No module named 'acl'
2024-05-15 16:15:54,532 DEBUG services.py:2077 -- Determine to start the Plasma object store with 78.21 GB memory using /dev/shm.
2024-05-15 16:15:55,588 INFO worker.py:1749 -- Started a local Ray instance.
This hang forever.
@albert-ying for completeness: what machine are you running this on, how many CPUs does it have?
@mattip Arch Linux x86_64, 112 cores
80 cores as seen on Task manager
Maybe connected to the large number of cores taking too long to spin up all the processes on windows. In order to test that hypothesis, could you try
ray.init(num_cpus=N)
and figure out how largeN
can be before the function fails? I do not have access to such a large machine.
I have a similar problem, and this is useful in my issue. But I think it's not friendly enough that it only shows "INFO worker.py:1794 -- Started a local Ray instance" when ray.init()
is executing. How can I get more information to ensure the program is running normally?
How can I get more information
Either ray status
or open the dashboard. But note #45578 which is meant to fix the dashboard reporting on windows
Closing this. Please reopen or open a new issue if the problem reappears
i also encountered this hang issue. i did test @albert-ying's method, test on python interpreter shell, by increasing the num_cpus, i don't get any error. does Started a local Ray instance
needs to download anything from the internet, can this related to URL blocked?
@cometta This issue is closed, please open a new one with
- details about how you installed ray and which version
- version of python and platform
- copies of the relevant logs.