[Core] Allow GCS && runtime env agent to self-assign ports and report via pipe
Description
Previously, if the user did not specify them, Ray preassigned the GCS port, dashboard agent port, runtime environment port, etc., and passed them to each component at startup. This created a race condition: Ray might believe a port is free, but by the time the port information is propagated to each component, another process may have already bound to that port.
This can cause user-facing issues, for example when Raylet heartbeat messages are missed frequently enough that the GCS considers the node unhealthy and removes it.
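The race described above can be reproduced in a few lines. This is a minimal sketch, not Ray's actual startup code: `pick_free_port` and `demonstrate_race` are hypothetical names, and the "intruder" socket stands in for any unrelated process that binds the port between the probe and the component's own bind.

```python
import socket


def pick_free_port() -> int:
    """Ask the OS for a free port, then release it (the racy pre-assignment pattern)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


def demonstrate_race() -> bool:
    """Return True if the pre-assigned port was taken before the component bound it."""
    port = pick_free_port()  # the parent now believes this port is free

    # Something else binds the port while the value is still being propagated.
    intruder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    intruder.bind(("127.0.0.1", port))
    intruder.listen(1)

    # The intended component finally starts and tries to use the same port.
    component = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        component.bind(("127.0.0.1", port))
        return False
    except OSError:
        return True  # typically "address already in use"
    finally:
        component.close()
        intruder.close()
```

The probe only proves the port was free at one instant; nothing reserves it afterwards, which is why the window exists.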
We originally did this because there was no standard local service discovery, so components had no way to know each other’s serving ports unless they were preassigned. The feasible approaches for local service discovery are:
- The parent pre-reserves the port and passes an FD.
- The child calls bind() and reports the port via IPC/file/object store.
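For comparison, the first approach (the parent reserves the port and hands over the file descriptor) might look like the sketch below. It assumes a POSIX system and a Python child; `CHILD_SOURCE` and `parent_reserves_port` are hypothetical names used only for illustration, and this is not the approach the PR takes.

```python
import socket
import subprocess
import sys

# Hypothetical child program: adopts the inherited listening socket by FD.
CHILD_SOURCE = """
import socket, sys
sock = socket.socket(fileno=int(sys.argv[1]))  # reuse the parent's bound socket
print(sock.getsockname()[1])
"""


def parent_reserves_port() -> tuple[int, int]:
    """Bind in the parent, pass the FD to the child, return (parent_port, child_port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    parent_port = listener.getsockname()[1]
    fd = listener.fileno()
    # pass_fds keeps the descriptor open (under the same number) in the child.
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE, str(fd)],
        pass_fds=[fd],
        capture_output=True,
        text=True,
        check=True,
    )
    listener.close()
    return parent_port, int(result.stdout.strip())
```

The port is never released between reservation and use, but every component then has to accept an already-bound socket, which is awkward across Python and C++ processes.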
In our case, because we span both Python and C++ processes, and considering how other well-known projects (such as LLVM) solve this problem, I decided to use pipe(). You can see how lldb-server starts here:
- https://lldb.llvm.org/man/lldb-server.html
- https://github.com/llvm/llvm-project/blob/1d73b68463ba5ef75434f8d13390537b8e66efa9/lldb/tools/lldb-server/lldb-gdbserver.cpp#L332
lldb-server supports a port value of zero, in which case the port number is chosen dynamically and written to the destinations given by the --named-pipe and --pipe arguments.
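The same pattern can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in this PR: it assumes a POSIX system, and `CHILD_SOURCE` and `start_child_and_read_port` are hypothetical names. The child binds to port 0, then writes the OS-chosen port to the write end of a pipe it inherited from the parent.

```python
import os
import subprocess
import sys

# Hypothetical child program: binds to port 0 and reports the chosen port.
CHILD_SOURCE = """
import os, socket, sys
write_fd = int(sys.argv[1])
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))            # port 0: let the OS choose
server.listen(1)
port = server.getsockname()[1]
os.write(write_fd, str(port).encode())   # report as soon as bind succeeds
os.close(write_fd)
"""


def start_child_and_read_port() -> int:
    """Spawn the child with the pipe's write end and read the reported port."""
    read_fd, write_fd = os.pipe()
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, str(write_fd)],
        pass_fds=[write_fd],  # the child inherits only the write end
    )
    # Close the parent's copy of the write end so read() sees EOF
    # once the child has closed its copy.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    proc.wait()
    if not data:
        raise RuntimeError("child exited without reporting a port")
    return int(data)
```

Because the child holds the bound socket for the whole time between choosing the port and reporting it, no other process can take the port in between.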
At a high level, this PR addresses the issue for GCS and the runtime env agent. It:
- Adds the pipe mechanism.
- Adds GCS port reporting via pipe.
- Adds runtime env agent port reporting via pipe.
Considerations
- Parallel process startup: Ray starts components nearly in parallel and currently relies on wait + timeouts for cross-process coordination. While the pipe mechanism was originally intended to enforce stricter startup ordering, doing so globally is impractical due to circular dependencies. Instead, this PR creates pipes where needed and allows a dependent component (e.g., GCS) to hold the read end in advance and wait until the other component (e.g., the dashboard agent) has bound its port and written it. The dashboard agent can therefore be started earlier or later without changing the behavior. This minimizes waiting time and avoids circular dependency issues.
- Immediate port reporting: As soon as a component successfully binds to a port, it writes the value through the pipe. We don't wait for the entire service to finish initialization before reporting the port.
- Removal of port pre-assignment: Because ports can now be determined dynamically:
  - The C++ core worker no longer receives ports at startup; it retrieves them from the Raylet during registration. This means ray.init() connecting to an existing Ray cluster no longer needs to know these ports ahead of worker startup.
  - Port information that was previously cached in local files is now surfaced through the runtime context API: ray.get_runtime_context().get_runtime_env_agent_port().
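The "hold the read end in advance and wait" behavior from the parallel-startup consideration can be sketched as below. This is a simplified illustration, not the PR's implementation: it assumes a POSIX system, uses a thread in place of a separate writer process, and `wait_for_port`, `report_port`, `run_once`, and the port value 12345 are all hypothetical.

```python
import os
import select
import threading
import time


def wait_for_port(read_fd: int, timeout_s: float) -> int:
    """Block until a port arrives on the pipe or the timeout expires."""
    ready, _, _ = select.select([read_fd], [], [], timeout_s)
    if not ready:
        raise TimeoutError("no port reported within the timeout")
    data = os.read(read_fd, 32)
    if not data:
        raise RuntimeError("writer closed the pipe without reporting a port")
    return int(data.decode().strip())


def report_port(write_fd: int, port: int, delay_s: float) -> None:
    """Stand-in for a component that reports its port after some startup delay."""
    time.sleep(delay_s)
    os.write(write_fd, f"{port}\n".encode())
    os.close(write_fd)


def run_once(writer_delay_s: float) -> int:
    """Create the pipe first, then let the writer start early or late."""
    read_fd, write_fd = os.pipe()
    writer = threading.Thread(target=report_port, args=(write_fd, 12345, writer_delay_s))
    writer.start()
    try:
        return wait_for_port(read_fd, timeout_s=5.0)
    finally:
        writer.join()
        os.close(read_fd)
```

The reader gets the same result whether the writer reports immediately or after a delay, which is the property that lets components start in any order.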
Related issues
Closes https://github.com/ray-project/ray/issues/54321
Additional information
Process hierarchy for reporting runtime env port:
Node.py
├── ray client server (reader)
└── Raylet
    ├── Dashboard agent (will NOT have access to pipe)
    └── Runtime env agent (writer)
Process hierarchy for reporting GCS port:
Node.py (reader)
└── GCS (writer)
ray.init() connecting to an existing Ray cluster:
Driver -> C++ core worker
          ├── registers; the Raylet gives the port info to the C++ core worker
          Raylet
Test
For GCS-related work, here is a detailed test I wrote that covers seven starting/connecting cases:
- https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/python/ray/tests/test_gcs_port_reporting.py
For runtime env agent:
- https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/test_agent_port.py
Test that ray_client_server works correctly with dynamic runtime env agent port:
- https://github.com/Yicheng-Lu-llll/ray/blob/port-self-discovery-test-file/test_ray_client_with_runtime_env.py
Awesome @Yicheng-Lu-llll ❤️
Hey @edoakes, whenever you have a moment, I'd love to get your input on this.
Currently, GDB debugging in Ray requires running the process inside tmux, as enforced here: https://github.com/ray-project/ray/blob/d37dff6bc2a9e1decb860f4680205ed8c99e5623/python/ray/_private/services.py#L914
One side effect of this approach is that tmux breaks the parent–child relationship of the spawned process, which prevents pipe-based interactions from working as expected: https://github.com/ray-project/ray/blob/d37dff6bc2a9e1decb860f4680205ed8c99e5623/python/ray/_private/services.py#L952
For reference, the related documentation is here: https://github.com/ray-project/ray/blob/master/doc/source/ray-contribute/debugging.rst
Do you happen to have any background on why this tmux requirement was originally introduced? The code and docs seem to date back about five years, so I’m wondering whether this is still something we want to keep as-is, or if it might make sense to revisit it so that pipe-based interactions continue to work.
> Do you happen to have any background on why this tmux requirement was originally introduced? The code and docs seem to date back about five years, so I’m wondering whether this is still something we want to keep as-is, or if it might make sense to revisit it so that pipe-based interactions continue to work.
I don't know, or have since forgotten. Revisiting this to ensure a parent-child relationship sounds like the right thing to do.
GCS Fault tolerance: GCS fault tolerance requires GCS to restart using exactly the same port, even if it initially starts with a dynamically assigned port (0). Before this PR, GCS cached the port in a file, and this PR preserves the same behavior (although ideally, the port should only be read from the file by the Raylet and its agent).
We should probably put in the follow-ups some way to deal with this better. Requiring the GCS to start on the same port after restarts is still subject to the problem of the port getting taken by something else (though this is probably only a very remote possibility in the steady state). Proper service discovery is something we should think through. For example, the GCS could be resolved via DNS or be discoverable via something like etcd or linkerd, among a myriad of options. But this shouldn't hold up the PR by any means, just maybe call it out in the follow-ups section of the description.
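For reference, the file-cached port behavior under discussion can be sketched roughly as follows. This is an assumption-laden illustration rather than the GCS code: `bind_with_cached_port`, `restart_keeps_port`, and the `gcs_port` file name are hypothetical. The rebind after "restart" can still fail if another process takes the port in the meantime, which is the remaining risk noted above.

```python
import socket
import tempfile
from pathlib import Path


def bind_with_cached_port(port_file: Path) -> socket.socket:
    """Bind to the port recorded in port_file, or pick one dynamically and record it."""
    # First start: no file yet, so request port 0 and let the OS choose.
    port = int(port_file.read_text()) if port_file.exists() else 0
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind(("127.0.0.1", port))  # raises OSError if the cached port is taken
    except OSError:
        server.close()
        raise
    server.listen(1)
    port_file.write_text(str(server.getsockname()[1]))
    return server


def restart_keeps_port() -> bool:
    """Simulate a restart and check that the second bind reuses the first port."""
    with tempfile.TemporaryDirectory() as tmp:
        port_file = Path(tmp) / "gcs_port"  # hypothetical file name
        first = bind_with_cached_port(port_file)
        first_port = first.getsockname()[1]
        first.close()  # simulate the process going away
        second = bind_with_cached_port(port_file)
        same_port = second.getsockname()[1] == first_port
        second.close()
        return same_port
```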