[Ray debugger] Unable to use debugger on Ray Cluster on k8s
What happened + What you expected to happen
I tried to use the debugger extension in VS Code following the guide (https://www.anyscale.com/blog/ray-distributed-debugger), but when I click a paused task to attach the VS Code debugger, I always get the error `connect ECONNREFUSED $ip:port`.
When I run the extension against a local cluster, it works normally.
I also tried adding the `--ray-debugger-external` flag and confirmed that the Ray cluster on k8s can enable the native debugger.
I don't know how to use the VS Code debugger extension against a Ray cluster on k8s. Can you provide relevant guidance or help?
Versions / Dependencies
Ray 2.23.0, Python 3.10.12
Reproduction script
The sample code from the guide linked above.
Issue Severity
High: It blocks me from completing my task.
Or do I need to configure `launch.json` in VS Code?
I think the problem is that the Ray debugger uses a random port, so it's not possible to know ahead of time which port to open when running on Kubernetes.
From https://github.com/ray-project/ray/blob/master/python/ray/util/debugpy.py:
```python
def _ensure_debugger_port_open_thread_safe():
    (...)
    (host, port) = debugpy.listen(
        (ray._private.worker.global_worker.node_ip_address, 0)
    )
```
And from the definition of `listen()` in https://github.com/microsoft/debugpy/blob/main/src/debugpy/public_api.

> This may be different from address if port was 0 in the latter, in which case the adapter will pick some unused ephemeral port to listen on.
In our case we're running ephemeral Ray clusters using the RayJob resource definition from KubeRay, so we could specify a single port. For static Ray clusters, could a port range be a solution?
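To sketch the port-range idea (illustrative only, not Ray's actual API): instead of passing port 0 to `debugpy.listen` and getting an unpredictable ephemeral port, the caller could probe a fixed range and pass the first free port, so that range can be opened in the Kubernetes pod spec ahead of time. `find_open_port` is an invented helper, not part of Ray.

```python
# Illustrative sketch: probe a fixed port range so the chosen port is known to
# lie in a range that can be opened in the Kubernetes pod spec ahead of time.
import socket

def find_open_port(host: str, start: int = 50000, end: int = 51000) -> int:
    """Return the first port in [start, end) that can be bound on host."""
    for port in range(start, end):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError:
                continue  # port in use, try the next one
            return port
    raise RuntimeError(f"no free port in {start}-{end}")

# Ray's debugpy.py could then do something like:
#   port = find_open_port(node_ip_address)
#   debugpy.listen((node_ip_address, port))
```

Note there is a small race between probing a port and debugpy re-binding it; a real implementation would presumably retry on failure.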
@brycehuang30 does the new distributed debugger have this capability? if we don't I say we build forward and add this as a feature request to that.
The distributed debugger currently cannot customize the debugging ports. I think we could solve this in two steps:
- let the debugger use only a fixed range of ports, e.g. 50000-51000, so users can open those ports in k8s
- allow a user-configurable port range, so users can choose the range themselves
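The second step could be as simple as reading the range from an environment variable. The variable name `RAY_DEBUGPY_PORT_RANGE` below is invented for illustration, not an existing Ray setting.

```python
# Hypothetical sketch of a user-configurable port range; the variable name
# RAY_DEBUGPY_PORT_RANGE is invented for illustration.
import os

def parse_port_range(default=(50000, 51000)):
    """Parse 'LOW-HIGH' from the environment, falling back to default."""
    raw = os.environ.get("RAY_DEBUGPY_PORT_RANGE")
    if not raw:
        return default
    low, high = (int(part) for part in raw.split("-"))
    if not (0 < low <= high <= 65535):
        raise ValueError(f"invalid port range: {raw!r}")
    return low, high
```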
Thanks for looking into this. One more thing that needs to be considered when debugging a job running in Kubernetes is the IP address.
From the Ray job log:

```
2024-10-03 11:07:18,021 INFO debugpy.py:66 -- Ray debugger is listening on 100.104.4.3:34983
2024-10-03 11:07:18,023 INFO debugpy.py:87 -- Waiting for debugger to attach...
```
That IP address 100.104.4.3 is internal to the Kubernetes cluster, so when trying to attach from VS Code I get a connection error.
(In this case 127.0.0.1:8265 is being port-forwarded from the Ray dashboard running in Kubernetes.)
Possibly the VS Code debugger extension should connect to the external IP address of the head node rather than the internal node IP address?
I've run into an issue that seems very similar to this one. In fact, it might very well be the same issue.
I'm using Ray 2.30 and I get a connection refused error when I try to connect VS Code to the paused task. I noticed that debugpy on the task actually crashes soon after `debugpy.listen(...)` is called, so by the time I try to connect VS Code, nothing is listening on the configured port anymore (the port printed in the `Ray debugger is listening on <ip>:<port>` log message).
- I also tried Ray 2.39: same issue.
- I tried patching ray to make debugpy run on a fixed port, and/or localhost/0.0.0.0 ip (combined with kubectl port forwarding): all to no avail. In all cases, the root issue seems that nothing is listening anymore on the port where debugpy is supposed to listen.
- I tried running `debugpy.listen` on a k8s pod without Ray, and in that case it works fine: using `lsof` I can see that something is listening on the configured port.
- The underlying debugpy crash is hard to detect, apart from the fact that nothing is listening on the port. However, if you enable extra logging you can see it crash (`BrokenPipeError`) in the logs. I reported this issue in debugpy here (with details on how to find the crash message in `debugpy.pydevd.NNNN.log`): https://github.com/microsoft/debugpy/issues/1749
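A quick way to reproduce the symptom described above (nothing accepting connections on the advertised port) without `lsof` is a plain TCP probe; this is a generic check, unrelated to Ray's internals:

```python
# Generic TCP probe: check whether anything is accepting connections on the
# host:port that the "Ray debugger is listening on <ip>:<port>" log advertises.
import socket

def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False right after the "Waiting for debugger to attach..." log line, the debugpy server has already died, which matches the `BrokenPipeError` crash reported above.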
> The distributed debugger currently cannot customize the debugging ports. I think we could solve this in two steps:
> - let the debugger use only a fixed range of ports, e.g. 50000-51000, so users can open those ports in k8s
> - allow a user-configurable port range, so users can choose the range themselves
We are facing the same issue with our local Ray cluster, in our case behind docker-compose for local development/testing.
We were wondering whether the suggested solution, an optional parameter to specify debugpy ports, is still an option, or whether there is any other recommended way to overcome the issue.
We ended up deploying https://docs.linuxserver.io/images/docker-code-server/ inside the Kubernetes cluster, which can then access the necessary ports
@rasmus-unity Thank you for sharing. Can you explain the specific steps?
I also noticed there is a relevant PR (https://github.com/ray-project/ray/pull/49116). Can I assume this requirement can be met by following that document? cc @brycehuang30
@rasmus-unity and @Moonquakes, thank you for your insights!
To test this, we created a Dockerfile based on the Ray images and installed the SSH server as described in #49116, along with other necessary components. Since we were looking for a solution for agile local development and debugging, we also ended up mounting the source code under development as volumes on the Ray head node and installing various tools we need for development, such as Devbox. This setup allowed us to develop directly on the Ray head and use the Ray Distributed Debugger extension, but we believe it adds a lot of complexity, aside from installing otherwise unnecessary software on the Ray head, that could potentially be avoided.
While this approach was useful and does the trick for us for the moment, we still believe an out-of-the-box solution, without the need to install SSH servers and other dependencies, would be extremely valuable on top of the already excellent Ray Distributed Debugger extension. In our opinion, implementing a way to configure a range of ports for debugpy to listen on, as previously suggested by @brycehuang30, would greatly enhance the development experience.
Hi @rogerfydp, could you explain your setup steps and Dockerfile in more detail? I installed SSH following the instructions in https://github.com/ray-project/ray/pull/49116 and opened port 22, but other problems seem to arise: KubeRay opens some ports by default when no ports are specified, but once port 22 is added manually the defaults are no longer added (https://github.com/ray-project/kuberay/blob/v1.2.2/ray-operator/controllers/ray/common/service.go#L409-L417).
Hi @pcmoritz, would we consider supporting the distributed debugger in KubeRay environments without installing SSH? SSH is a heavy dependency and may introduce security risks. A better mechanism might be to expose a fixed set of ports externally and have the debugger listen only on those ports.
It seems that there is an external plugin that can already achieve this function, FYI: https://ray.slack.com/archives/C01DLHZHRBJ/p1722501457132069, https://github.com/zen-xu/plan-d
The use of SSH in https://github.com/ray-project/ray/pull/49116 is just an example, and you don't actually need to use SSH: you can run the VS Code server inside the cluster in any way you like (maybe as a sidecar, or directly in the Ray container, or even as a separate deployment). The https://docs.linuxserver.io/images/docker-code-server/ image suggested above could be an option, or you can probably also run the upstream https://code.visualstudio.com/docs/remote/vscode-server. In those cases you just need to expose the VS Code frontend to the users.
If somebody has such a setup and wants to contribute a PR, that would be most welcome! You could e.g. change the KubeRay tab in https://github.com/ray-project/ray/pull/49116 to KubeRay (SSH) and add another tab with KubeRay (VS Code Server) with your instructions :)
@pcmoritz Ah, I see. I meant that I want users to be able to use their local VS Code, not a VS Code frontend that we deploy on the server side. All the user should need to do is port-forward some port and have the local VS Code extension connect to it.