
[FlowInsight] Incorrect host PID lookup inside a container causes wrong GPU information in the physical view.

Open zhouhansheng opened this issue 2 months ago • 1 comments

What happened + What you expected to happen

https://github.com/antgroup/ant-ray/blob/main/python/ray/dashboard/modules/reporter/reporter_agent.py#L640C12-L662C34

for container_pid in proc_dirs:
    sched_path = f"/proc/{container_pid}/sched"
    if os.path.exists(sched_path):
        try:
            with open(sched_path, "r") as f:
                first_line = f.readline()
                # Extract host PID using regex
                match = pattern.search(first_line)
                if match:
                    host_pid = int(match.group(1))
                    # Only store if it's one of the PIDs we're looking for
                    if host_pid in host_pids_to_find:
                        host_to_container_pid_map[host_pid] = int(
                            container_pid
                        )
                        # If we've found all the PIDs we need, we can stop searching
                        if len(host_to_container_pid_map) == len(
                            host_pids_to_find
                        ):
                            break
        except (IOError, ValueError):
            # Skip files we can't read or parse
            continue

This code walks the /proc/{container_pid}/sched files and treats the PID on the first line of each file as the host PID. This looks like a bug: the PID recorded on the first line of /proc/{container_pid}/sched should only be the PID inside the container.
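To make the failure mode concrete, here is a minimal sketch of the lookup, run against sample first lines instead of the real /proc. The regex and the helper names (`SCHED_PID_PATTERN`, `pid_from_sched_line`, `build_pid_map`) are assumptions for illustration; the actual `pattern` used in reporter_agent.py is not shown in the snippet above. The sample lines follow the kernel's `name (pid, #threads: n)` format.

```python
import re

# Assumed pattern (hypothetical, not copied from reporter_agent.py):
# the first line of /proc/<pid>/sched looks like "python (42, #threads: 3)".
SCHED_PID_PATTERN = re.compile(r"\((\d+),")


def pid_from_sched_line(first_line):
    """Return the PID printed on the first line of a sched file, or None."""
    match = SCHED_PID_PATTERN.search(first_line)
    return int(match.group(1)) if match else None


def build_pid_map(sched_first_lines, host_pids_to_find):
    """Mimic the lookup: map 'host PID' -> container PID.

    sched_first_lines maps a container PID to the first line of its
    /proc/<pid>/sched file.
    """
    mapping = {}
    for container_pid, line in sched_first_lines.items():
        reported = pid_from_sched_line(line)
        if reported in host_pids_to_find:
            mapping[reported] = container_pid
    return mapping


# Host PID reported for the GPU process (sample value).
host_pids = {31337}

# Case 1: the sched file reports the container PID, so nothing matches.
namespaced = {42: "python (42, #threads: 3)"}
# Case 2: the sched file reports the host PID, so the lookup succeeds.
leaking = {42: "python (31337, #threads: 3)"}

print(build_pid_map(namespaced, host_pids))
print(build_pid_map(leaking, host_pids))
```

In the first case the map stays empty, which matches the missing GPU information described in this issue.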

Process pid using the GPU in the container: Image

Corresponding host process pid: Image

GPU information is missing from the physical view because the host PID obtained inside the container is wrong: Image

If the container is started with --pid=host, the GPU metrics are displayed normally: Image

Versions / Dependencies

ant-ray/main

Reproduction script

The container startup command that triggers the bug: docker run -p 28268:28268 --runtime=nvidia -itd --name own-dashboard --shm-size=16gb --privileged --network dockerBridge ant-ray:verl040-raymain bash

The container startup command without the bug is: docker run -p 28268:28268 --runtime=nvidia --pid=host -itd --name own-dashboard --shm-size=16gb --privileged --network dockerBridge ant-ray:verl040-raymain bash

The only difference between the two commands is the --pid=host flag.

Issue Severity

Low: It annoys or frustrates me.

zhouhansheng avatar Oct 24 '25 03:10 zhouhansheng

@xsuler Please take a look at this issue.

weiquanlee avatar Oct 30 '25 03:10 weiquanlee

@weiquanlee @xsuler What solution do you plan to use to address this issue? As far as I know, on kernels that have already fixed the /proc/pid/sched bug, either --pid=host is required or the host's /proc needs to be mounted into the container. Both options depend on the user's startup command, which doesn't seem very elegant.

zzt93 avatar Nov 23 '25 12:11 zzt93
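The "mount the host's /proc" option from the last comment could be sketched roughly as follows. This is only an illustration, not a fix proposed by the maintainers. It relies on the `NSpid` line of /proc/<pid>/status, which lists a process's PID in each nested PID namespace, outermost first as seen from the namespace that mounted that procfs. The function names and the idea of a host /proc mounted at a path such as /host/proc are assumptions; the sample status text stands in for a real file.

```python
def parse_nspid(status_text):
    """Return the PIDs listed on the NSpid line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(token) for token in line.split()[1:]]
    return []


def host_and_container_pid(status_text):
    """Return (host_pid, container_pid), or None if the process is not nested.

    Assumes the status text was read from the host's procfs
    (for example a hypothetical /host/proc/<pid>/status mount),
    so the first NSpid entry is the host PID and the last one is
    the PID in the innermost namespace.
    """
    pids = parse_nspid(status_text)
    if len(pids) < 2:
        return None
    return pids[0], pids[-1]


# Sample status content for a containerized process (values are made up).
sample_status = "Name:\tpython\nPid:\t31337\nNSpid:\t31337\t42\n"
print(host_and_container_pid(sample_status))
```

As the comment notes, this still depends on how the user starts the container, since the host /proc has to be mounted explicitly.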