[FlowInsight] Incorrect host PID lookup inside the container makes the physical view GPU information incorrect.
What happened + What you expected to happen
https://github.com/antgroup/ant-ray/blob/main/python/ray/dashboard/modules/reporter/reporter_agent.py#L640C12-L662C34
for container_pid in proc_dirs:
    sched_path = f"/proc/{container_pid}/sched"
    if os.path.exists(sched_path):
        try:
            with open(sched_path, "r") as f:
                first_line = f.readline()
                # Extract host PID using regex
                match = pattern.search(first_line)
                if match:
                    host_pid = int(match.group(1))
                    # Only store if it's one of the PIDs we're looking for
                    if host_pid in host_pids_to_find:
                        host_to_container_pid_map[host_pid] = int(
                            container_pid
                        )
                        # If we've found all the PIDs we need, we can stop searching
                        if len(host_to_container_pid_map) == len(
                            host_pids_to_find
                        ):
                            break
        except (IOError, ValueError):
            # Skip files we can't read or parse
            continue
This code walks every /proc/{container_pid}/sched file and treats the PID in the first line as the host PID. That looks like a bug: the PID recorded in the first line of /proc/{container_pid}/sched should only be the PID as seen inside the container, so it never matches the host PIDs in host_pids_to_find.
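A minimal sketch of how the symptom can be checked from inside a container. It assumes the first line of /proc/<pid>/sched has the form "name (PID, #threads: N)"; the regex and both helper names are hypothetical and are not taken from reporter_agent.py.

```python
import re

# Assumed header shape of /proc/<pid>/sched: "name (PID, #threads: N)".
SCHED_HEADER = re.compile(r"\((\d+),\s*#threads:\s*\d+\)")


def pid_from_sched_header(first_line):
    """Return the PID printed in the sched header, or None if it does not parse."""
    match = SCHED_HEADER.search(first_line)
    return int(match.group(1)) if match else None


def sched_reports_namespaced_pid(proc_pid, first_line):
    """True when the header PID equals the /proc directory name.

    Inside a container without --pid=host, this means the file only exposes
    the container-local PID and cannot be used to recover the host PID.
    """
    reported = pid_from_sched_header(first_line)
    return reported is not None and reported == proc_pid
```

Inside the affected container, passing os.getpid() together with the first line of /proc/self/sched should return True if the header only carries the container PID.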
PID of the process using the GPU, as seen inside the container:
PID of the corresponding process on the host:
The physical view is missing GPU information because the host PID obtained inside the container is wrong:
If the container is started with --pid=host, the GPU metrics are displayed correctly:
Versions / Dependencies
ant-ray/main
Reproduction script
The container startup command that triggers the bug is:
docker run -p 28268:28268 --runtime=nvidia -itd --name own-dashboard --shm-size=16gb --privileged --network dockerBridge ant-ray:verl040-raymain bash
The container startup command without the bug is:
docker run -p 28268:28268 --runtime=nvidia --pid=host -itd --name own-dashboard --shm-size=16gb --privileged --network dockerBridge ant-ray:verl040-raymain bash
The only difference between the two commands is the --pid=host flag.
Issue Severity
Low: It annoys or frustrates me.
@xsuler Please take a look at this issue.
@weiquanlee @xsuler What solution do you plan to use to address this issue? As far as I know, on kernels that have already fixed the /proc/pid/sched bug, either --pid=host is required or the host's /proc needs to be mounted into the container. Both options depend on the user's startup command, which doesn't seem very elegant.
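For reference, a rough sketch of what the "mount the host's /proc" option could look like. It is not a proposed patch. It assumes the host procfs is bind-mounted at a hypothetical path /host/proc, that the kernel exposes the NSpid field in /proc/<pid>/status (Linux 4.1 or newer), and that the container sits exactly one PID namespace below the host. Both function names are made up for illustration.

```python
import os


def parse_nspid(status_text):
    """Return the NSpid chain (outermost namespace first) from a status file body."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


def host_to_container_pids(host_pids, host_proc="/host/proc"):
    """Map host PIDs to container PIDs via NSpid in the host's procfs.

    host_proc is an assumed mount point for the host /proc inside the container.
    """
    mapping = {}
    for host_pid in host_pids:
        status_path = os.path.join(host_proc, str(host_pid), "status")
        try:
            with open(status_path, "r") as f:
                chain = parse_nspid(f.read())
        except OSError:
            # Process exited or the host procfs is not mounted; skip it.
            continue
        # Need at least a host entry and one nested-namespace entry.
        if len(chain) >= 2 and chain[0] == host_pid:
            mapping[host_pid] = chain[-1]
    return mapping
```

This still depends on how the user starts the container (the extra mount), so it does not remove the concern raised above; it only shows that the lookup would go from host PID to container PID directly instead of scanning every /proc/<pid>/sched.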