[Dashboard] Ray 2.10: CPU / RAM / GPU usage not correctly displayed on Windows
What happened + What you expected to happen
After installing Ray 2.10 from pip, CPU / RAM / GPU usage is not displayed correctly in the Ray dashboard for RLlib rollout workers. This issue was first mentioned on the Ray Discourse forum.
Versions / Dependencies
ray==2.10.0
Windows==10.0.22631 Build 22631
Python==3.10.13
Reproduction script
```python
# -*- coding: utf-8 -*-
"""File to reproduce logics applied in this library for external use."""
import logging
from typing import OrderedDict
from typing import Tuple

import gymnasium
import numpy as np
import ray
from gymnasium.spaces import Box
from gymnasium.spaces import Dict
from gymnasium.spaces import Discrete
from gymnasium.wrappers import TransformObservation
from ray.rllib.algorithms import PPOConfig
from ray.rllib.core.rl_module.rl_module import SingleAgentRLModuleSpec
from ray.rllib.examples.rl_module.action_masking_rlm import TorchActionMaskRLM
from ray.tune.registry import register_env

logger = logging.getLogger()
logger.setLevel("WARN")


class MyRealObsWrapper(TransformObservation):
    """Special Wrapper needed for new RLlib API stack."""

    def __init__(self, env):  # noqa: D107
        super().__init__(env, self.__transform)

    def __transform(self, orig_obs):  # noqa: D107
        new_obs = orig_obs
        for b in new_obs.keys():
            if b not in ["static_features"]:
                new_obs[b] = np.reshape(new_obs[b], -1)
        # Important to update the observation space, otherwise the RLlib algorithms will not work
        self.observation_space["observations"] = Box(
            0, 1, (len(new_obs["observations"]),)
        )
        return new_obs


class MyEnv(gymnasium.Env):
    """Simple custom environment with nested obs space and action masking."""

    def __init__(self, *args, **kwargs):  # noqa: D107
        print("Init method called.")
        self.action_space = Discrete(3)
        self.observation_space = Dict(
            {
                "action_mask": Box(
                    low=0, high=1, shape=(self.action_space.n,), dtype=np.int8
                ),
                "observations": Box(
                    low=0.0,
                    high=1.0,
                    shape=(3, 4),
                    dtype=np.float32,
                ),
                # "static_features": Dict(...)
            }
        )
        self.episode_done = False
        self._action_max_helper = np.ones(self.action_space.n, dtype=np.int8)
        self.state = np.zeros((3, 4), dtype=np.float32)

    def step(
        self, action: int
    ) -> Tuple[OrderedDict, float, bool, bool, dict]:  # noqa: D102
        print(f"Step function called with action {action}.")
        # Error handling for invalid action
        if action < 0 or action >= self.action_space.n:
            e_string = f"Action [{action}] is not valid! Size of the action space: [{self.action_space.n}]."
            raise Exception(e_string)
        if self._action_max_helper[action] == 0:
            e_string = f"Action [{action}] is not valid as chosen already in episode !"
            raise Exception(e_string)
        some_dict = {}
        if action not in some_dict.keys():
            some_dict[action] = 1
            logger.warning("Action key added to dict.")
        print(f"Existing value in dict: {some_dict[action]}")
        reward = 0 - action
        self.state[action][0] = 1
        self._action_max_helper[action] = 0
        if all(self._action_max_helper[k] == 0 for k in range(3)):
            self.episode_done = True
        print(f"State after step: {self.state}.")
        return self._get_state_repr(), reward, self.episode_done, False, {}

    def _get_state_repr(self) -> OrderedDict:
        return {
            "action_mask": self._action_max_helper,
            "observations": self.state,
        }

    def reset(
        self, *, seed=None, options=None
    ) -> Tuple[OrderedDict, dict]:  # noqa: D102
        print("Reset method called.")
        self.episode_done = False
        # Initial state representation = shape of the obs space.
        self.state = np.zeros((3, 4), dtype=np.float32)
        # Initial action mask = all actions are allowed.
        self._action_max_helper = np.ones(self.action_space.n, dtype=np.int8)
        return self._get_state_repr(), {}


def env_creator(env_config):
    """Create the environment with a wrapper."""
    env = MyEnv()
    env = MyRealObsWrapper(env)
    return env


# Use classic API to register environment
register_env("myenv_wrapped", env_creator)

if __name__ == "__main__":
    rlm_spec = SingleAgentRLModuleSpec(module_class=TorchActionMaskRLM)

    # Algorithm Config, but with the latest RLlib API
    config = (
        PPOConfig()
        .environment("myenv_wrapped")
        # We need to disable preprocessing of observations, because preprocessing
        # would flatten the observation dict of the environment.
        .experimental(_disable_preprocessor_api=True, _enable_new_api_stack=True)
        .framework("torch")
        .resources(num_gpus=1, num_cpus_per_worker=2, num_gpus_per_worker=0.3)
        .rl_module(rl_module_spec=rlm_spec)
        .training(lr=1e-3, train_batch_size=50, sgd_minibatch_size=10)
    )
    algo = config.build()

    # run manual training loop and print results after each iteration
    for i in range(10):
        result = algo.train()
        print(f"Training iteration: {i+1} done")
        # pprint(result)
    ray.shutdown()
```
Issue Severity
Medium: It is a significant difficulty but I can work around it.
@PhilippWillms does it always happen, or only when you use RLlib? If you just create some actors using Ray Core, do you see the CPU/memory usage?
@scottsun94: Good point. Indeed, it also happens for "simple" actors created with Ray Core:
```python
from time import sleep

import ray


@ray.remote
class Counter:
    def __init__(self):
        self.i = 0

    def get(self):
        return self.i

    def incr(self, value):
        sleep(100)
        self.i += value


# Create a Counter actor.
c = Counter.remote()

# Submit calls to the actor. These calls run asynchronously but in
# submission order on the remote actor process.
for _ in range(10):
    c.incr.remote(1)

# Retrieve final actor state.
print(ray.get(c.get.remote()))
```
See the empty bars in columns "CPU" and "Memory".
I believe we use psutil to get the cpu/memory usage of the actor. Do you have it installed in your env?
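For reference, a minimal sketch (plain psutil, not the actual dashboard code) of the kind of per-process data psutil can report; substitute the PID of one of your worker processes:

```python
# Minimal sketch, not the dashboard's code: what psutil reports for one process.
import psutil

pid = 14240  # example only; substitute a Ray worker/actor PID from your machine
proc = psutil.Process(pid)
with proc.oneshot():  # batch the underlying system calls
    print("name:  ", proc.name())
    print("cpu %: ", proc.cpu_percent(interval=0.5))
    print("rss MB:", proc.memory_info().rss / 1024**2)
```

If this prints sensible numbers on Windows, psutil itself works and the gap is somewhere in the dashboard pipeline.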
Here is the output of `pip show psutil` in my Anaconda env:
```
Name: psutil
Version: 5.9.8
Summary: Cross-platform lib for process and system monitoring in Python.
Home-page: https://github.com/giampaolo/psutil
Author: Giampaolo Rodola
Author-email: [email protected]
License: BSD-3-Clause
Location: c:\users\philipp\anaconda3\envs\torch-gpu-310\lib\site-packages
Requires:
Required-by: gpustat, ipykernel, wandb
```
https://github.com/ray-project/ray/assets/9677264/3e58f453-7b8b-4498-a29f-0689eafe533f
Hmm. I can partially reproduce it on my MacBook Pro. When I create an actor, it does show the same empty string / NaN info, but it goes back to normal after a few seconds.
@anyscalesam I don't think this is specific to RLlib, since @PhilippWillms still has this issue when using Ray Core.
@alanwguo do you know if this is more of a front-end issue or a core issue?
@mattip can you take a look at this and try to repro?
That may be caused by the way you create the actors: my first supervisor actor shows up correctly, but the other mass-created actors do not.
Further screenshot for your analysis. Running tune.Tuner today with RLlib's PPO on the nightly build:
I have a similar problem on Windows 10 after updating from Ray 2.7.1 to 2.20.0. Worker name, Memory, and GPU GRAM are not shown correctly. In Ray 2.7.1 everything worked fine.
Changed the title; this reproduces on Windows 10 as well.
@brycehuang30 would you have enough context here to debug further with @mattip? We can grab 15m to do this over a call. If we need Core help, pull in @hongchaodeng, but can you take point on this?
Hongchao, we should also check that this is Windows-only and doesn't also affect Linux and macOS.
We made a little progress by looking at the logs. In dashboard_agent.log there was an error looking up `Process.num_fds`:
```
2024-05-16 20:44:31,121 ERROR reporter_agent.py:1218 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "d:\temp\venv_311\Lib\site-packages\ray\dashboard\modules\reporter\reporter_agent.py", line 1201, in _perform_iteration
    if not self._metrics_collection_disabled:
    ^^^^^^^^^^^^^^^^^^^^^
  File "d:\temp\venv_311\Lib\site-packages\ray\dashboard\modules\reporter\reporter_agent.py", line 724, in _get_all_stats
    "bootTime": self._get_boot_time(),
                ^^^^^^^^^^^^^^^^^^
  File "d:\temp\venv_311\Lib\site-packages\ray\dashboard\modules\reporter\reporter_agent.py", line 643, in _get_raylet
    return raylet_proc.as_dict(
           ^^^^^^^^^^^^^^^^^^^^
  File "d:\temp\venv_311\Lib\site-packages\ray\thirdparty_files\psutil\__init__.py", line 546, in as_dict
    raise ValueError(msg)
ValueError: invalid attr name 'num_fds'
```
That comes from here: https://github.com/ray-project/ray/blob/65e13b94c30dab3441b537f0b2a51f9fb9e80c93/dashboard/modules/reporter/reporter_agent.py#L685-L695
That was added in #39790 and is only supported on Unix. Removing it locally made things a little better: now I see the total memory being reported, but not the RSS in use nor the CPU in use.
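For illustration, a sketch (not the actual patch, and with a shortened attrs list) of how the attribute could be made platform-conditional:

```python
# Sketch only, not the actual fix: psutil implements Process.num_fds() on POSIX
# only, so only ask as_dict() for it when not running on Windows.
import sys

import psutil

def raylet_stats(raylet_proc: psutil.Process) -> dict:
    attrs = ["pid", "memory_info", "cpu_percent"]  # trimmed list for illustration
    if sys.platform != "win32":
        attrs.append("num_fds")
    return raylet_proc.as_dict(attrs=attrs)
```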
@mattip The actor table will try to load data by calling the /logical/actors endpoint defined in actor_head.py (code).
Could we collect some data by calling the endpoint localhost:8265/logical/actors directly?
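For example (assuming the default dashboard address and the `requests` package):

```python
# Query the dashboard's actor endpoint directly; assumes the default dashboard
# address http://localhost:8265 and that `requests` is installed.
import json

import requests

resp = requests.get("http://localhost:8265/logical/actors", timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```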
The endpoint comes up empty:
```json
{
  "result": true,
  "msg": "All actors fetched.",
  "data": {
    "actors": {}
  }
}
```
Hmm, that's a bit surprising, since I would expect at least some data like state: ALIVE.
Uhh, trying again, now I get some content:
```json
{
  "result": true,
  "msg": "All actors fetched.",
  "data": {
    "actors": {
      "bcb15a485f24161d7f01e38e01000000": {
        "actorId": "bcb15a485f24161d7f01e38e01000000",
        "jobId": "01000000",
        "address": {
          "rayletId": "99e14a5148a8c2f602e23bd8372b38e59d67fcbd81acac7f90c35bd1",
          "ipAddress": "127.0.0.1",
          "port": 60064,
          "workerId": "e1e6e1f4893fbcbc1230d5469115876f75aa2ea3030dd37c7c00f2f1"
        },
        "className": "Counter",
        "state": "ALIVE",
        "numRestarts": "0",
        "name": "",
        "timestamp": 1715888760397,
        "pid": 14240,
        "startTime": 1715888760397,
        "endTime": 0,
        "reprName": "",
        "actorClass": "Counter",
        "exitDetail": "-",
        "requiredResources": {},
        "actorConstructor": "Unknown actor constructor",
        "gpus": [],
        "processStats": null,
        "mem": [10485211136, 5137608704, 51, 5347602432]
      }
    }
  }
}
```
Note that processStats is null.
The endpoint will collect processStats here, and the source of the data is reporter_agent.py, which is one of the places we looked at during debugging.
Tracing the gap between the reporter and the DataSource, I think this is where the data gets filled into DataSource:
https://github.com/ray-project/ray/blob/856453f2648d08d3373531ac16de4bf7b7722acf/dashboard/modules/reporter/reporter_head.py#L649
I think the next step will be digging into reporter_agent.py to see whether the metrics are read from psutil correctly.
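Independent of reporter_agent.py, a quick standalone sanity check (a sketch, not Ray code) is to ask psutil directly about the Ray-related processes on the Windows machine and see whether CPU/RSS values come back:

```python
# Standalone sanity check, not Ray code: does psutil return plausible CPU/RSS
# numbers for the Ray-related processes on this machine?
import psutil

for proc in psutil.process_iter(["pid", "name", "cpu_percent", "memory_info"]):
    name = (proc.info["name"] or "").lower()
    if "ray" in name or "python" in name:
        mem = proc.info["memory_info"]
        rss_mb = mem.rss / 1024**2 if mem else 0
        print(proc.info["pid"], name, proc.info["cpu_percent"], f"{rss_mb:.1f} MB")
```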
Going across recent ray versions:
- 2.7.0: shows CPU and memory for the cluster, but there is no UI to show it for actors.
- 2.8.0: the dashboard did not work; I see "GET http://localhost:8265/static/js/main.4e04a38d.js [HTTP/1.1 403 Forbidden 0ms]".
- 2.9.0 (after removing `num_fds`): CPU and memory for the cluster show, and there is no UI to show it for actors.
- 2.10.0 (after removing `num_fds`): CPU and memory for the cluster show, the UI to show it for actors is visible, but the values are empty.
So I think this has never worked on Windows since it was added in 2.10.0, and I agree that reporter_agent.py would be the next stop. I tried to add some debug printing, but it seems the code is not called. Feel free to call me out if something here looks wrong; I will check it all again tomorrow.
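One way to check whether that code path is reached at all (a throwaway debug hack, not something to keep): the reporter agent runs in its own process, so `print` output may never reach your console, and appending to a file can be more reliable. For example, dropped at the top of the function being traced:

```python
# Throwaway debug hack: append a line to a temp file each time the traced
# function in reporter_agent.py is entered, since agent stdout is not visible.
import datetime
import os
import tempfile

log_path = os.path.join(tempfile.gettempdir(), "reporter_debug.log")
with open(log_path, "a") as f:
    f.write(f"{datetime.datetime.now().isoformat()} _get_all_stats called\n")
```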
The summary looks good! When I was debugging Ray Python files, I sometimes needed to restart the Ray cluster to pick up the updated code, via `ray stop; ray start --head`. Maybe worth a try if the debug logs still don't get printed.
> I think this is where the data gets filled into DataSource
That is in the http://127.0.0.1:8265/memory_profile endpoint. When I try that (on Linux, with the test script running) I get an exception:
```
Traceback (most recent call last):
  File "/tmp/venv311/lib/python3.11/site-packages/ray/dashboard/optional_utils.py", line 94, in _handler_route
    return await handler(bind_info.instance, req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv311/lib/python3.11/site-packages/ray/dashboard/modules/reporter/reporter_head.py", line 534, in memory_profile
    pid = int(req.query["pid"])
          ~~~~~~~~~^^^^^^^
KeyError: 'pid'
```
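Judging from that traceback, the handler reads `req.query["pid"]`, so passing a `pid` query parameter should at least get past this KeyError; whether more parameters are needed is a guess on my part:

```python
# Guess based on the traceback above: pass a `pid` query parameter. Other
# parameters may be required as well; this is not documented usage.
import requests

resp = requests.get(
    "http://127.0.0.1:8265/memory_profile",
    params={"pid": 14240},  # substitute a real worker PID
    timeout=60,             # profiling may take a while
)
print(resp.status_code, resp.text[:500])
```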
I still cannot find the point at which the data is collected from the actor.
Hi @mattip, the `_perform_iteration` function in reporter_agent.py is the thing that records the stats and pushes them to the head node, where they are provided as an API to power the UI.
- `_get_all_stats` collects the metrics from the machine using libraries like psutil.
- `_record_stats` is used to publish some of these metrics to Prometheus; it is not relevant to the Dashboard UI.
- `await publisher.publish_resource_usage(self._key, jsonify_asdict(stats))` is what publishes the stats to a GCS pubsub.
- In reporter_head.py, the `key, data = await subscriber.poll()` line listens to these events and stores the values into DataSource, which is used to power the APIs that power the UI.

Hope that helps! Can you see if any of those points are not working correctly on a Windows system?
I think the problem is all the way down in _get_workers, which returns an empty list on Windows. It seems the only child of raylet_proc there is the agent. On Linux this is not the case: the children are all the workers plus the agent.
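A standalone way to see that difference (a sketch, not the dashboard code): locate the raylet process with psutil and print its child tree on both platforms.

```python
# Sketch, not dashboard code: print the raylet's process tree. On Linux the
# children should include the worker processes; compare with Windows.
import psutil

for proc in psutil.process_iter(["pid", "name"]):
    if "raylet" in (proc.info["name"] or "").lower():
        print("raylet:", proc.info["pid"], "parent:", proc.parent())
        for child in proc.children(recursive=True):
            try:
                print("  child:", child.pid, child.name())
            except psutil.NoSuchProcess:
                pass
```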
Getting somewhere.
- Windows has no `num_fds` in psutil's data (as found above).
- Windows wraps the Python process in a runner, so `psutil.Process.parent()` will return the runner, not raylet.exe (in `_get_raylet_proc`). By recognizing this in both `_get_raylet_proc` and `_get_agent_proc`, I now get data from `_get_workers`. This is reflected in the `stats` from `_get_all_stats` and in `jsonify_asdict(stats)`. But the dashboard is still not showing the data. Maybe the PID that the dashboard is looking for is not the correct one: again confusion between the runner and the actual process? (A rough sketch of this idea follows below.)
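A rough sketch of what "taking the runner into account" could look like (illustration only; the helper names here are made up, and the real change is in #45578, referenced in the next comment):

```python
# Illustration only; helper names are made up, the real change is in #45578.
# Idea: on Windows a launcher/runner process sits between the raylet and the
# Python workers, so don't look exactly one level up or down.
import psutil

def find_raylet_ancestor(proc: psutil.Process):
    """Walk up the parent chain until a process named like the raylet is found."""
    current = proc.parent()
    while current is not None:
        if "raylet" in current.name().lower():
            return current
        current = current.parent()  # steps over the Windows runner in between
    return None

def find_worker_descendants(raylet_proc: psutil.Process):
    """Collect Python workers below the raylet, looking through the runner."""
    return [
        child
        for child in raylet_proc.children(recursive=True)
        if "python" in child.name().lower()
    ]
```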
Locally, the changes to reporter_agent.py in #45578 fixed the problems for me.
- don't look for `num_fds` on Windows
- take the launcher into account when traversing `proc.parent()` and `proc.children()`