[Dashboard] Ray 2.10: CPU / RAM / GPU usage not correctly displayed on Windows

Open PhilippWillms opened this issue 1 year ago • 28 comments

What happened + What you expected to happen

After installing Ray 2.10 from pip, the Ray dashboard does not display CPU / RAM / GPU usage correctly for RLlib rollout workers. This issue was first mentioned on the Ray Discourse forum.

image

Versions / Dependencies

ray==2.10.0
Windows==10.0.22631 Build 22631
Python==3.10.13

Reproduction script

# -*- coding: utf-8 -*-
"""File to reproduce logics applied in this library for external use."""
import logging
from typing import OrderedDict
from typing import Tuple

import gymnasium
import numpy as np
import ray
from gymnasium.spaces import Box
from gymnasium.spaces import Dict
from gymnasium.spaces import Discrete
from gymnasium.wrappers import TransformObservation
from ray.rllib.algorithms import PPOConfig
from ray.rllib.core.rl_module.rl_module import SingleAgentRLModuleSpec
from ray.rllib.examples.rl_module.action_masking_rlm import TorchActionMaskRLM
from ray.tune.registry import register_env

logger = logging.getLogger()
logger.setLevel("WARN")


class MyRealObsWrapper(TransformObservation):
    """Special Wrapper needed for new RLlib API stack."""

    def __init__(self, env):  # noqa: D107
        super().__init__(env, self.__transform)

    def __transform(self, orig_obs):  # noqa: D107
        new_obs = orig_obs
        for b in new_obs.keys():
            if b not in ["static_features"]:
                new_obs[b] = np.reshape(new_obs[b], -1)
        # Important to update the observation space, otherwise the RLlib algorithms will not work
        self.observation_space["observations"] = Box(
            0, 1, (len(new_obs["observations"]),)
        )
        return new_obs


class MyEnv(gymnasium.Env):
    """Simple custom environment with nested obs space and action masking."""

    def __init__(self, *args, **kwargs):  # noqa: D107
        print("Init method called.")
        self.action_space = Discrete(3)
        self.observation_space = Dict(
            {
                "action_mask": Box(
                    low=0, high=1, shape=(self.action_space.n,), dtype=np.int8
                ),
                "observations": Box(
                    low=0.0,
                    high=1.0,
                    shape=(3, 4),
                    dtype=np.float32,
                ),
                # "static_features": Dict(...)
            }
        )
        self.episode_done = False
        self._action_max_helper = np.ones(self.action_space.n, dtype=np.int8)
        self.state = np.zeros((3, 4), dtype=np.float32)

    def step(
        self, action: int
    ) -> Tuple[OrderedDict, float, bool, bool, dict]:  # noqa: D102
        print(f"Step function called with action {action}.")
        # Error handling for invalid action
        # Valid actions are 0 .. n-1, so n itself must be rejected as well.
        if action < 0 or action >= self.action_space.n:
            e_string = f"Action [{action}] is not valid! Size of the action space: [{self.action_space.n}]."
            raise Exception(e_string)
        if self._action_max_helper[action] == 0:
            e_string = f"Action [{action}] is not valid as chosen already in episode !"
            raise Exception(e_string)

        some_dict = {}
        if action not in some_dict.keys():
            some_dict[action] = 1
            logger.warning("Action key added to dict.")
        print(f"Existing value in dict: {some_dict[action]}")

        reward = 0 - action
        self.state[action][0] = 1
        self._action_max_helper[action] = 0
        if all(self._action_max_helper[k] == 0 for k in range(3)):
            self.episode_done = True
        print(f"State after step: {self.state}.")
        return self._get_state_repr(), reward, self.episode_done, False, {}

    def _get_state_repr(self) -> OrderedDict:
        return {
            "action_mask": self._action_max_helper,
            "observations": self.state,
        }

    def reset(
        self, *, seed=None, options=None
    ) -> Tuple[OrderedDict, dict]:  # noqa: D102
        print("Reset method called.")
        self.episode_done = False
        # Initial state representation = shape of the obs space.
        self.state = np.zeros((3, 4), dtype=np.float32)
        # Initial action mask = all actions are allowed.
        self._action_max_helper = np.ones(self.action_space.n, dtype=np.int8)
        return self._get_state_repr(), {}


def env_creator(env_config):
    """Create the environment with a wrapper."""
    env = MyEnv()
    env = MyRealObsWrapper(env)
    return env


# Use classic API to register environment
register_env("myenv_wrapped", env_creator)

if __name__ == "__main__":
    rlm_spec = SingleAgentRLModuleSpec(module_class=TorchActionMaskRLM)

    # Algorithm Config, but with the latest RLlib API
    config = (
        PPOConfig()
        .environment("myenv_wrapped")
        # We need to disable preprocessing of observations, because preprocessing
        # would flatten the observation dict of the environment.
        .experimental(_disable_preprocessor_api=True, _enable_new_api_stack=True)
        .framework("torch")
        .resources(num_gpus=1, num_cpus_per_worker=2, num_gpus_per_worker=0.3)
        .rl_module(rl_module_spec=rlm_spec)
        .training(lr=1e-3, train_batch_size=50, sgd_minibatch_size=10)
    )

    algo = config.build()

    # run manual training loop and print results after each iteration
    for i in range(10):
        result = algo.train()
        print(f"Training iteration: {i+1} done")
        # pprint(result)

    ray.shutdown()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

PhilippWillms avatar Apr 09 '24 19:04 PhilippWillms

@PhilippWillms does it always happen, or only when you use RLlib? If you just create some actors using Ray Core, can you see the CPU/memory usage?

scottsun94 avatar Apr 09 '24 19:04 scottsun94

@scottsun94 : Good point. Indeed it is also happening for "simple" actors from ray core.

from time import sleep
import ray

@ray.remote
class Counter:
    def __init__(self):
        self.i = 0

    def get(self):
        return self.i

    def incr(self, value):
        sleep(100)
        self.i += value

# Create a Counter actor.
c = Counter.remote()

# Submit calls to the actor. These calls run asynchronously but in
# submission order on the remote actor process.
for _ in range(10):
    c.incr.remote(1)

# Retrieve final actor state.
print(ray.get(c.get.remote()))

See the empty bars in columns "CPU" and "Memory".

image

PhilippWillms avatar Apr 09 '24 20:04 PhilippWillms

I believe we use psutil to get the cpu/memory usage of the actor. Do you have it installed in your env?

scottsun94 avatar Apr 09 '24 20:04 scottsun94

See the output of pip show psutil on my anaconda env.

Name: psutil
Version: 5.9.8
Summary: Cross-platform lib for process and system monitoring in Python.
Home-page: https://github.com/giampaolo/psutil
Author: Giampaolo Rodola
Author-email: [email protected]
License: BSD-3-Clause
Location: c:\users\philipp\anaconda3\envs\torch-gpu-310\lib\site-packages
Requires:
Required-by: gpustat, ipykernel, wandb

PhilippWillms avatar Apr 09 '24 20:04 PhilippWillms

https://github.com/ray-project/ray/assets/9677264/3e58f453-7b8b-4498-a29f-0689eafe533f

Hmm. I can partially reproduce it on my MacBook Pro. When I create an actor, it shows the same empty string/NaN info, but it goes back to normal after a few seconds.

scottsun94 avatar Apr 09 '24 20:04 scottsun94

@anyscalesam I don't think this is specific to Rllib since @PhilippWillms still has this issue when using ray core.

@alanwguo do you know if this is more of a front-end issue or a core issue?

scottsun94 avatar Apr 09 '24 20:04 scottsun94

@mattip can you take a look at this and try to repro?

anyscalesam avatar Apr 24 '24 22:04 anyscalesam

image

That may be caused by the way you create the actors. My first supervisor actor is displayed correctly, but the many other actors are not.

tianyangbest avatar Apr 26 '24 08:04 tianyangbest

A further screenshot for your analysis. Running tune.Tuner today with RLlib's PPO on the nightly build:

image

PhilippWillms avatar May 02 '24 13:05 PhilippWillms

I have a similar problem on Windows 10 after updating from Ray 2.7.1 to 2.20.0. Worker name, Memory, GPU GRAM are not showing correctly. In Ray 2.7.1 everything worked fine.
image

oli-walter avatar May 07 '24 09:05 oli-walter

Changed the title; this reproduces on Windows 10 as well.

mattip avatar May 15 '24 09:05 mattip

@brycehuang30 would you have enough context here to debug further with @mattip? We can grab 15 minutes to do this over a call. If we need Core help, pull in @hongchaodeng, but can you take point on this?

Hongchao, we should also check that this is Windows-only and doesn't also affect Linux and macOS.

anyscalesam avatar May 15 '24 17:05 anyscalesam

We made a little progress by looking at the logs. In dashboard_agent.log there was an error looking for Process.num_fds:

2024-05-16 20:44:31,121	ERROR reporter_agent.py:1218 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "d:\temp\venv_311\Lib\site-packages\ray\dashboard\modules\reporter\reporter_agent.py", line 1201, in _perform_iteration
    if not self._metrics_collection_disabled:
            ^^^^^^^^^^^^^^^^^^^^^
  File "d:\temp\venv_311\Lib\site-packages\ray\dashboard\modules\reporter\reporter_agent.py", line 724, in _get_all_stats
    "bootTime": self._get_boot_time(),
              ^^^^^^^^^^^^^^^^^^
  File "d:\temp\venv_311\Lib\site-packages\ray\dashboard\modules\reporter\reporter_agent.py", line 643, in _get_raylet
    return raylet_proc.as_dict(
           ^^^^^^^^^^^^^^^^^^^^
  File "d:\temp\venv_311\Lib\site-packages\ray\thirdparty_files\psutil\__init__.py", line 546, in as_dict
    raise ValueError(msg)
ValueError: invalid attr name 'num_fds'

that comes from here https://github.com/ray-project/ray/blob/65e13b94c30dab3441b537f0b2a51f9fb9e80c93/dashboard/modules/reporter/reporter_agent.py#L685-L695

The num_fds lookup was added in #39790, and psutil only supports num_fds on Unix. Removing it locally made things a little better: I now see the total memory being reported, but neither the RSS in use nor the CPU in use:


Clipboard01
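
A minimal sketch of guarding the Unix-only attribute, assuming the attribute list is built before it is passed to psutil's Process.as_dict. The helper names process_attrs and snapshot and the base attribute list are hypothetical; this is not the code in reporter_agent.py.

```python
import sys


def process_attrs(platform: str = sys.platform) -> list:
    """Return the psutil attribute names to request for a process.

    Hypothetical helper: the base list below is illustrative only.
    """
    attrs = ["pid", "create_time", "cpu_percent", "cpu_times", "cmdline", "memory_info"]
    # psutil only provides Process.num_fds on Unix. Requesting it on Windows
    # makes Process.as_dict raise ValueError("invalid attr name 'num_fds'").
    if not platform.startswith("win"):
        attrs.append("num_fds")
    return attrs


def snapshot(pid: int) -> dict:
    """Collect the selected attributes for one process."""
    import psutil  # imported lazily so process_attrs can be used without psutil

    return psutil.Process(pid).as_dict(attrs=process_attrs())
```

With this guard, snapshot(os.getpid()) should return a dict on both Windows and Linux instead of raising on Windows.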

mattip avatar May 16 '24 18:05 mattip

@mattip The actor table will try to load data by calling the /logical/actors endpoint defined in actor_head.py (code).

Could we collect some data by calling the endpoint localhost:8265/logical/actors directly?

brycehuang30 avatar May 16 '24 19:05 brycehuang30

The endpoint comes up empty:

{
  "result": true,
  "msg": "All actors fetched.",
  "data": {
    "actors": {}
  }
}

mattip avatar May 16 '24 19:05 mattip

Hmm, that's a bit surprising. I would expect at least some data, such as state: ALIVE.

brycehuang30 avatar May 16 '24 19:05 brycehuang30

Uhh, trying again, now I get some content:

{
  "result": true,
  "msg": "All actors fetched.",
  "data": {
    "actors": {
      "bcb15a485f24161d7f01e38e01000000": {
        "actorId": "bcb15a485f24161d7f01e38e01000000",
        "jobId": "01000000",
        "address": {
          "rayletId": "99e14a5148a8c2f602e23bd8372b38e59d67fcbd81acac7f90c35bd1",
          "ipAddress": "127.0.0.1",
          "port": 60064,
          "workerId": "e1e6e1f4893fbcbc1230d5469115876f75aa2ea3030dd37c7c00f2f1"
        },
        "className": "Counter",
        "state": "ALIVE",
        "numRestarts": "0",
        "name": "",
        "timestamp": 1715888760397,
        "pid": 14240,
        "startTime": 1715888760397,
        "endTime": 0,
        "reprName": "",
        "actorClass": "Counter",
        "exitDetail": "-",
        "requiredResources": {},
        "actorConstructor": "Unknown actor constructor",
        "gpus": [],
        "processStats": null,
        "mem": [10485211136, 5137608704, 51, 5347602432]
      }
    }
  }
}

mattip avatar May 16 '24 19:05 mattip

Note processStats is null
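
To check this without reading the response by eye, here is a rough sketch that lists the actors whose processStats is missing. It assumes the payload shape shown in the output above (data.actors keyed by actor ID). The function names are hypothetical, and the sample dict is a trimmed copy of that output.

```python
import json
from urllib.request import urlopen


def actors_without_process_stats(payload: dict) -> list:
    """Return the IDs of actors whose processStats is null or absent."""
    actors = payload.get("data", {}).get("actors", {})
    return [actor_id for actor_id, info in actors.items() if not info.get("processStats")]


def fetch_actors(url: str = "http://localhost:8265/logical/actors") -> dict:
    """Fetch the actor table from a locally running dashboard."""
    with urlopen(url) as resp:
        return json.load(resp)


# Trimmed copy of the response shown above, used instead of a live cluster.
sample = {
    "result": True,
    "msg": "All actors fetched.",
    "data": {
        "actors": {
            "bcb15a485f24161d7f01e38e01000000": {
                "className": "Counter",
                "state": "ALIVE",
                "pid": 14240,
                "processStats": None,
            }
        }
    },
}
print(actors_without_process_stats(sample))
```

Against a live cluster, actors_without_process_stats(fetch_actors()) would do the same check on the real endpoint.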

mattip avatar May 16 '24 19:05 mattip

The endpoint collects processStats here, and the data comes from reporter_agent.py, which is where we looked during debugging.

brycehuang30 avatar May 16 '24 19:05 brycehuang30

Tracing the gap between the reporter and the DataSource: I think this is where the data gets filled into DataSource:

https://github.com/ray-project/ray/blob/856453f2648d08d3373531ac16de4bf7b7722acf/dashboard/modules/reporter/reporter_head.py#L649

brycehuang30 avatar May 16 '24 20:05 brycehuang30

I think the next step will be digging into the reporter_agent.py to see whether the metrics are read from psutil correctly

brycehuang30 avatar May 16 '24 20:05 brycehuang30

Going across recent ray versions:

  • 2.7.0 shows CPU and memory for the cluster, but there is no UI to show it for actors.
  • 2.8.0 the dashboard did not work, I see "GET http://localhost:8265/static/js/main.4e04a38d.js [HTTP/1.1 403 Forbidden 0ms]"
  • 2.9.0 (after removing num_fds): CPU and memory for the cluster shows, and there is no UI to show it for actors
  • 2.10.0 (after removing num_fds): CPU and memory for the cluster shows, the UI to show it for actors is visible but the values are empty

So I think the per-actor display has never worked on Windows since it was added in 2.10.0, and I agree that reporter_agent.py would be the next stop. I tried to add some debug printing, but it seems the code is not called. Feel free to call me out if something here looks wrong; I will check it all again tomorrow.

mattip avatar May 16 '24 20:05 mattip

The summary looks good! When debugging Ray's Python files, I sometimes need to restart the Ray cluster with ray stop; ray start --head before the updated code takes effect. Maybe worth a try if the debug logs still don't get printed.

brycehuang30 avatar May 16 '24 21:05 brycehuang30

> I think this is where the data get filled into DataSource

That is in the http://127.0.0.1:8265/memory_profile endpoint. When I try that (on Linux, with the test script running) I get an exception:

Traceback (most recent call last):
  File "/tmp/venv311/lib/python3.11/site-packages/ray/dashboard/optional_utils.py", line 94, in _handler_route
    return await handler(bind_info.instance, req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv311/lib/python3.11/site-packages/ray/dashboard/modules/reporter/reporter_head.py", line 534, in memory_profile
    pid = int(req.query["pid"])
              ~~~~~~~~~^^^^^^^
KeyError: 'pid'

mattip avatar May 17 '24 09:05 mattip

I still cannot find the point at which the data is collected from the actor.

mattip avatar May 17 '24 09:05 mattip

Hi @mattip, the _perform_iteration function in reporter_agent is what records the stats and pushes them to the head node, where they are exposed through an API that powers the UI.

_get_all_stats collects the metrics from the machine using libraries like psutil

_record_stats is used to publish some of these metrics to prometheus, it's not relevant to the Dashboard UI.

await publisher.publish_resource_usage(self._key, jsonify_asdict(stats)) is what publishes the stats to a GCS pubsub

In reporter_head.py, the key, data = await subscriber.poll() line listens for these events and stores the values in DataSource, which backs the APIs that power the UI.
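
The flow described above can be sketched roughly as follows. This is only a toy model: an asyncio.Queue stands in for the GCS pubsub, a plain dict stands in for DataSource, and the function names and stats values are illustrative rather than Ray's actual code.

```python
import asyncio
import json


async def agent_publish(channel: asyncio.Queue, node_key: str, stats: dict) -> None:
    """Agent side: serialize one stats snapshot and publish it."""
    await channel.put((node_key, json.dumps(stats)))


async def head_poll_once(channel: asyncio.Queue, data_source: dict) -> None:
    """Head side: wait for one message and store it for the HTTP API to read."""
    key, data = await channel.get()
    data_source[key] = json.loads(data)


async def main() -> dict:
    channel = asyncio.Queue()  # stand-in for the GCS pubsub channel
    data_source = {}           # stand-in for the dashboard's DataSource
    # Illustrative snapshot; on the agent this would come from _get_all_stats.
    stats = {"workers": [{"pid": 14240, "memory_info": {"rss": 123}}]}
    await agent_publish(channel, "node-1", stats)
    await head_poll_once(channel, data_source)
    return data_source


data_source = asyncio.run(main())
print(data_source["node-1"]["workers"])
```

In this model, if the agent publishes an empty workers list, the head has nothing to attach to an actor, which would be consistent with processStats coming back as null.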

Hope that helps! Can you see if any of those points are not working correctly on a windows system?

alanwguo avatar May 23 '24 20:05 alanwguo

I think the problem is all the way down in _get_workers, which returns an empty list on Windows. It seems the only child of raylet_proc here is the agent. On Linux this is not the case: the children are all the workers plus the agent.

mattip avatar May 26 '24 11:05 mattip

Getting somewhere.

  • Windows has no num_fds in psutil's data (as found above)
  • Windows wraps the Python process in a runner, so psutil.Process.parent() returns the runner, not raylet.exe (in _get_raylet_proc). After accounting for this in both _get_raylet_proc and _get_agent_proc, I now get data from _get_workers, and it shows up in the stats from _get_all_stats and in jsonify_asdict(stats). But the dashboard still does not show the data. Maybe the PID that the dashboard is looking for is not the correct one, again because of confusion between the runner and the actual process?
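
As an illustration of the parent lookup, here is a sketch that walks up a few levels instead of trusting the immediate parent. It only relies on the parent() and name() methods that psutil.Process provides. FakeProc, find_ancestor, and the process names are stand-ins for the example, not the actual change to reporter_agent.py.

```python
class FakeProc:
    """Stand-in exposing the two psutil.Process methods used below."""

    def __init__(self, name, parent=None):
        self._name = name
        self._parent = parent

    def name(self):
        return self._name

    def parent(self):
        return self._parent


def find_ancestor(proc, wanted_names, max_hops=3):
    """Walk up the process tree until a process with a wanted name is found.

    Returns None if no such ancestor exists within max_hops levels.
    """
    current = proc.parent()
    for _ in range(max_hops):
        if current is None:
            return None
        if current.name().lower() in wanted_names:
            return current
        current = current.parent()
    return None


# Simulated Windows layout: raylet.exe -> runner -> agent.
raylet = FakeProc("raylet.exe")
runner = FakeProc("python.exe", parent=raylet)
agent = FakeProc("python.exe", parent=runner)

found = find_ancestor(agent, {"raylet", "raylet.exe"})
print(found.name())
```

The same function handles the Linux layout, where the raylet is the direct parent, because the first hop already matches.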

mattip avatar May 26 '24 11:05 mattip

Locally, the changes to reporter_agent.py in #45578 fixed the problems for me.

  • don't look for num_fds on Windows
  • take the launcher into account when traversing proc.parent() and proc.children()
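
For the downward direction, a hedged sketch of skipping over a launcher when listing the raylet's children. The rule used here (a child with exactly one child of the same executable name is treated as a launcher) is an assumption made for illustration; see #45578 for the real logic. FakeProc mimics the pid attribute and the name() and children() methods of psutil.Process.

```python
class FakeProc:
    """Stand-in exposing pid, name() and children() like psutil.Process."""

    def __init__(self, pid, name, children=()):
        self.pid = pid
        self._name = name
        self._children = list(children)

    def name(self):
        return self._name

    def children(self):
        return self._children


def unwrap_launcher(proc):
    """Return the real process if proc looks like a launcher, else proc itself."""
    kids = proc.children()
    # Assumed heuristic: a launcher has a single child with the same exe name.
    if len(kids) == 1 and kids[0].name().lower() == proc.name().lower():
        return kids[0]
    return proc


def worker_processes(raylet, agent_pid):
    """List the raylet's children, unwrapped, excluding the dashboard agent."""
    procs = [unwrap_launcher(child) for child in raylet.children()]
    return [p for p in procs if p.pid != agent_pid]


# Simulated Windows layout: each Python process sits behind a launcher.
agent = FakeProc(11, "python.exe")
worker = FakeProc(21, "python.exe")
raylet = FakeProc(1, "raylet.exe", [
    FakeProc(10, "python.exe", [agent]),
    FakeProc(20, "python.exe", [worker]),
])

print([p.pid for p in worker_processes(raylet, agent_pid=11)])
```

Without the unwrapping step, the agent's real PID never appears among the raylet's direct children on this layout, so filtering by agent_pid would not remove it and the launcher PIDs would be reported instead of the workers.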

mattip avatar May 27 '24 16:05 mattip