
[RLlib] Attribute error when trying to compute action after training Multi Agent PPO with New API Stack

Open · Dr-IceCream opened this issue 1 year ago

What happened + What you expected to happen

After training multi-agent PPO with the new API stack, following the guidance in how-to-use-the-new-api-stack, I tried to compute actions:

    saved_algorithm = Algorithm.from_checkpoint(
        checkpoint=algorithm_path,
        policy_ids={"controlled_vehicle_0", "controlled_vehicle_1"},
        policy_mapping_fn=lambda agent_id, episode, **kwargs: f"controlled_vehicle_{agent_id}",
    )
    print("saved_algorithm type:", type(saved_algorithm))
    # Evaluate the model
    obs, info = env.reset()
    print("obs:", obs)
    actions = {}
    for agent_id, agent_obs in obs.items():
        policy_id = f"controlled_vehicle_{agent_id}"
        action = saved_algorithm.get_policy(policy_id).compute_single_action(agent_obs)
        actions[agent_id] = action
    print("action", actions)

but I get the error message:

AttributeError: 'MultiAgentEnvRunner' object has no attribute 'get_policy'

I also tried other approaches, such as action = saved_algorithm.compute_single_action(agent_obs, policy_id), but I still get the same error message: AttributeError: 'MultiAgentEnvRunner' object has no attribute 'get_policy'. I have seen a similar issue in #40312; are these two issues the same?

The detailed error message is as follows:

Traceback (most recent call last):
  File "test_evaluate.py", line 151
    evaluate_agent(saved_algorithm, env)
  File "test_evaluate.py", line 112, in evaluate_agent
    action = saved_algorithm.get_policy(policy_id).compute_single_action(agent_obs)
  File "C:\Users\Ice Cream\miniconda3\envs\env_highway\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\Ice Cream\miniconda3\envs\env_highway\lib\site-packages\ray\rllib\algorithms\algorithm.py", line 2051, in get_policy
    return self.workers.local_worker().get_policy(policy_id)
AttributeError: 'MultiAgentEnvRunner' object has no attribute 'get_policy'

Before calling this method, I also printed some relevant info, and this part looks normal:

saved_algorithm type: <class 'ray.rllib.algorithms.ppo.ppo.PPO'>
saved_algorithm.get_config() <ray.rllib.algorithms.ppo.ppo.PPOConfig object at 0x0000014B0E4D4370>

printed using the following code:

    print("saved_algorithm type:", type(saved_algorithm))
    print("saved_algorithm.get_config()",saved_algorithm.get_config())

Versions / Dependencies

Ray 2.10.0
Python 3.8.18
Windows 11

Reproduction script

The code used for training is as follows:

    import os

    from ray import tune
    from ray.rllib.algorithms.ppo import PPOConfig
    from ray.rllib.env.multi_agent_env_runner import MultiAgentEnvRunner
    from ray.train import RunConfig
    from ray.tune.registry import register_env

    # create_env and folder_path are defined elsewhere in my script.
    register_env("ray_dict_highway_env", create_env)
    config = (
        PPOConfig()
        .environment(env="ray_dict_highway_env")
        .experimental(_enable_new_api_stack=True)
        .rollouts(env_runner_cls=MultiAgentEnvRunner)
        .resources(
            num_learner_workers=0,
            num_gpus_per_learner_worker=1,
            num_cpus_for_local_worker=1,
        )
        .training(model={"uses_new_env_runners": True})
        .multi_agent(
            policies={
                "controlled_vehicle_0",
                "controlled_vehicle_1",
            },
            policy_mapping_fn=lambda agent_id, episode, **kwargs: f"controlled_vehicle_{agent_id}",
        )
        .framework("torch")
    )
    current_script_directory = os.path.dirname(os.path.abspath(__file__))
    ray_result_path = os.path.join(current_script_directory, folder_path, "ray_results")
    tuner = tune.Tuner(
        "PPO",
        run_config=RunConfig(
            storage_path=ray_result_path,
            name="2-agent-PPO",
            stop={"timesteps_total": 5e5},
        ),
        param_space=config.to_dict(),
    )
    results = tuner.fit()

And the code for loading checkpoints:

from ray.rllib.algorithms.algorithm import Algorithm

algorithm_path = r"D:\DRL_Project\DRL_highway\experiments\hw-fast-ma-dict-v0_rllib-mappo\2024-04-01_01-28\ray_results\2-agent-PPO\PPO_ray_dict_highway_env_1c7ab_00000_0_2024-04-01_01-28-38\checkpoint_000000"
saved_algorithm = Algorithm.from_checkpoint(
    checkpoint=algorithm_path,
    policy_ids={"controlled_vehicle_0", "controlled_vehicle_1"},
    policy_mapping_fn=lambda agent_id, episode, **kwargs: f"controlled_vehicle_{agent_id}",
)

Issue Severity

High: It blocks me from completing my task.

Dr-IceCream avatar Apr 04 '24 11:04 Dr-IceCream

It seems I have initially solved this issue by referring to the solutions in #40312 and the comments in the ray.rllib.core.rl_module code, using the following code to replace the original:

# Evaluate the model
obs, info = env.reset()
print("obs:", obs)
actions = {}
for agent_id, agent_obs in obs.items():
    # Determine the policy ID for the current agent using the policy mapping function
    policy_id = f"controlled_vehicle_{agent_id}"
    # Compute actions for each agent
    rl_module = saved_algorithm.get_module(policy_id)
    fwd_ins = {"obs": torch.Tensor([agent_obs])}
    fwd_outputs = rl_module.forward_inference(fwd_ins)
    action_dist_class = rl_module.get_inference_action_dist_cls()
    action_dist = action_dist_class.from_logits(
        fwd_outputs["action_dist_inputs"]
    )
    action = action_dist.sample()[0].numpy()
    actions[agent_id] = action
# actions = saved_algorithm.compute_actions(obs)
print("actions: ", actions)

The output is as follows:

actions: {0: array(4, dtype=int64), 1: array(3, dtype=int64)}

It seems to be working.

However, when I used a similar approach to run evaluations over multiple episodes, the results were significantly worse than those reported during training: episode_len_mean dropped from about 29 to 7, and episode_reward_mean dropped from about 42 to 10. After recording videos, it was also clear that the agent does have a learned policy, but its performance is relatively poor. I suspect that the steps I used to compute actions directly through the rl_module differ from the ones actually used during training, but I am not clear on how actions are computed during training. Have I done something wrong in this part?
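In case it helps to pin down the difference, here is a greedy (maximum-likelihood) variant of the sampling step above that I could try for evaluation. This is only a sketch; it reuses rl_module and agent_obs from the snippet above and assumes a discrete action space.

    import torch

    # Greedy variant: take the argmax of the action logits instead of sampling.
    fwd_outputs = rl_module.forward_inference({"obs": torch.Tensor([agent_obs])})
    logits = fwd_outputs["action_dist_inputs"]
    greedy_action = torch.argmax(logits, dim=-1)[0].numpy()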

I also still wonder why saved_algorithm.get_policy(policy_id).compute_single_action(agent_obs) or saved_algorithm.compute_single_action(agent_obs, policy_id) cannot be used directly to compute actions. Does this mean that this call has a new syntax in the new API stack?

Dr-IceCream avatar Apr 05 '24 06:04 Dr-IceCream

I am facing a similar issue with the SingleAgentEnvRunner - see the screenshot attached.


sortiz-hub avatar May 18 '24 09:05 sortiz-hub

I don't know why this issue is P2; it should be P0.

vilmire avatar Jul 25 '24 17:07 vilmire

This issue still persists. Hey @simonsays1980 @sven1977, is there any plan to fix this?

grizzlybearg avatar Aug 03 '24 10:08 grizzlybearg

I'm unfortunately having this issue too:

ppo_agents = Algorithm.from_checkpoint(checkpoint=checkpoint_path)
actions = ppo_agents.compute_actions(observations=observations)

Results in:

AttributeError: 'MultiAgentEnvRunner' object has no attribute 'get_policy'

Applying a fix similar to what @Dr-IceCream did, which follows #40312, gives very degraded results, to the point where it's unusable :(

Since we've had a similar P1 issue, should this be upgraded? @simonsays1980 @sven1977

adadelta avatar Aug 25 '24 11:08 adadelta

Same error here. I'm unable to use a trained model.

ipsec avatar Sep 18 '24 11:09 ipsec

I am also seeing this error in both PPO and SAC. Is there a recommended workaround or stable commit to rollback to?

lrsheldon avatar Nov 06 '24 19:11 lrsheldon

I see this error in my single-agent setup too. Is it related to a mismatch between the new API stack and the old one, or something else? Is there no solution for this? 🙄

AlizargarDeu avatar Nov 24 '24 11:11 AlizargarDeu

I found a solution for this. I changed how actions are computed (following https://docs.ray.io/en/master/rllib/rllib-training.html), as shown below:

import pathlib

import numpy as np
import torch
from ray.rllib.core.rl_module import RLModule

# Create only the neural network (RLModule) from our checkpoint.
# best_checkpoint is the path to the training run's checkpoint directory.
rl_module = RLModule.from_checkpoint(
    pathlib.Path(best_checkpoint) / "learner_group" / "learner" / "rl_module"
)["default_policy"]

To compute actions in the rollout loop:

# env, obs, terminated, truncated, and episode_return are initialized elsewhere (e.g., after env.reset()).
while not terminated and not truncated:
    env.render()
    # Compute the next action from a batch (B=1) of observations.
    torch_obs_batch = torch.from_numpy(np.array([obs]))
    action_logits = rl_module.forward_inference({"obs": torch_obs_batch})[
        "action_dist_inputs"
    ]
    # The default RLModule used here produces action logits (from which
    # we'll have to sample an action or use the max-likelihood one).
    action_probs = torch.nn.functional.softmax(action_logits[0], dim=-1)
    action = np.random.choice(len(action_probs), p=action_probs.numpy())
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward

AlizargarDeu avatar Nov 27 '24 12:11 AlizargarDeu

There does not seem to be any incentive to fix this, as this part of the code is labelled "OldAPIStack". I am going to try to export the model using torch directly. It would probably still be nice to have some convenience function in the Algorithm class to do the export directly.
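As a first rough sketch of that export (untested; the checkpoint path and module ID below are placeholders for my actual ones):

    import torch
    from ray.rllib.algorithms.algorithm import Algorithm

    # Load the trained Algorithm and grab its RLModule, which is a torch.nn.Module
    # on the new API stack, then save its weights directly with torch.
    algo = Algorithm.from_checkpoint("/path/to/checkpoint_000000")
    module = algo.get_module("default_policy")
    torch.save(module.state_dict(), "policy_weights.pt")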

The3DWizard avatar Dec 17 '24 10:12 The3DWizard

Seeing the same issue.

Additionally, Algorithm.from_checkpoint(checkpoint) gives me AttributeError: 'Checkpoint' object has no attribute 'as_posix', whereas Algorithm.from_checkpoint(checkpoint.path) does not.
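For context, a minimal sketch of the two calls I am comparing, assuming result is a Tune result object holding a Checkpoint:

    from ray.rllib.algorithms.algorithm import Algorithm

    checkpoint = result.checkpoint  # a ray.train.Checkpoint object
    # Algorithm.from_checkpoint(checkpoint) raises the 'as_posix' AttributeError for me,
    # while passing the underlying path works:
    algo = Algorithm.from_checkpoint(checkpoint.path)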

itwasabhi avatar Dec 22 '24 17:12 itwasabhi

Unfortunately I have not been able to use torch.onnx export methods so far to export the model. I need dictionary inputs for my model and I have not found a way to use the onnx export methods with dictionary input. So the situation is a bit frustrating right now. Hope I find a solution soon. Any help is appreciated as I cannot export my model right now.

The3DWizard avatar Dec 23 '24 22:12 The3DWizard

Unfortunately I have not been able to use torch.onnx export methods so far to export the model. I need dictionary inputs for my model and I have not found a way to use the onnx export methods with dictionary input. So the situation is a bit frustrating right now. Hope I find a solution soon. Any help is appreciated as I cannot export my model right now.

Update: I was able to export my model with torch.onnx.export using Algorithm.get_module, which is part of the new API stack. The key was to use a dictionary input in the form of ({key1: value1, key2: value2, ...}, {}) for the args argument of the export function (note the second, in my case empty, dictionary). It is also important to account for batch size in the values passed via the dictionary. Here is an example that shows my use case:

import numpy as np
import torch

# game_state_multi_discrete_sizes and action_space_length come from my environment definition.
rl_module_player_white = algo.get_module("policy_white")
batch_size = 1
dummy_input = ({
    "obs": torch.tensor(
        np.array(
            [[np.random.randint(0, n) for n in game_state_multi_discrete_sizes] for _ in range(batch_size)]
        ),
        dtype=torch.float32,
    ),
    "action_mask": torch.tensor(
        np.random.randint(0, 2, (batch_size, action_space_length)), dtype=torch.float32
    ),
}, {})
torch.onnx.export(
    rl_module_player_white,               # The model to export
    dummy_input,             # Dummy input for tracing
    "./export_white/model.onnx",               # Output path
    export_params=True,      # Store parameters in the model file
    opset_version=17,        # ONNX version
    do_constant_folding=True,# Optimize constant expressions
    input_names=["obs", "action_mask"],   # Names for model inputs
    output_names=["embeddings", "output"], # Names for model outputs
)

One more hint: to properly set the output names, just check the output of your rl_module with the dummy input, so you know the exact outputs you are getting.
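For the example above, a small sketch of that check, reusing rl_module_player_white and dummy_input from the export snippet:

    # Run the module once on the dummy input and inspect the result to decide
    # which output_names to pass to torch.onnx.export.
    dummy_output = rl_module_player_white(dummy_input[0])
    print(dummy_output)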

The3DWizard avatar Jan 01 '25 11:01 The3DWizard

I am facing the same error with Ray RLlib versions 2.40.0 and 2.40.1. Is there any news on this? I can use the RLModule approach, but the Algorithm API seems simpler and more straightforward.

cleversonahum avatar Jan 23 '25 17:01 cleversonahum

@simonsays1980 @sven1977

The problem still persists (AttributeError: 'SingleAgentEnvRunner' object has no attribute 'get_policy'). Is there still no reliable fix for it?

AlizargarDeu avatar Jan 28 '25 09:01 AlizargarDeu

Having the same problem. No intention to solve it?!

AttributeError                            Traceback (most recent call last)
in <cell line: 0>()
     15 if not done[agent_id]:
     16     # Access the policy for the agent
---> 17     policy = algo.get_policy(f"policy_{agent_id[-1]}")
     18     # Compute the action using the policy
     19     actions[agent_id] = policy.compute_single_action(obs_i)

/usr/local/lib/python3.11/dist-packages/ray/rllib/algorithms/algorithm.py in get_policy(self, policy_id)
   2107             policy_id: ID of the policy to return.
   2108         """
-> 2109         return self.env_runner.get_policy(policy_id)
   2110
   2111     @PublicAPI

AttributeError: 'MultiAgentEnvRunner' object has no attribute 'get_policy'

vahidqo avatar Jan 29 '25 20:01 vahidqo

I found a solution for this. I changed how actions are computed (following https://docs.ray.io/en/master/rllib/rllib-training.html), as shown below:

from ray.rllib.core.rl_module import RLModule

# Create only the neural network (RLModule) from our checkpoint.
rl_module = RLModule.from_checkpoint(
    pathlib.Path(best_checkpoint) / "learner_group" / "learner" / "rl_module"
)["default_policy"]

To compute actions in the rollout loop:

while not terminated and not truncated:
    env.render()
    # Compute the next action from a batch (B=1) of observations.
    torch_obs_batch = torch.from_numpy(np.array([obs]))
    action_logits = rl_module.forward_inference({"obs": torch_obs_batch})[
        "action_dist_inputs"
    ]
    # The default RLModule used here produces action logits (from which
    # we'll have to sample an action or use the max-likelihood one).
    action = torch.argmax(action_logits[0]).numpy()
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward

Hi @AlizargarDeu: my saved checkpoint directory does not seem to have a learner_group subdirectory. Is that something that needs to be specified while saving the checkpoints?

Also @simonsays1980: is it possible to bump this issue to P0? Being unable to use a trained model on the latest stable version of RLlib should, imo, get a higher priority.

AdithyaRaman avatar Feb 10 '25 16:02 AdithyaRaman

Hi @AdithyaRaman. No, actually, when you use the new API stack these are created automatically in the saved training directory, somewhere in your temp directory. It looks like this: root > temp > checkpoint_tmp... > env_runners, learner_group, ... When you give it the saved directory, it will be loaded accordingly.
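A quick way to see what was written is to list the checkpoint directory, for example (a sketch with a placeholder path):

    import pathlib

    # Placeholder path: point this at your own checkpoint directory.
    ckpt = pathlib.Path("/tmp/checkpoint_000000")
    print(sorted(p.name for p in ckpt.iterdir()))  # should include learner_group, among others
    # The RLModule itself lives under learner_group/learner/rl_module:
    print(sorted(p.name for p in (ckpt / "learner_group" / "learner" / "rl_module").iterdir()))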

AlizargarDeu avatar Feb 10 '25 17:02 AlizargarDeu

Running into this issue as well on the new API stack, so +1 from me that the priority should be bumped up to P0.

notjacobjun avatar Feb 22 '25 19:02 notjacobjun

Same question here, but I have a new idea to solve it: restore the trained model, add some output to the environment, and then continue training for one more iteration. 😵‍💫😵‍💫😵‍💫

sebastian-cao avatar Feb 25 '25 07:02 sebastian-cao

Does that really work? I have been looking for viable work-arounds.

AdithyaRaman avatar Mar 03 '25 17:03 AdithyaRaman

Hi @AdithyaRaman. No, actually, when you use the new API stack these are created automatically in the saved training directory, somewhere in your temp directory. It looks like this: root > temp > checkpoint_tmp... > env_runners, learner_group, ... When you give it the saved directory, it will be loaded accordingly.

@AlizargarDeu: Got it. I am now successfully able to call rl_module.forward_inference(...) and get something back, but I am not sure it is giving me actions that I can plug into the environment. When I look at rl_module.config to check the action_space dimension, I see the output below, which says my action should be a vector of length 2. (And this is what I want to see; my environment needs an action in this specified action_space.)

RLModuleConfig(observation_space=Box(0.0, 6.0, (15,), float64), action_space=Box(-1.0, 1.0, (2,), float64), inference_only=False, learner_only=False, model_config_dict={'twin_q': True}, catalog_class=<class 'ray.rllib.algorithms.sac.sac_catalog.SACCatalog'>)

However, the output of rl_module.forward_inference({"obs": torch_obs_batch}) is a vector of length 4. I am a bit confused about how I can make use of this to get an action I can use to perform an environment step.

AdithyaRaman avatar Mar 03 '25 20:03 AdithyaRaman

I think your problem is somewhere between the trained model and your environment. Why would it give you a vector of length 4 when your environment has an action space of size 2?

Maybe this helps you for training:

# Define the PPO configuration
config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment(
        env="Your_Environment_Class_name",
        env_config={
            # your desired configuration (if needed)
        },
    )
    .training(
        lr=selected_lr,  # Learning rate
        entropy_coeff=selected_entropy,  # Encourage exploration with entropy regularization
    )
    .env_runners(num_env_runners=number_of_env)  # Number of parallel environments
)

AlizargarDeu avatar Mar 04 '25 09:03 AlizargarDeu

I think your problem is somewhere between the trained model and your environment. Why would it give you a vector of length 4 when your environment has an action space of size 2?

Maybe this helps you for training:

# Define the PPO configuration
config = (
    PPOConfig()
    .api_stack(
        enable_rl_module_and_learner=True,
        enable_env_runner_and_connector_v2=True,
    )
    .environment(
        env="Your_Environment_Class_name",
        env_config={
            # your desired configuration (if needed)
        },
    )
    .training(
        lr=selected_lr,  # Learning rate
        entropy_coeff=selected_entropy,  # Encourage exploration with entropy regularization
    )
    .env_runners(num_env_runners=number_of_env)  # Number of parallel environments
)

After several hours of staring at the screen, I figured it out: the output is a Gaussian distribution with a mean and a standard deviation for each action dimension, hence twice the number of values.
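A minimal sketch of how I read that length-4 vector for a plain diagonal Gaussian head. The (mean, log_std) ordering follows RLlib's TorchDiagGaussian as far as I can tell, and some algorithms squash the sampled action afterwards, so the safer route is still to build the distribution via rl_module.get_inference_action_dist_cls(), as shown earlier in this thread:

    import torch

    def gaussian_action(action_dist_inputs: torch.Tensor, deterministic: bool = True) -> torch.Tensor:
        """Turn a (B, 2 * act_dim) 'action_dist_inputs' tensor into a (B, act_dim) action.

        Assumes the first half is the mean and the second half the log-std.
        """
        mean, log_std = torch.chunk(action_dist_inputs, 2, dim=-1)
        if deterministic:
            return mean  # maximum-likelihood action, useful for evaluation
        return mean + torch.exp(log_std) * torch.randn_like(log_std)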

AdithyaRaman avatar Mar 05 '25 15:03 AdithyaRaman

After several hours of staring at the screen, I figured it out: the output is a Gaussian distribution with a mean and a standard deviation for each action dimension, hence twice the number of values.

Hello @AdithyaRaman,

could you please share the code you used to extract the Gaussian distribution tensor, and how you converted it into an action that the env can understand? I tried the code below, but the agent is not behaving as expected.

import pathlib
import torch
import numpy as np
import ray
from ray.tune.registry import register_env
from raylib.custom_env import SimpleEnv
from ray.rllib.core.rl_module import RLModule
from torch.distributions import Normal



# Initialize Ray
ray.init(ignore_reinit_error=True)

# Register the environment
register_env("my_custom_env", lambda config: SimpleEnv())

# Define the checkpoint directory
checkpoint_dir = "/models_sac_1"  

# Load the RLModule  from the checkpoint
rl_module = RLModule.from_checkpoint(
    pathlib.Path(checkpoint_dir) / "learner_group" / "learner" / "rl_module"
)["default_policy"]

# Test the agent in the environment
env = SimpleEnv()
obs, _ = env.reset()

done = False
truncated = False
total_reward = 0

# Run the agent in the environment
while not (done or truncated):
    env.render()  # Render the environment

    # Compute the next action using the RLModule
    torch_obs_batch = torch.from_numpy(np.array([obs])).to(torch.float32)  # Ensure dtype is float32
    action_logits = rl_module.forward_inference({"obs": torch_obs_batch})  # Perform inference
    action_dist_inputs = action_logits['action_dist_inputs']
    log_std, mean = torch.chunk(action_dist_inputs, 2, dim=-1)

    # Convert log_std to std
    std = torch.exp(log_std)

    # Sample an action from the Gaussian distribution
    action = mean + std * torch.randn_like(std)

    print(action)
    action_np = action.detach().cpu().numpy() # This will be the vector for the env.step() function

    obs, reward, done, truncated, info = env.step(action_np[0])  # Take the step in the environment
    total_reward += reward

print(f"Total reward: {total_reward}")

# Clean up
ray.shutdown()

The agent performed very well during training and always reached its destination, so I assume the problem is not my reward function. Thank you very much for your time!

Regards

alexhasenclever avatar Mar 16 '25 09:03 alexhasenclever

Does that really work? I have been looking for viable work-arounds.

Yes, it really works: continue training for one more iteration and read the result. 😵‍💫
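Roughly like this (a sketch; the checkpoint path is a placeholder and the metric keys differ between Ray versions):

    from ray.rllib.algorithms.algorithm import Algorithm

    # Restore the trained Algorithm and run one extra training iteration.
    # The environment is stepped inside the EnvRunners, so any extra output you
    # added to the env shows up during this iteration.
    algo = Algorithm.from_checkpoint("/path/to/checkpoint_000000")
    result = algo.train()
    print(result.keys())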

sebastian-cao avatar Mar 24 '25 15:03 sebastian-cao

Hello @alexhasenclever, is this the updated script that gave you good results? What was your action_space? Size 2? Discrete or continuous? I tried with 6 discrete actions and got an error!

AlizargarDeu avatar Apr 10 '25 08:04 AlizargarDeu