ARS Returns 0 actions on evaluation env
Hi! During training of the ARS model we obtain valid actions, but during evaluation the actions returned are all zeros. Both the evaluation and training environments are the same except for using different but similar data. Could this be a bug? model.predict simply returns zero-valued actions.
Our action space is defined as follows: self.action_space = gym.spaces.Box(low=-1, high=1, shape=(198,)). During evaluation, predict returns all-zero actions, but during training everything seems fine. We see this behaviour even when using exactly the same environment for training and evaluation. We use an MLP policy, and the features have the same shape during evaluation and training, so something must be handled differently between evaluation and training.
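For reference, a minimal skeleton that mimics our setup (the observation shape, dynamics and reward below are placeholders, not our real data) and can be run through the SB3 environment checker:

import gym
import numpy as np
from stable_baselines3.common.env_checker import check_env

class PlaceholderEnv(gym.Env):
    """Skeleton with the same action space; observations and dynamics are dummies."""

    def __init__(self):
        super().__init__()
        self.action_space = gym.spaces.Box(low=-1, high=1, shape=(198,), dtype=np.float32)
        # Placeholder observation space: the real feature shape is the same in train and eval
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(64,), dtype=np.float32)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        obs = self.observation_space.sample()
        reward = 0.0
        done = False
        return obs, reward, done, {}

check_env(PlaceholderEnv())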
I am seeing similar behaviour with the RecurrentPPO implementation. I believe the issue comes from model.load().
I am using a custom environment, but I have adapted the RecurrentPPO example to reproduce the bug:
import numpy as np
import torch as th
from sb3_contrib import RecurrentPPO
model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
model.learn(10000)
model.save("recurrent_ppo_issue")
# Delete model to show effect of loading
del model
# Load model and evaluate
model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
model.load("recurrent_ppo_issue")
env = model.get_env()
obs = env.reset()
num_envs = 1
lstm_states = np.concatenate([np.zeros(model.policy.lstm_shape) for _ in range(num_envs)], axis=1)
lstm_states = (lstm_states, lstm_states)
episode_starts = np.ones((num_envs,), dtype=bool)
while True:
    # Convert inputs to tensors
    obs, vectorized_env = model.policy.obs_to_tensor(obs)
    lstm_states = (
        th.tensor(lstm_states[0]).float().to(model.device),
        th.tensor(lstm_states[1]).float().to(model.device),
    )
    episode_starts = th.tensor(episode_starts).float().to(model.device)
    # Calculate action distribution
    action_dist, lstm_states = model.policy.get_distribution(obs, lstm_states, episode_starts)
    # Get deterministic action and entropy
    action = action_dist.mode().item()
    action_ent = action_dist.entropy().item()
    print(f"Action Distribution Entropy: {action_ent}")
    print(f"Action: {action}")
    # Take action
    obs, rewards, dones, info = env.step([action])
    episode_starts = dones
    env.render()
You can see that after loading the trained model, the entropy of the action distribution is always around 0.69, i.e. ln(2), the entropy of a uniform distribution over CartPole's two actions, so the loaded policy behaves as if it were untrained.
Apologies, I just found this issue: DLR-RM/stable-baselines3#683. The correct way to load a saved model is model = RecurrentPPO.load(), since load() is a classmethod that returns a new instance rather than updating the existing one - I should have read the docs more closely! :)
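For anyone who finds this later, here is a minimal sketch of the corrected loading pattern (reusing the env id and save path from the snippet above; variable names are arbitrary):

import gym
import numpy as np
from sb3_contrib import RecurrentPPO

# load() is a classmethod that returns a new model instance;
# calling model.load() on an existing instance does not update its weights.
model = RecurrentPPO.load("recurrent_ppo_issue", env=gym.make("CartPole-v1"))

env = model.get_env()
obs = env.reset()
lstm_states = None  # predict() initialises the hidden states on the first call
episode_starts = np.ones((env.num_envs,), dtype=bool)
while True:
    action, lstm_states = model.predict(
        obs, state=lstm_states, episode_start=episode_starts, deterministic=True
    )
    obs, rewards, dones, info = env.step(action)
    episode_starts = dones
    env.render()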
Hello,
> During training of the ARS model we obtain valid actions, but during evaluation the actions returned are all zeros. Both the evaluation and training environments are the same except for using different but similar data.
could you please fill in the custom env issue template (providing minimal code that mimics the inputs/outputs of your env): https://github.com/DLR-RM/stable-baselines3/blob/master/.github/ISSUE_TEMPLATE/custom_env.md
and could you also provide a minimal example that shows how you train and evaluate the model?
and did you check the behaviour with and without the deterministic flag in model.predict()?
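For reference, something along these lines would already help (a minimal sketch only, with Pendulum-v1 standing in for your custom env, an arbitrary save path, and arbitrary hyperparameters):

import gym
from sb3_contrib import ARS
from stable_baselines3.common.evaluation import evaluate_policy

# Train and save
model = ARS("MlpPolicy", "Pendulum-v1", verbose=1)
model.learn(total_timesteps=10_000)
model.save("ars_model")

# Reload with the classmethod (it returns a new instance) and evaluate
model = ARS.load("ars_model")
eval_env = gym.make("Pendulum-v1")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")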