ARS Returns 0 actions on evaluation env
Hi! During training of the ARS model we obtain valid actions, but during evaluation the actions returned are all zeros. Both the evaluation and training environments are the same except for using different but similar data. Could this be a bug? model.predict simply returns zero-valued actions.
Our action space is defined as follows: self.action_space = gym.spaces.Box(low=-1, high=1, shape=(198,)). During evaluation, predict returns all-zero actions, but during training everything seems fine. We see this behaviour even when using exactly the same environment for training and evaluation. We use an MLP policy, and the features have the same shape during evaluation and training, so something must be handled differently between evaluation and training.
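For reference, a minimal skeleton that mimics our setup (the observation shape, dynamics and reward below are placeholders, not our real data) and can be run through the SB3 environment checker:

import gym
import numpy as np
from stable_baselines3.common.env_checker import check_env

class PlaceholderEnv(gym.Env):
    """Skeleton with the same action space; observations and dynamics are dummies."""

    def __init__(self):
        super().__init__()
        self.action_space = gym.spaces.Box(low=-1, high=1, shape=(198,), dtype=np.float32)
        # Placeholder observation space: the real feature shape is the same in train and eval
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(64,), dtype=np.float32)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        obs = self.observation_space.sample()
        reward = 0.0
        done = False
        return obs, reward, done, {}

check_env(PlaceholderEnv())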
I am seeing similar behaviour with the RecurrentPPO implementation. I believe the issue comes from model.load().
I am using a custom environment, but I have adapted the RecurrentPPO example to reproduce the bug:
import numpy as np
import torch as th
from sb3_contrib import RecurrentPPO
model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
model.learn(10000)
model.save("recurrent_ppo_issue")
# Delete model to show effect of loading
del model
# Load model and evaluate
model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
model.load("recurrent_ppo_issue")
env = model.get_env()
obs = env.reset()
num_envs = 1
lstm_states = np.concatenate([np.zeros(model.policy.lstm_shape) for _ in range(num_envs)], axis=1)
lstm_states = (lstm_states, lstm_states)
episode_starts = np.ones((num_envs,), dtype=bool)
while True:
    # Convert inputs to tensors
    obs, vectorized_env = model.policy.obs_to_tensor(obs)
    lstm_states = (
        th.tensor(lstm_states[0]).float().to(model.device),
        th.tensor(lstm_states[1]).float().to(model.device),
    )
    episode_starts = th.tensor(episode_starts).float().to(model.device)
    # Calculate action distribution
    action_dist, lstm_states = model.policy.get_distribution(obs, lstm_states, episode_starts)
    # Get deterministic action and entropy
    action = action_dist.mode().item()
    action_ent = action_dist.entropy().item()
    print(f"Action Distribution Entropy: {action_ent}")
    print(f"Action: {action}")
    # Take action
    obs, rewards, dones, info = env.step([action])
    episode_starts = dones
    env.render()
You can see that after loading the trained model, the entropy of the action distribution is always around 0.69, i.e. ln(2), the entropy of a uniform distribution over CartPole's two actions, so the loaded policy behaves as if it were untrained.
Apologies, I just found this issue: DLR-RM/stable-baselines3#683. The correct way to load a saved model is model = RecurrentPPO.load(), since load() is a classmethod that returns a new instance rather than updating the existing one - I should have read the docs more closely! :)
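For anyone who finds this later, here is a minimal sketch of the corrected loading pattern (reusing the env id and save path from the snippet above; variable names are arbitrary):

import gym
import numpy as np
from sb3_contrib import RecurrentPPO

# load() is a classmethod that returns a new model instance;
# calling model.load() on an existing instance does not update its weights.
model = RecurrentPPO.load("recurrent_ppo_issue", env=gym.make("CartPole-v1"))

env = model.get_env()
obs = env.reset()
lstm_states = None  # predict() initialises the hidden states on the first call
episode_starts = np.ones((env.num_envs,), dtype=bool)
while True:
    action, lstm_states = model.predict(
        obs, state=lstm_states, episode_start=episode_starts, deterministic=True
    )
    obs, rewards, dones, info = env.step(action)
    episode_starts = dones
    env.render()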
Hello,
> During training of the ARS model we obtain valid actions, but during evaluation the actions returned are all zeros. Both the evaluation and training environments are the same except for using different but similar data.
could you please fill in the custom env issue template (providing minimal code that mimics the inputs/outputs of your env): https://github.com/DLR-RM/stable-baselines3/blob/master/.github/ISSUE_TEMPLATE/custom_env.md
and could you also provide a minimal example that shows how you train and evaluate the model?
and did you check the behaviour with and without the deterministic flag in model.predict()?
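For reference, something along these lines would already help (a minimal sketch only, with Pendulum-v1 standing in for your custom env, an arbitrary save path, and arbitrary hyperparameters):

import gym
from sb3_contrib import ARS
from stable_baselines3.common.evaluation import evaluate_policy

# Train and save
model = ARS("MlpPolicy", "Pendulum-v1", verbose=1)
model.learn(total_timesteps=10_000)
model.save("ars_model")

# Reload with the classmethod (it returns a new instance) and evaluate
model = ARS.load("ars_model")
eval_env = gym.make("Pendulum-v1")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")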