Got an unexpected keyword argument 'use_sde' when passing behavioural cloning policy to PPO from SB3
Bug description
Hello,
I want to pass the policy learned from behavioural cloning in the imitation library to PPO. I thought this would work since both are instances of the ActorCriticPolicy class, but it doesn't work as I expected.
Steps to reproduce
from stable_baselines3 import PPO
from imitation.algorithms import bc

bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    device='cuda',
    policy=bc.reconstruct_policy(policy_path, device='cuda'),
    rng=rng,
)
model = PPO(policy=bc_trainer.policy, env=env, verbose=1, device='cuda')
The error is:
Traceback (most recent call last):
File "agent/main.py", line 142, in
Environment
- Operating system and version: Ubuntu 20.04.6 LTS
- Python version: 3.8.0
- PyTorch version: 1.13.0
- imitation version: 0.4.0
- Stable-Baselines3 version: 1.8.0
- Gym version: 0.21.0
Hi, I had the same error trying to retrain a policy with PPO after behavioural cloning. The problem is actually here:
model = PPO(policy=bc_trainer.policy, env=env, verbose=1, device='cuda')
because the policy argument expects a class (or alternatively a string). The error message is unclear since it is raised by PyTorch.
So when instantiating PPO with SB3, you should pass the policy class you want to use (which should inherit from ActorCriticPolicy). For example:
from stable_baselines3.common.policies import ActorCriticPolicy

model = PPO(policy=ActorCriticPolicy, env=env, verbose=1, device='cuda')
This should work for instantiating PPO. However, I am not sure how you should load the pre-trained policy; I could not find the right way to do it in stable-baselines3 (I tried model.policy = bc_trainer.policy, but I am not sure it works properly).
Hope it helps somehow. Let me know if you find the right way to load a pre-trained policy with the PPO algorithm 👍
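One possible way to do that (a sketch, not verified here; it assumes the BC policy and the PPO policy were built with identical architectures) is to instantiate PPO with the policy class as above and then copy the trained weights over, since both policies are standard torch modules:

model = PPO(policy=ActorCriticPolicy, env=env, verbose=1, device='cuda')
# Overwrite the randomly initialised PPO weights with the BC-trained ones.
# This only works if both policies have identical parameter shapes.
model.policy.load_state_dict(bc_trainer.policy.state_dict())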
For those who stumble across this issue, the load_from_vector method seems to work:
# Load the policy saved by BC.
pretrained_policy = ActorCriticPolicy.load("/path/")
# Create a fresh PPO model with the same policy class.
model = PPO(ActorCriticPolicy, env)
# Copy the pretrained weights into PPO's freshly built policy.
model.policy.load_from_vector(pretrained_policy.parameters_to_vector())
model.learn(total_timesteps=100_000, reset_num_timesteps=False)
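A quick sanity check (a sketch, assuming env and the objects above are in scope) is to confirm that both policies now return the same deterministic action for a sample observation:

import numpy as np

obs = env.observation_space.sample()
action_bc, _ = pretrained_policy.predict(obs, deterministic=True)
action_ppo, _ = model.policy.predict(obs, deterministic=True)
assert np.allclose(action_bc, action_ppo)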
I had issues with saving and loading BC models, and the following worked for me:
from imitation.algorithms import bc
from stable_baselines3.common import policies

# Saving
bc_model = bc.BC(
    observation_space=venv.observation_space,
    action_space=venv.action_space,
    demonstrations=transitions_custom,
    rng=rng,
)
bc_model.policy.save('models/test/model.zip')

# Loading
pretrained_policy = policies.ActorCriticPolicy.load('models/test/model.zip')
bc_model_reloaded = bc.BC(
    observation_space=venv.observation_space,
    action_space=venv.action_space,
    demonstrations=transitions_custom,
    rng=rng,
    policy=pretrained_policy,
)
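To confirm the reloaded policy behaves like the saved one, one option (a sketch, assuming venv is the vectorised environment used above) is SB3's evaluate_policy helper:

from stable_baselines3.common.evaluation import evaluate_policy

# Roll out the reloaded policy for a few episodes and report its return.
mean_reward, std_reward = evaluate_policy(bc_model_reloaded.policy, venv, n_eval_episodes=10)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")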
I followed the method mentioned above by @yojul to load a BC model in SB3. But when I retrain the model in SB3, save it, and then try to reload it using PPO.load as below, I get a shape-mismatch error when copying weights. I am guessing this is due to the difference between imitation.policies.base.FeedForward32Policy and the stable_baselines3 ActorCriticPolicy.
Can @AlexGisi, @yojul or @JkAcktuator share how you overcame this issue?
pretrained_policy = policies.ActorCriticPolicy.load("imitation_bc_model.zip")
model = PPO(policy=policies.ActorCriticPolicy, env=env)
model.policy = pretrained_policy
model.learn(total_timesteps=100_000)
model.save("sb3_model.zip")
del model
model = PPO.load("sb3_model.zip")  # this throws an error
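A diagnostic sketch (assuming env and imitation_bc_model.zip from the snippet above) to see where the shapes disagree is to compare the state dicts of the BC-trained policy and the policy PPO builds by default; note that imitation's FeedForward32Policy uses two 32-unit hidden layers while SB3's default net_arch uses 64-unit layers, so a mismatch there is plausible:

from stable_baselines3 import PPO
from stable_baselines3.common import policies

bc_policy = policies.ActorCriticPolicy.load("imitation_bc_model.zip")
default_ppo = PPO(policy=policies.ActorCriticPolicy, env=env)

bc_state = bc_policy.state_dict()
ppo_state = default_ppo.policy.state_dict()
# Print every parameter whose shape differs (or that exists on one side only).
for name in sorted(set(bc_state) | set(ppo_state)):
    bc_shape = tuple(bc_state[name].shape) if name in bc_state else None
    ppo_shape = tuple(ppo_state[name].shape) if name in ppo_state else None
    if bc_shape != ppo_shape:
        print(name, bc_shape, ppo_shape)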
Hi @saeed349! Here is my code.
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

dense_rollouts = rollout.rollout(
    dense_expert,
    DummyVecEnv([lambda: RolloutInfoWrapper(dense_env)]),
    rollout.make_sample_until(min_timesteps=None, min_episodes=250),
    rng=dense_rng,
)
dense_transitions = rollout.flatten_trajectories(dense_rollouts)

dense_bc = CustomBC(
    observation_space=dense_env.observation_space,
    action_space=dense_env.action_space,
    policy=dense_expert.policy,
    demonstrations=dense_transitions,
    rng=dense_rng,
    device='cuda',
    tensorboard_log=f'/home/cai/Desktop/PILRnav/runs/dense_bc',
)
dense_bc.train(n_epochs=10)
dense_bc.policy.save("/home/cai/Desktop/PILRnav/weight/dense_bc")

dense_ppo = PPO(
    policy='MlpPolicy',
    env=dense_env,
    policy_kwargs=policy_kwargs,
    use_sde=False,
    batch_size=64,
    n_epochs=7,
    learning_rate=0.0004,
    tensorboard_log=f'/home/cai/Desktop/PILRnav/runs/dense_ppo',
    verbose=1,
)
dense_ppo.policy = dense_ppo.policy.load("/home/cai/Desktop/PILRnav/weight/dense_bc")
dense_ppo.learn(MAX_ITER)
dense_ppo.policy.save("/home/cai/Desktop/PILRnav/weight/dense_ppo")
I hope it helps.
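One alternative worth noting, tying back to the load_from_vector suggestion above (a sketch, assuming the BC policy and the PPO policy share the same architecture, e.g. via matching policy_kwargs): copy only the weights into the policy PPO built itself instead of replacing the policy object, so the model PPO saves stays consistent with its constructor arguments.

from stable_baselines3.common.policies import ActorCriticPolicy

bc_policy = ActorCriticPolicy.load("/home/cai/Desktop/PILRnav/weight/dense_bc")
# Copy the pretrained weights into the existing PPO policy in place.
dense_ppo.policy.load_from_vector(bc_policy.parameters_to_vector())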