
Some questions about recurrent-style SAC

chocolate616 opened this issue on Oct 25 '21 · 6 comments

When I tried to train Pendulum-v0 with a recurrent-style SAC, the policy didn't improve, while it worked fine with an MLP model. The training curves from TensorBoard are shown below (the red curve is the MLP run and the green curve is the LSTM run). Could you explain the reasons for this?

[Three TensorBoard screenshots of the training curves: MLP in red, LSTM in green]

chocolate616 · Oct 25 '21

Did you only change the network structure, and not other hyperparameters such as the learning rate (the reward curve is sensitive to those)? Honestly speaking, I haven't run any experiments with RNN+SAC :(

Trinkle23897 · Oct 25 '21
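
For comparison, the MLP setup that does learn (the red curve) usually looks like the following in the tianshou Pendulum SAC examples. This is only a sketch, and the args fields (hidden_sizes, actor_lr, critic_lr) are assumptions about the script being used:

import torch
from tianshou.utils.net.common import Net
from tianshou.utils.net.continuous import ActorProb, Critic

# MLP actor: a feed-forward preprocessing net followed by a Gaussian head
net_a = Net(args.state_shape, hidden_sizes=args.hidden_sizes, device=args.device)
actor = ActorProb(net_a, args.action_shape, device=args.device, unbounded=True).to(args.device)
actor_optim = torch.optim.Adam(actor.parameters(), lr=args.actor_lr)

# MLP critic: concatenates observation and action before the hidden layers
net_c1 = Net(args.state_shape, args.action_shape, hidden_sizes=args.hidden_sizes,
             concat=True, device=args.device)
critic1 = Critic(net_c1, device=args.device).to(args.device)
critic1_optim = torch.optim.Adam(critic1.parameters(), lr=args.critic_lr)

Switching to the recurrent variant should, in principle, only replace these network classes and add stack_num to the buffer, while lr, tau, gamma, and alpha stay the same.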

> Did you only change the network structure, and not other hyperparameters such as the learning rate (the reward curve is sensitive to those)? Honestly speaking, I haven't run any experiments with RNN+SAC :(

Thanks for replying! I'm sure I only changed the network structure and stack_num (it is 5 for the LSTM). When I printed the actions, I found that the output values were all the same.

In addition, I've tried recurrent-style DDPG and observed a similar phenomenon.

chocolate616 · Oct 26 '21
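
A quick way to confirm the collapse is to push a few random observation sequences through the trained actor and compare the Gaussian means it produces. A minimal sketch, assuming the observation shape stored in args.state_shape and the RecurrentActorProb defined in the snippet posted later in this thread:

import numpy as np
import torch

# eight random stacked observations of shape (batch, stack_num, obs_dim)
obs = np.random.randn(8, args.stack_num, *args.state_shape).astype(np.float32)
with torch.no_grad():
    (mu, sigma), state = actor(obs)  # RecurrentActorProb returns (mu, sigma) plus the RNN state
print("action means:", mu.cpu().numpy().squeeze())

If mu is essentially identical for very different inputs, the actor has collapsed to a constant policy rather than merely learning slowly.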

@chocolate616 Hi, would you mind posting a complete minimal example for the recurrent-style SAC? I know it wasn't working as well as you hoped, but I am having trouble trying to get mine to run at all. Any reference would be much appreciated.

HymnsForDisco · Nov 04 '21

> @chocolate616 Hi, would you mind posting a complete minimal example for the recurrent-style SAC? I know it wasn't working as well as you hoped, but I am having trouble trying to get mine to run at all. Any reference would be much appreciated.

You can refer to https://github.com/thu-ml/tianshou/blob/fc251ab0b85bf3f0de7b24c1c553cb0ec938a9ee/test/discrete/test_drqn.py

import torch

from tianshou.data import Collector, ReplayBuffer, VectorReplayBuffer
from tianshou.utils.net.continuous import RecurrentActorProb, RecurrentCritic

# nets
actor = RecurrentActorProb(
    layer_num=1, state_shape=args.state_shape, action_shape=args.action_shape,
    device=args.device, unbounded=True,
).to(args.device)
actor_optim = torch.optim.Adam(actor.parameters(), lr=args.actor_lr)
critic1 = RecurrentCritic(
    layer_num=1, state_shape=args.state_shape, action_shape=args.action_shape,
    device=args.device,
).to(args.device)
critic1_optim = torch.optim.Adam(critic1.parameters(), lr=args.critic_lr)
critic2 = RecurrentCritic(
    layer_num=1, state_shape=args.state_shape, action_shape=args.action_shape,
    device=args.device,
).to(args.device)
critic2_optim = torch.optim.Adam(critic2.parameters(), lr=args.critic_lr)

# replay buffer and collectors
if args.training_num > 1:
    # stack_num is the length of the observation sequence fed to the RNN
    buffer = VectorReplayBuffer(args.buffer_size, len(train_envs), stack_num=args.stack_num)
else:
    buffer = ReplayBuffer(args.buffer_size, stack_num=args.stack_num)

train_collector = Collector(policy, train_envs, buffer, exploration_noise=True)
test_collector = Collector(policy, test_envs)

chocolate616 · Nov 12 '21
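
The snippet above uses policy before it is defined. For completeness, here is a sketch of the missing piece, following the constructor used in the tianshou 0.4.x example scripts; env and the args fields such as tau, gamma, alpha, n_step, and the trainer settings are assumed to come from the rest of the script:

from tianshou.policy import SACPolicy
from tianshou.trainer import offpolicy_trainer

# build the SAC policy from the recurrent actor/critics defined above
policy = SACPolicy(
    actor, actor_optim, critic1, critic1_optim, critic2, critic2_optim,
    tau=args.tau, gamma=args.gamma, alpha=args.alpha,
    estimation_step=args.n_step, action_space=env.action_space,
)

# standard off-policy training loop
result = offpolicy_trainer(
    policy, train_collector, test_collector,
    args.epoch, args.step_per_epoch, args.step_per_collect,
    args.test_num, args.batch_size, update_per_step=args.update_per_step,
)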

We used RNN+PPO in our custom environment and ran into the same problem chocolate616 described. We can't figure out what the cause is, but it seems something is wrong with the recurrent-style policies, because the non-RNN policy works well. We tried to fix it ourselves but couldn't. We hope you can look into this problem.

Apart from that, you have done a great job improving training speed and precision; thank you for your great work!

fengye4242 · Nov 15 '21
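
One sanity check that applies to both the SAC and PPO runs (a suggestion, not a confirmed diagnosis): verify that stack_num actually adds the time dimension the recurrent networks expect when batches are sampled from the buffer. Something along these lines, assuming the buffer and collector from the snippet posted earlier:

# after collecting some random transitions, a sampled batch of observations
# should have shape (batch_size, stack_num, *state_shape) when stack_num > 1
train_collector.collect(n_step=1000, random=True)
batch, indices = buffer.sample(64)
print(batch.obs.shape)

Also note that, as far as I understand the implementation, the recurrent networks are updated on these stack_num-frame slices with a freshly initialized hidden state, so the effective memory during training is limited to stack_num steps.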

Has anyone gotten any RNN-based algo to learn correctly?

BFAnas · Mar 04 '22