
Some questions about recurrent-style SAC

chocolate616 opened this issue on Oct 25 '21 · 6 comments

When I tried to train Pendulum-v0 with a recurrent-style SAC, the policy didn't improve, while it worked fine with an MLP model. The training curves from TensorBoard are shown below (the red curve is the MLP run and the green curve is the LSTM run). Could you explain the reasons for this?

[Three TensorBoard screenshots of the training curves: MLP in red, LSTM in green]

chocolate616 · Oct 25 '21

Did you only change the network structure, and not other hyperparameters such as the learning rate (the reward curve is sensitive to those)? Honestly speaking, I haven't run any experiments with RNN+SAC :(

Trinkle23897 · Oct 25 '21
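
For comparison, the MLP setup that does learn (the red curve) usually looks like the following in the tianshou Pendulum SAC examples. This is only a sketch, and the args fields (hidden_sizes, actor_lr, critic_lr) are assumptions about the script being used:

import torch
from tianshou.utils.net.common import Net
from tianshou.utils.net.continuous import ActorProb, Critic

# MLP actor: a feed-forward preprocessing net followed by a Gaussian head
net_a = Net(args.state_shape, hidden_sizes=args.hidden_sizes, device=args.device)
actor = ActorProb(net_a, args.action_shape, device=args.device, unbounded=True).to(args.device)
actor_optim = torch.optim.Adam(actor.parameters(), lr=args.actor_lr)

# MLP critic: concatenates observation and action before the hidden layers
net_c1 = Net(args.state_shape, args.action_shape, hidden_sizes=args.hidden_sizes,
             concat=True, device=args.device)
critic1 = Critic(net_c1, device=args.device).to(args.device)
critic1_optim = torch.optim.Adam(critic1.parameters(), lr=args.critic_lr)

Switching to the recurrent variant should, in principle, only replace these network classes and add stack_num to the buffer, while lr, tau, gamma, and alpha stay the same.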

> Did you only change the network structure, and not other hyperparameters such as the learning rate (the reward curve is sensitive to those)? Honestly speaking, I haven't run any experiments with RNN+SAC :(

Thanks for replying! I'm sure I only changed the network structure and stack_num (it is 5 for the LSTM). When I printed the actions, I found that the output values were all the same.

In addition, I've tried recurrent-style DDPG and observed a similar phenomenon.

chocolate616 · Oct 26 '21
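
A quick way to confirm the collapse is to push a few random observation sequences through the trained actor and compare the Gaussian means it produces. A minimal sketch, assuming the observation shape stored in args.state_shape and the RecurrentActorProb defined in the snippet posted later in this thread:

import numpy as np
import torch

# eight random stacked observations of shape (batch, stack_num, obs_dim)
obs = np.random.randn(8, args.stack_num, *args.state_shape).astype(np.float32)
with torch.no_grad():
    (mu, sigma), state = actor(obs)  # RecurrentActorProb returns (mu, sigma) plus the RNN state
print("action means:", mu.cpu().numpy().squeeze())

If mu is essentially identical for very different inputs, the actor has collapsed to a constant policy rather than merely learning slowly.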

@chocolate616 Hi, would you mind posting a complete minimal example for the recurrent-style SAC? I know it wasn't working as well as you hoped, but I am having trouble trying to get mine to run at all. Any reference would be much appreciated.

HymnsForDisco · Nov 04 '21

> @chocolate616 Hi, would you mind posting a complete minimal example for the recurrent-style SAC? I know it wasn't working as well as you hoped, but I am having trouble trying to get mine to run at all. Any reference would be much appreciated.

You can refer to https://github.com/thu-ml/tianshou/blob/fc251ab0b85bf3f0de7b24c1c553cb0ec938a9ee/test/discrete/test_drqn.py

import torch

from tianshou.data import Collector, ReplayBuffer, VectorReplayBuffer
from tianshou.utils.net.continuous import RecurrentActorProb, RecurrentCritic

# nets
actor = RecurrentActorProb(
    layer_num=1, state_shape=args.state_shape, action_shape=args.action_shape,
    device=args.device, unbounded=True,
).to(args.device)
actor_optim = torch.optim.Adam(actor.parameters(), lr=args.actor_lr)
critic1 = RecurrentCritic(
    layer_num=1, state_shape=args.state_shape, action_shape=args.action_shape,
    device=args.device,
).to(args.device)
critic1_optim = torch.optim.Adam(critic1.parameters(), lr=args.critic_lr)
critic2 = RecurrentCritic(
    layer_num=1, state_shape=args.state_shape, action_shape=args.action_shape,
    device=args.device,
).to(args.device)
critic2_optim = torch.optim.Adam(critic2.parameters(), lr=args.critic_lr)

# replay buffer and collectors
if args.training_num > 1:
    # stack_num is the length of the observation sequence fed to the RNN
    buffer = VectorReplayBuffer(args.buffer_size, len(train_envs), stack_num=args.stack_num)
else:
    buffer = ReplayBuffer(args.buffer_size, stack_num=args.stack_num)

train_collector = Collector(policy, train_envs, buffer, exploration_noise=True)
test_collector = Collector(policy, test_envs)

chocolate616 · Nov 12 '21
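
The snippet above uses policy before it is defined. For completeness, here is a sketch of the missing piece, following the constructor used in the tianshou 0.4.x example scripts; env and the args fields such as tau, gamma, alpha, n_step, and the trainer settings are assumed to come from the rest of the script:

from tianshou.policy import SACPolicy
from tianshou.trainer import offpolicy_trainer

# build the SAC policy from the recurrent actor/critics defined above
policy = SACPolicy(
    actor, actor_optim, critic1, critic1_optim, critic2, critic2_optim,
    tau=args.tau, gamma=args.gamma, alpha=args.alpha,
    estimation_step=args.n_step, action_space=env.action_space,
)

# standard off-policy training loop
result = offpolicy_trainer(
    policy, train_collector, test_collector,
    args.epoch, args.step_per_epoch, args.step_per_collect,
    args.test_num, args.batch_size, update_per_step=args.update_per_step,
)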

We used RNN+PPO in our custom environment and ran into the same problem chocolate616 described. We can't figure out what the cause is, but it seems something is wrong with the recurrent-style policies, because the non-RNN policy works well. We tried to fix it ourselves but couldn't. We hope you can look into this problem.

Apart from that, you have done a great job improving training speed and precision; thank you for your great work!

fengye4242 · Nov 15 '21
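
One sanity check that applies to both the SAC and PPO runs (a suggestion, not a confirmed diagnosis): verify that stack_num actually adds the time dimension the recurrent networks expect when batches are sampled from the buffer. Something along these lines, assuming the buffer and collector from the snippet posted earlier:

# after collecting some random transitions, a sampled batch of observations
# should have shape (batch_size, stack_num, *state_shape) when stack_num > 1
train_collector.collect(n_step=1000, random=True)
batch, indices = buffer.sample(64)
print(batch.obs.shape)

Also note that, as far as I understand the implementation, the recurrent networks are updated on these stack_num-frame slices with a freshly initialized hidden state, so the effective memory during training is limited to stack_num steps.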

Has anyone gotten any RNN-based algo to learn correctly?

BFAnas · Mar 04 '22