Suboptimal policy

Open xli4217 opened this issue 7 years ago • 1 comments

I'm trying SQL on a simple manipulator reaching task. The agent quickly learns to get to the vicinity of the goal, but then the learning curve plateaus and the agent never quite gets to the goal. Some of my hyperparameters are:

  • policy learning rate 0.0005
  • Q learning rate 0.001
  • reward scale 20
  • alpha 1.0

Is there something I can do to improve this? Thanks.

xli4217, Nov 15 '18 22:11

SQL learns maximum entropy policies, so the optimal policy under its objective is stochastic, which is why the agent does not settle exactly on the goal. You can try, for example, annealing the temperature to zero, or shaping the reward function so that the reward is much larger in the vicinity of the goal.

haarnoja, Nov 16 '18 17:11
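
For readers who want to try the two suggestions above, here is a minimal sketch of what they could look like. It is not code from the softqlearning repository. The function names (`annealed_alpha`, `shaped_reward`), the linear schedule, the Gaussian-shaped bonus, and all default values are assumptions made for illustration. How the annealed value is passed to the training loop depends on the implementation and is not shown. The default final temperature is a small positive number rather than exactly zero, on the assumption that the implementation divides by the temperature somewhere.

```python
import math


def annealed_alpha(step, total_steps, alpha_start=1.0, alpha_end=0.01):
    """Linearly interpolate the temperature from alpha_start to alpha_end.

    Hypothetical helper: the schedule shape and defaults are assumptions.
    """
    # Clamp training progress to [0, 1] so the value stays at alpha_end
    # once total_steps has been reached.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)


def shaped_reward(distance, bonus=10.0, radius=0.05):
    """Negative distance to the goal plus a bonus concentrated near the goal.

    Hypothetical helper: `distance` is the end-effector-to-goal distance,
    `radius` sets how close the agent must be before the bonus matters,
    and `bonus` is the extra reward received exactly at the goal.
    """
    return -distance + bonus * math.exp(-((distance / radius) ** 2))


if __name__ == "__main__":
    # Temperature at a few points during a 1000-step run.
    for step in (0, 500, 1000):
        print(step, annealed_alpha(step, total_steps=1000))

    # Reward far from, near, and at the goal.
    for d in (1.0, 0.05, 0.0):
        print(d, shaped_reward(d))
```

With this kind of shaping, the reward difference between "near the goal" and "at the goal" becomes large relative to the entropy term, so the policy has more incentive to close the remaining gap. The values of `bonus` and `radius` would need tuning for the task.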