
SAC Hyperparameters MountainCarContinuous-v0 - Env with deceptive reward


Hello,

I've tried in vain to find suitable hyperparameters for SAC in order to solve MountainCarContinuous-v0.

Even with hyperparameter tuning (see the "add-trpo" branch of rl baselines zoo), I was not able to solve it consistently: if random exploration happens to find the goal, then it works; otherwise, it gets stuck in a local minimum. I also ran into this issue when trying SAC on another environment with a deceptive reward (a bit-flipping env, trying to apply HER + SAC, see here).

Did you manage to solve that problem? If so, what hyperparameters did you use?

Note: I am using the SAC implementation from stable-baselines, which works pretty well on all other problems (where the reward is dense, though).

araffin avatar Apr 20 '19 13:04 araffin

Hey @araffin, thanks for opening this issue! We've actually observed very similar reward-related problems with SAC recently. I don't remember ever running MountainCarContinuous-v0 myself, so I can't say whether I would expect that particular task to work out of the box, but I can pretty consistently reproduce a similar issue where adding a constant scalar to the rewards makes SAC learn much slower and, in some special cases, get stuck in a local minimum and never solve the task at all.

Here's an example of a simple experiment with the (simulated) screw manipulation environment that we used in [1], where adding different constants to the reward results in extremely different performance:

[figure: learning curves for different constant reward offsets; lower is better]
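
For anyone who wants to reproduce this kind of experiment on a standard gym task, here is a minimal sketch of a reward-offset wrapper (the class name and the offset value are placeholders, not the setup used for the figure above):

```python
import gym


class RewardOffsetWrapper(gym.RewardWrapper):
    """Adds a constant offset to every reward, leaving the dynamics untouched."""

    def __init__(self, env, offset=0.0):
        super().__init__(env)
        self.offset = offset

    def reward(self, reward):
        return reward + self.offset


# Example: same task, but every step's reward is shifted by -10.
env = RewardOffsetWrapper(gym.make("MountainCarContinuous-v0"), offset=-10.0)
```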

Another thing we have noticed is that, in the sparse-reward setting, there's a noticeable difference between setting the non-success/success rewards to -1/0 vs. 0/1.

I've tried to alleviate the problems with the obvious solutions, such as simply normalizing the rewards in the environments, but none of the simple solutions seem to have the desired effect in general. For example, normalizing returns seemed to help in some cases but then failed in others.
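
To be concrete about what I mean by "simply normalizing", here is a sketch of the kind of running mean/std reward normalization I have in mind (a Welford-style update; not the exact code we ran):

```python
import numpy as np
import gym


class RunningRewardNormalizer(gym.RewardWrapper):
    """Normalizes rewards with running statistics (Welford-style updates)."""

    def __init__(self, env, epsilon=1e-8):
        super().__init__(env)
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.epsilon = epsilon

    def reward(self, reward):
        # Update the running mean/variance with the new sample.
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.epsilon
        # Center and scale; whether or not to subtract the mean changes the
        # reward offset, which is exactly the subtle part discussed above.
        return (reward - self.mean) / std
```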

One thing I briefly looked into that seemed promising, is the POP-ART normalization [2]. I have a simple prototype of it implemented at https://github.com/hartikainen/softlearning/compare/master...hartikainen:experiment/claw-costs-test-pop-art, however, there seems to be something wrong in the implementation because it completely breaks the algorithm even in simple cases. I probably don't have too much time to look into this at least in the next few weeks, but if you are (or anyone else is) interested in testing this out, I'd be happy to help e.g. with reproducing the problem.
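
For reference, the core idea of POP-ART [2] is to regress a normalized value, read out the unnormalized value as sigma * v_norm + mu, and rescale the last linear layer whenever (mu, sigma) are updated so that the unnormalized outputs are preserved. A small numpy sketch of that output-preserving step (this is the idea from the paper, not the TF prototype linked above):

```python
import numpy as np


def pop_art_rescale(w, b, mu, sigma, new_mu, new_sigma):
    """Rescale the last linear layer so that sigma * (w @ h + b) + mu stays
    unchanged when the statistics move from (mu, sigma) to (new_mu, new_sigma)."""
    w = w * (sigma / new_sigma)
    b = (sigma * b + mu - new_mu) / new_sigma
    return w, b


# The value targets are then regressed in normalized space:
#   normalized_target = (target - new_mu) / new_sigma
```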

PS. Awesome job implementing SAC in baselines! I was planning to do it myself at some point last year, but you got it out much faster :smile:

[1] Haarnoja, Tuomas, et al. "Soft Actor-Critic Algorithms and Applications." arXiv preprint arXiv:1812.05905, 2018. https://arxiv.org/abs/1812.05905
[2] van Hasselt, Hado P., et al. "Learning Values Across Many Orders of Magnitude." Advances in Neural Information Processing Systems, 2016, pp. 4287-4295. http://papers.nips.cc/paper/6076-learning-values-across-many-orders-of-magnitude.pdf

cc @avisingh599

hartikainen avatar Apr 20 '19 22:04 hartikainen

Hi @hartikainen ,

I finally managed to make it work on MountainCarContinuous by adding additional noise to the actions of the behavior policy, in the same fashion as DDPG does. However, this did not solve my other problems ^^
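
Concretely, something along these lines: a minimal sketch of adding extra exploration noise on top of the (already stochastic) SAC action before stepping the environment. Gaussian noise is used here for simplicity and the scale is a placeholder, not the tuned value:

```python
import numpy as np


def noisy_behavior_action(policy_action, action_space, noise_std=0.5):
    """Add DDPG-style exploration noise to the sampled SAC action and clip
    it back into the valid action range."""
    noise = np.random.normal(0.0, noise_std, size=policy_action.shape)
    return np.clip(policy_action + noise, action_space.low, action_space.high)
```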

Another thing we have noticed is that, in the sparse-reward setting, there's a noticeable difference between setting the non-success/success rewards to -1/0 vs. 0/1.

Interesting, I tested and saw the same behavior you described. With sparse rewards, SAC + HER manages to find the optimum in the 0/1 setting whereas it fails in the -1/0 setting.

Do you have an idea of where this issue with the reward offset may come from?

such as simply normalizing the rewards in the environments

How did you try to normalize them exactly? (I also tried that using running averages, but it did not help.)

One thing I briefly looked into that seemed promising is POP-ART normalization

I think I have to take a look at that paper; it's not the first time I've seen it mentioned. I will maybe try that later (I'm focusing on finishing and testing the HER re-implementation for stable-baselines right now).

PS. Awesome job implementing SAC in baselines! I was planning to do it myself at some point last year, but you got it out much faster

You're welcome =) In fact, the release of Spinning Up and this repo accelerated the implementation.

araffin avatar Apr 21 '19 10:04 araffin

Update: DDPG seems to suffer from the same issue with sparse rewards, but the other way around: it works in the -1/0 setting and fails in the 0/1 one. Using return normalization / POP-ART did not help :/

araffin avatar Apr 21 '19 11:04 araffin

sac.zip

Hi, I've been trying to implement this for keras-rl but have not managed to get it to work. I'm not sure if there's an error in my code or if it's the environments/rewards I am testing on; so far I have only tested on Pendulum, MountainCar, LunarLander, and BipedalWalker.

With sparse rewards, SAC + HER manages to find the optimum in the 0/1 setting whereas it fails in the -1/0 setting.

I'm not sure if it's related to my implementation, but I do see that my critic values go negative very fast: since the actions chosen at the start always give negative rewards, the critic's estimates become increasingly negative, those estimates feed back through the discounted bootstrap term in the critic's target, and the values spiral further downward, and so on. It seems like this happens before the action space is sufficiently explored, so the agent never finds good actions. This won't be the case in a 0/1 setting, since the same feedback would push values in the "good" direction, reinforcing good behaviour. Let me know your thoughts on this.
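
To make this concrete: before the goal is ever reached, the reward is a constant c at every step, so the bootstrapped target keeps folding that constant back in. Ignoring the entropy term, the backup has a simple fixed point, which means the downward spiral is really a drift toward a large (but bounded) negative value rather than a true explosion:

```latex
% Bellman backup with a constant per-step reward c (entropy term ignored):
%   Q(s, a) \leftarrow c + \gamma \, Q(s', a')
% Repeated backups converge to the geometric-series fixed point:
\[
  Q^{*} = \sum_{t=0}^{\infty} \gamma^{t} c = \frac{c}{1 - \gamma},
  \qquad
  \frac{-1}{1 - 0.99} = -100,
  \qquad
  \frac{0}{1 - 0.99} = 0 .
\]
% So in the -1/0 setting the critic is pulled toward -100 before any success
% is observed, while in the 0/1 setting it stays near 0.
```

The two conventions therefore put the critic in very different regions of value space before exploration has paid off, which matches the -1/0 vs. 0/1 asymmetry discussed above.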

yujia21 avatar May 27 '19 03:05 yujia21

Same issue here. Any thoughts?

Since it seems like an exploration problem, I'm currently trying to tune the temperature parameter $\alpha$. A larger $\alpha$ leads to better exploration in some cases, but not always; it also leads to a worse final reward in the cases that do converge.

ritou11 avatar Mar 25 '20 04:03 ritou11

I finally managed to make it work on MountainCarContinuous by adding additional noise to the actions of the behavior policy, in the same fashion as DDPG does.

Do you remember the scale of the additional noise? Or do you have any other ideas? Thanks!

ritou11 avatar Mar 25 '20 04:03 ritou11

Hello,

You can find working hyperparameters in the rl zoo; the noise standard deviation is quite high (0.5, compared to the "classic" values of 0.1-0.2 normally used).
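
For readers who just want the gist, a rough sketch of what that setup could look like with stable-baselines; only the 0.5 sigma is taken from the comment above, the remaining arguments and the training budget are placeholders, and the zoo's yaml config is the source of truth:

```python
import gym
import numpy as np

from stable_baselines import SAC
from stable_baselines.common.noise import OrnsteinUhlenbeckActionNoise

env = gym.make("MountainCarContinuous-v0")
n_actions = env.action_space.shape[0]

# OU noise with a large sigma (0.5) to push exploration past the deceptive local minimum.
action_noise = OrnsteinUhlenbeckActionNoise(
    mean=np.zeros(n_actions), sigma=0.5 * np.ones(n_actions))

model = SAC("MlpPolicy", env, action_noise=action_noise, ent_coef="auto", verbose=1)
model.learn(total_timesteps=50000)  # arbitrary budget, not the zoo's value
```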

araffin avatar Mar 25 '20 09:03 araffin

You can find working hyperparameters in the rl zoo; the noise standard deviation is quite high (0.5, compared to the "classic" values of 0.1-0.2 normally used).

Hi @araffin ,

Nice work there! I noticed you're using an automatic ent_coef together with OU noise of 0.5 to improve exploration. After a day of trying, I finally understand the sparse-reward difficulty here and why you called the reward "deceptive". In your previous reply:

Interesting, I tested and saw the same behavior you described. With sparse rewards, SAC + HER manages to find the optimum in the 0/1 setting whereas it fails in the -1/0 setting.

Did you mean MountainCarContinuous-v0 could be solved by SAC + HER? (Or does it just solve other sparse-reward envs, and we have to use action noise to explore here?)

ritou11 avatar Mar 25 '20 10:03 ritou11

automatic ent_coef

This is just for convenience; the external noise scale is what makes things work.

Did you mean MountainCarContinuous-v0 could be solved by SAC + HER?

Ah no, I was talking about environments tailored for HER.

araffin avatar Mar 25 '20 10:03 araffin

Ah no, I was talking about environments tailored for HER.

That's too bad. I've only briefly read about HER and was hoping it would solve this.

But if we know the goal, like HER does, we could use reward shaping to lead the agent. In my limited number of experiments, an extra reward of 0.1 * abs(goal - position) made SAC explore better. However, reward shaping changes the objective and could keep the agent from exploring in other directions, so I guess the improvement I saw was a coincidence...
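
For reference, the shaping term described above written as a gym wrapper; the 0.1 scale and the abs(goal - position) bonus are taken from the comment, everything else is a placeholder:

```python
import gym
import numpy as np


class ShapedRewardWrapper(gym.Wrapper):
    """Adds the 0.1 * abs(goal - position) bonus described above at every step."""

    def __init__(self, env):
        super().__init__(env)
        # MountainCarContinuous exposes its goal x-position; attribute name
        # assumed from the gym source.
        self.goal_position = env.unwrapped.goal_position

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        position = obs[0]  # first observation dimension is the car position
        reward += 0.1 * np.abs(self.goal_position - position)
        return obs, reward, done, info
```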

Another thought is that we could increase the weight of less-observed state-action pairs during training (something like a continuous Monte Carlo tree search). I'll search for related papers and hopefully try this idea when I finish the project I have at hand.

ritou11 avatar Mar 25 '20 11:03 ritou11