
Doesn't work for continuous_mountain_car

Open joyousrabbit opened this issue 7 years ago • 6 comments

Hello, the algorithm doesn't work for continuous_mountain_car, because its per-step reward is -pow(action[0],2)*0.1. This means the car's initial state is a local reward maximum: any exploration decreases the reward, so the policy cannot evolve.

Of course, if the car happens to reach the final solution in a single try, it will work. But the probability of that is negligible.
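To make the trap concrete, here is a rough sketch (not from the starter code; it assumes the standard MountainCarContinuous-v0 reward of roughly -0.1 * action^2 per step plus a large bonus only at the flag, and the old gym API with env.seed and a 4-tuple step return):

```python
# Sketch: why "do nothing" is a local optimum in MountainCarContinuous-v0.
# Assumes per-step reward ~ -0.1 * action^2 and a ~+100 bonus at the goal.
import gym
import numpy as np

env = gym.make("MountainCarContinuous-v0")

def episode_return(scale, seed=0):
    """Total reward for a policy that outputs Gaussian noise of a given scale."""
    rng = np.random.RandomState(seed)
    env.seed(seed)
    obs, total, done = env.reset(), 0.0, False
    while not done:
        action = rng.randn(1) * scale            # scale=0.0 -> the "do nothing" policy
        obs, reward, done, _ = env.step(np.clip(action, -1.0, 1.0))
        total += reward
    return total

print("do nothing:", episode_return(0.0))        # 0.0 -- no action penalty, no goal
print("small noise:", episode_return(0.1))       # negative unless the goal is hit by luck
```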

How do you handle this kind of local-maximum initial state?

joyousrabbit avatar May 10 '17 17:05 joyousrabbit

What do you mean by "Of course, if the car can explore the final solution by one try, it will work"? I think that even if it finds a good solution (reaching the final state) by accident, the update to the weights will be too small anyway, as most of the population will want to keep the "do nothing" policy. Correct me if I'm wrong, but I think that for this experiment you would have to change the way the policy weights are updated so that much better results get more weight and the rest are ignored, and you would have to increase the noise so that it becomes possible to find a good policy by adding noise to a policy that does nothing. A rough illustration of the first idea is sketched below.

This example is quite hard. I managed to get good results for the discrete version (MountainCar-v0) but had no success with this one.
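For example, a hypothetical fitness shaping along these lines (this is not what the starter code does, just an illustration of "give more value to much better results and ignore the rest"):

```python
# Hypothetical alternative to centered-rank fitness shaping: keep only the top-k
# returns and zero out the rest, so a single goal-reaching rollout dominates the
# update instead of being averaged away.
import numpy as np

def top_k_weights(returns, k=5):
    """Weights that ignore everything except the k best rollouts."""
    returns = np.asarray(returns, dtype=np.float32)
    weights = np.zeros_like(returns)
    top = np.argsort(returns)[-k:]       # indices of the k largest returns
    weights[top] = 1.0 / k
    return weights

# Usage: step = learning_rate / sigma * sum_i weights[i] * noise[i]
```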

PatrykChrabaszcz avatar May 11 '17 14:05 PatrykChrabaszcz

@PatrykChrabaszcz Hello, once the solution is found, the new weights will quickly all be based on that solution.

joyousrabbit avatar May 11 '17 14:05 joyousrabbit

I don't see how one proper solution would drag the weights of the current policy enough to make it more probable to draw more policies that reach the final state in the next generation (for this environment). The influence from policies doing nothing will be much bigger under the current default updating rule.

Maybe you mean initializing the current policy (by accident) such that a big part of the first population reaches the goal state.

PatrykChrabaszcz avatar May 11 '17 15:05 PatrykChrabaszcz

@PatrykChrabaszcz No, whenever it reaches the goal state, the influence on the following biased random weights will be big and immediate, because its reward is huge compared with the other candidates that do nothing.

joyousrabbit avatar May 11 '17 15:05 joyousrabbit

The reward might be huge, but by default, if I understand correctly, it uses a weighted average to update the parameters, and the weights in this average are the centered ranks in [-0.5, 0.5]. So if there is only one good solution in the population it will be counted as 0.5, but the next one (assuming, for example, a population of size 100) will be counted as about 0.49. That's why I said you could change the way those weights are computed so that this good solution gets higher importance. Am I right?
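For reference, the centered-rank transform I mean looks roughly like this (a sketch from memory; it may differ in detail from es_distributed/es.py):

```python
# Sketch of centered-rank fitness shaping: returns are replaced by their ranks,
# rescaled to [-0.5, 0.5], so the single best rollout gets weight 0.5 and the
# second best gets 0.5 - 1/(n-1), no matter how much larger its raw return was.
import numpy as np

def compute_centered_ranks(returns):
    returns = np.asarray(returns)
    ranks = np.empty(len(returns), dtype=np.float32)
    ranks[returns.argsort()] = np.arange(len(returns))
    return ranks / (len(returns) - 1) - 0.5

returns = np.array([100.0] + [0.0] * 99)     # one goal-reaching rollout out of 100
w = compute_centered_ranks(returns)
print(w.max(), np.sort(w)[-2])               # 0.5 and ~0.49, as described above
```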

PatrykChrabaszcz avatar May 11 '17 17:05 PatrykChrabaszcz

It's not an average. It's only based on (R_positive_rank - R_negative_rank)/number_of_rewards, so the huge reward counts as 1 and the tiny rewards as about 0.0000001. They are independent.
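My understanding of that update, as a sketch (the exact rank computation and normalization in the starter code may differ):

```python
# Sketch of the antithetic (mirrored-noise) ES update being discussed: each noise
# vector eps_i is evaluated as theta + sigma*eps_i and theta - sigma*eps_i, and
# the step is proportional to sum_i (rank(R_i+) - rank(R_i-)) * eps_i / n.
import numpy as np

def es_update(theta, noise, returns_pos, returns_neg, lr=0.01, sigma=0.02):
    # centered ranks computed jointly over both mirrored directions
    all_returns = np.concatenate([returns_pos, returns_neg])
    ranks = np.empty(len(all_returns), dtype=np.float32)
    ranks[all_returns.argsort()] = np.arange(len(all_returns))
    ranks = ranks / (len(all_returns) - 1) - 0.5
    r_pos, r_neg = ranks[:len(returns_pos)], ranks[len(returns_pos):]
    grad = (r_pos - r_neg) @ noise / len(returns_pos)   # noise has shape (n, dim)
    return theta + lr * grad / sigma
```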


joyousrabbit avatar May 11 '17 17:05 joyousrabbit