
Failing to converge with increase in grid-size (Grid World)

Open akileshbadrinaaraayanan opened this issue 7 years ago • 5 comments

If I increase both HEIGHT and WIDTH from 5 to 10 while keeping the obstacles and the final goal at the same positions, the Deep SARSA network doesn't seem to converge. What do you think the problem is? Should I increase the depth or the dimensions of the hidden layers in the actor and critic networks?

Thanks, Akilesh

akileshbadrinaaraayanan avatar Jun 27 '17 09:06 akileshbadrinaaraayanan

Hi,

I was running experiments with an increased grid size, and in some cases the action probability values become so skewed that one particular value is almost one and the rest are very small (on the order of 1e-20). This leads to zero cross-entropy loss, and the agent basically gets stuck (say it's at the top of the grid and the action probability for UP is close to 1).

Any suggestions on how to overcome such situations?

akileshbadrinaaraayanan avatar Jul 11 '17 18:07 akileshbadrinaaraayanan

@Hyeokreal might be able to answer that for ya

keon avatar Jul 11 '17 21:07 keon

Which algorithm are you using right now on the increased grid world?

dnddnjs avatar Jul 12 '17 00:07 dnddnjs

I am using A2C.

Cross entropy becomes zero because: say action_prob = [p1, p2] (where p1 is on the order of 10^(-20), i.e. close to 0, and p2 is close to 1) and advantages = [0, advantage]; then the cross-entropy calculation becomes sum(log(action_prob) * advantages) = log(p2) * advantage ≈ log(1) * advantage = 0.
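To make the arithmetic concrete, here is a minimal NumPy sketch of that degenerate case (the numbers are made up for illustration, this is not code from the repo):

```python
import numpy as np

# Hypothetical values illustrating the degenerate case described above.
action_prob = np.array([1e-20, 1.0 - 1e-20])  # actor output, almost one-hot
advantage = -0.5                               # advantage estimate (negative here)
advantages = np.array([0.0, advantage])        # one-hot taken action times advantage

# Policy-gradient cross-entropy term: sum over actions of log(pi(a|s)) * advantage
cross_entropy = np.sum(np.log(action_prob) * advantages)
print(cross_entropy)  # log(~1) * advantage ≈ 0 -> essentially no gradient, agent stays stuck
```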

I am not doing any random exploration, just choosing actions based on the output of the actor network. The discount factor is 0.99. The advantage estimate becomes negative in this case.

akileshbadrinaaraayanan avatar Jul 12 '17 04:07 akileshbadrinaaraayanan

There is an exploration problem when using policy gradients. In DQN, the agent can still explore with probability epsilon even after convergence. In actor-critic, once the actor network has converged, it is hard for the agent to explore.
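A rough illustration of that contrast (not code from this repo):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """DQN-style selection: even a converged Q-network still explores with probability epsilon."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def sample_from_policy(action_probs):
    """Actor-critic selection: a converged, near-one-hot softmax almost never deviates."""
    return int(np.random.choice(len(action_probs), p=action_probs))

print(epsilon_greedy(np.array([0.1, 2.3, 0.5])))        # picks a random action ~10% of the time
print(sample_from_policy(np.array([1e-20, 1.0, 0.0])))  # effectively always the same action
```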

I think there are two options. One is simply training with a lower learning rate. The other is adding the entropy of the policy to the loss function of the actor network. If you look at the A3C agent, you will see there is an entropy term in its loss function.
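A minimal sketch of the entropy option, assuming a Keras-style softmax actor; the function name, the entropy_beta value, and the 1e-10 clipping are just for illustration and may differ from what the repo's A3C agent actually uses:

```python
from keras import backend as K

def actor_loss(action_onehot, action_prob, advantage, entropy_beta=0.01):
    # Policy-gradient term: -log pi(a|s) * advantage for the taken action.
    log_prob = K.sum(action_onehot * K.log(action_prob + 1e-10), axis=1)
    policy_loss = -log_prob * advantage

    # Entropy of the policy: large when the distribution is spread out,
    # near zero when it is almost one-hot. Subtracting beta * entropy
    # penalizes premature convergence to a deterministic policy.
    entropy = -K.sum(action_prob * K.log(action_prob + 1e-10), axis=1)
    return K.mean(policy_loss - entropy_beta * entropy)
```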

dnddnjs avatar Jul 12 '17 06:07 dnddnjs