
MC Control with Epsilon-Greedy Policies: Epsilon Value and Best-Action Probability Error

Open hardik-kansal opened this issue 2 years ago • 2 comments

  • The epsilon value is not decreased hyperbolically. At the end of each episode, there should be epsilon = epsilon / 1.1.
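The decay being proposed could be sketched roughly as follows. This is a minimal illustration of the suggestion, not code from the repository; the names decay_epsilon, factor, and min_epsilon are hypothetical, and the floor on epsilon is an added assumption to keep some exploration.

```python
def decay_epsilon(epsilon, factor=1.1, min_epsilon=0.01):
    """Divide epsilon by a constant factor after each episode,
    as suggested in the issue, with a floor so exploration
    never fully stops (the floor is an illustrative addition)."""
    return max(epsilon / factor, min_epsilon)

epsilon = 0.1
for episode in range(5):
    # ... run one episode with an epsilon-greedy policy ...
    epsilon = decay_epsilon(epsilon)  # epsilon shrinks toward min_epsilon
```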

hardik-kansal · Dec 23 '23

Ensure proper epsilon decay by verifying the division by 1.1, the initialization, the data types, and the episode-end trigger. Adjust the decay rate if necessary.

AbhinavSharma07 · Apr 14 '24

If you are referring to the second exercise of the Monte Carlo methods section, https://github.com/dennybritz/reinforcement-learning/tree/master/MC ("Implement the on-policy first-visit Monte Carlo Control algorithm", https://github.com/dennybritz/reinforcement-learning/blob/master/MC/MC%20Control%20with%20Epsilon-Greedy%20Policies%20Solution.ipynb), then there's no need to implement an epsilon decay.

The intention is to refine the state-action values of an epsilon-greedy policy toward the optimal policy (it won't become strictly optimal, because it remains a soft policy). The requirement is only a soft policy that is greedy with respect to its own state-action values most of the time, and an epsilon-greedy policy satisfies that requirement even with a constant epsilon.
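The constant-epsilon policy described above can be sketched along the lines of the notebook's make_epsilon_greedy_policy helper. This is an illustrative reconstruction under the usual formulation, not a verbatim copy of the notebook: every action gets probability epsilon/nA, and the greedy action receives the remaining 1 - epsilon on top.

```python
import numpy as np

def make_epsilon_greedy_policy(Q, epsilon, nA):
    """Return a function mapping an observation to action probabilities.
    Each of the nA actions gets epsilon/nA probability of being picked,
    and the action with the highest Q-value gets 1 - epsilon added."""
    def policy_fn(observation):
        A = np.ones(nA, dtype=float) * epsilon / nA
        best_action = int(np.argmax(Q[observation]))
        A[best_action] += 1.0 - epsilon
        return A
    return policy_fn
```

With epsilon fixed, this stays a soft policy (every action keeps probability at least epsilon/nA), which is exactly what on-policy first-visit MC control needs for its convergence argument.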

Although in a real-world scenario an epsilon value with decay would normally be better (especially in a stationary environment like blackjack, the environment used in the exercise), there's no need to use decay here. In fact, I think it's better not to include decay: the book (Chapter 5) specifies just an epsilon-greedy policy without decay, so omitting it conforms to the book and keeps the focus on the control algorithm itself rather than on the possible exploration policies that could be used (decay schedules for 𝜖, Upper Confidence Bound (UCB), Boltzmann exploration (softmax), etc.), even if those would be a better fit and converge faster toward the optimal policy.
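Of the alternatives mentioned above, Boltzmann (softmax) exploration is easy to sketch for comparison. This is a generic illustration, not part of the repository; the function name and temperature parameter are hypothetical.

```python
import numpy as np

def softmax_policy(q_values, temperature=1.0):
    """Boltzmann exploration: action probabilities proportional to
    exp(Q / temperature). High temperature -> near-uniform; low
    temperature -> near-greedy."""
    z = q_values / temperature
    z = z - z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Unlike epsilon-greedy, this weights actions by how good their estimated values are, rather than splitting probability between one greedy action and uniform noise.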

lucasbasquerotto · Jul 30 '24