reinforcement-learning Questionable result in Gamblers Problem Solution

Questionable result in Gamblers Problem Solution

Open bminixhofer opened this issue 5 years ago • 11 comments

I encountered an interesting issue in the Gamblers Problem notebook .

The graph of capital vs policy looks like this with p_h = 0.4:

In the book (on page 66), we see how the graph is supposed to look. The general notion appears to be the same, but the strange spikes in the notebook do not appear:

If we reduce theta in value_iteration_for_gamblers from 0.0001 to 1e-30 the spikes appear to become less, but they are still significant:

Does anyone know why this happens? Thanks!

Aug 20 '18 20:08 bminixhofer

The book does mention that there is a family of ideal solutions, not just one. Perhaps this is another valid solution from the family? I'm not too sure but it sounds plausible.

Sep 09 '18 11:09 puzzler10

I'm getting the exact book solution with p_h = 0.25:

But what's shown in the uploaded solution is more like what you're getting for p_h=0.4.

The difference in my code is that I do synchronous updates of V (only at the end of each state loop).

If you plot out the values of A[stake] , you'll see that in some cases there are multiple stakes for which A[s1] ~= A[s2], for example for s = 30:

So when you run np.argmax to get the policy, you could get several solutions but the function only returns the first one, or there is a small numerical difference which tips the scales towards some stake even though in the theoretical converged solution they are the same.

The difference in the final results is probably because of small differences in V[s] for lower states getting updated as the s loop runs. I think in principal both results are analytically correct.

Oct 02 '18 02:10 ArikVoronov

That makes sense. But, to be honest, I don't understand why both policies (the noisy one and the one from the book) are optimal. That might be answered by Exercise 4.8 from the book:

Exercise 4.8 
Why does the optimal policy for the gambler’s problem have such a curious
form? In particular, for capital of 50 it bets it all on one flip, but for capital of 51 it does
not. Why is this a good policy?

Does anyone know the answer to this exercise? Intuitively, I would think that the stake should be equal to s when s < 50, and afterwards equal to 100 - s, such that when you win the bet, you exactly reach your goal. The pattern with oscillations between 0 and ~12 with spikes at 25, 50 and 75 does not make much sense to me.

Oct 02 '18 06:10 bminixhofer

I think I do have one answer to that, you can check easily: Put the discount factor to ANYTHING lower than 1, and you get pretty much what you expected here : a pyramid with a maximum value at 50. That's because the algorithm will "try" to get to the goal with as few steps as possible, and that means hoping you'll win as much as you can at every step.

On the other hand, if the p_h is <0.5 and gamma =1 , the algorithm doesn't really mind taking its time to win, and so it tries to maximize value at every step: It's easiest to think about the first iteration of the value function: Assuming you initialize with V[s] = 0, on the first iteration, only steps >50 get any value, because only those steps can reach the reward in one step and get v[s] = 1p_h (=0.4, for example) Now, Let's say you're at step 80 - you can bet 20 - then you either win +1p_h value OR you lose, move to 60 and get extra 0.4*(1-p_h) value But if you're at step 70, you can bet 30 and win +1p_h reward OR you lose and get 0. On the other hand , you could bet 20 and then win +1p_h reward or lose and get +0.4*(1-p_h) If you lose more than you win p_h<0.5 , it's sometimes better to bet safely that you'll get another direct chance to go for the win (again, unless you're penalized by some discount).

In particular, at step 50, it doesn't matter if you lose 1 or lose 49, so you might as well go for as much as possible to maximize the potential value. on the other hand, on step 51 you could lose 1 then get back to step 50 and try again.

As I see it, the cyclical values are a balance between betting on winning and betting on losing.

Oct 02 '18 09:10 ArikVoronov

@ArikVoronov That has confused me for a long time, thx for your explanation.

Mar 28 '19 01:03 onetree1994

Why the figures of 'value estimate' and 'final policy' become so weird when the 'ph' is bigger than 0.5?

Apr 20 '19 12:04 Huixxi

The graph you're getting with the erroneous spikes in the first post are probably about floating point arithmetic. I was able to output the graph that looked just like the textbook by comparing values that are rounded to ten digits when taking the max over the sum for updating the value function

Sep 30 '19 22:09 sanakdag

The book gets the solution it does because it chooses the least amount to bet rather than the most (using argmax)

There's an alternative optimal solution that looks like an upside-down V (obtained by using argmax but choosing the highest of equal bets rather than the lowest).

And, there are mixes of those strategies. When you have an amount of money less than $50, a winning strategy is always to bet it all, but you can also use the value from the figure in the book. They key is to minimize the total number of bets (since you don't have an edge).

The original figure shown by bminixhofer is a combination of the two: as you can see it is a mixture of the book figure along with an upside-down V. As stated earlier, this is due to slightly different floating-point values for what should actually end up as equal values. Thus, sometimes that figure shows the lower of the two equally-valid bets and sometimes it shows the higher.

Feb 20 '20 23:02 nrhodes

The original figure shown by bminixhofer is actually correct, as there are a family of correct solutions.

Since the gambler doesn't have an edge, the optimal policy is to minimize the number of bets. With $51, optimal policy is either:

Bet $1 and try to get that to $50. If you lose it, bet the $50
Bet $49. If you lose it, you've got $2 left. Try to build it up to $100.

The book shows a policy that always shows the minimum optimal bet. Here's a figure that shows the maximum optimal bet:

Note that bminixhofer's figure is a mix between the two bets. That's likely caused, as sanakdag points out, because the two actions that should have identical value aren't exactly identical (off in one of the least significant digits). Thus, the figure is showing whichever action happens to be slightly higher. Rounding does address this issue.

Huixxi asks why the figure looks weird when ph is > 0.5. That's because the winning strategy is to maximize the number of bets. Thus, no matter what the capital stake, bet $1.

Feb 20 '20 23:02 nrhodes

The graph you're getting with the erroneous spikes in the first post are probably about floating point arithmetic. I was able to output the graph that looked just like the textbook by comparing values that are rounded to ten digits when taking the max over the sum for updating the value function

Wow! This was the issue I had. The plot in the book assumes that you choose the least amount when outcomes are equal, except that floating point issues sometimes make the larger bet amount better in the calculation. Thanks!

Aug 17 '20 07:08 umutisik

Can confirm the floating point issue - I rounded to 7 digits and now my graph looks like the book....

Aug 28 '20 00:08 tniccum21

reinforcement-learning reinforcement-learning copied to clipboard

Questionable result in Gamblers Problem Solution

reinforcement-learning
reinforcement-learning copied to clipboard