adventures-in-ml-code
Bugfix: policy gradient REINFORCE TF2
Hi,
there is a bug in policy_gradient_reinforce_tf2.py at line 39:
loss = network.train_on_batch(states, discounted_rewards)
To fix this I made two changes:
- one-hot encode the actions:
  one_hot_encode = np.array([[1 if a == i else 0 for i in range(2)] for a in actions])
- pass the discounted rewards using the 'sample_weight' parameter of the 'categorical_crossentropy' loss function
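A minimal sketch of the two changes together, assuming a small Keras policy network compiled with categorical_crossentropy (the network architecture and the episode data below are illustrative stand-ins, not the repo's actual values):

```python
import numpy as np
from tensorflow import keras

num_actions = 2  # CartPole-v0 has two discrete actions

# Hypothetical stand-in for the script's policy network
network = keras.Sequential([
    keras.Input(shape=(4,)),                           # CartPole observation size
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(num_actions, activation="softmax"),
])
network.compile(optimizer="adam", loss="categorical_crossentropy")

# Dummy episode data for illustration
states = np.random.rand(5, 4).astype(np.float32)
actions = np.array([0, 1, 1, 0, 1])
discounted_rewards = np.array([4.9, 3.9, 2.9, 1.9, 1.0], dtype=np.float32)

# Change 1: one-hot encode the actions so the targets match the softmax output
one_hot_actions = np.array(
    [[1 if a == i else 0 for i in range(num_actions)] for a in actions]
)

# Change 2: weight each sample's cross-entropy by its discounted return,
# which recovers the REINFORCE gradient
loss = network.train_on_batch(
    states, one_hot_actions, sample_weight=discounted_rewards
)
```

With sample_weight, each transition's log-probability term is scaled by its return, which is exactly the weighting REINFORCE requires.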
I think it also solves issues #26, #27, and #28.
I tested it with gym.make("CartPole-v0").
It converged within 2000 episodes!
Episode: 1961, Reward: 200.0, avg loss: 0.01056
Episode: 1962, Reward: 200.0, avg loss: 0.02165
Episode: 1963, Reward: 200.0, avg loss: -0.04293
Episode: 1964, Reward: 200.0, avg loss: -0.00953
Episode: 1965, Reward: 200.0, avg loss: 0.02787
Episode: 1966, Reward: 200.0, avg loss: 0.00205
Episode: 1967, Reward: 200.0, avg loss: 0.01984
Episode: 1968, Reward: 200.0, avg loss: 0.00307
Episode: 1969, Reward: 200.0, avg loss: -0.03621
Episode: 1970, Reward: 200.0, avg loss: -0.02112
Episode: 1971, Reward: 200.0, avg loss: -0.00132
Episode: 1972, Reward: 200.0, avg loss: 0.02377
Episode: 1973, Reward: 200.0, avg loss: 0.02295
Episode: 1974, Reward: 200.0, avg loss: -0.01884
Episode: 1975, Reward: 200.0, avg loss: 0.02013
Episode: 1976, Reward: 200.0, avg loss: 0.02265
Episode: 1977, Reward: 200.0, avg loss: 0.00097
Episode: 1978, Reward: 200.0, avg loss: -0.03959
Episode: 1979, Reward: 200.0, avg loss: 0.00527
Episode: 1980, Reward: 200.0, avg loss: 0.02360
Episode: 1981, Reward: 200.0, avg loss: 0.03568
Episode: 1982, Reward: 200.0, avg loss: 0.00684
Episode: 1983, Reward: 200.0, avg loss: 0.00912
Episode: 1984, Reward: 200.0, avg loss: -0.03238
Episode: 1985, Reward: 200.0, avg loss: 0.03891
Episode: 1986, Reward: 200.0, avg loss: 0.01156
Episode: 1987, Reward: 200.0, avg loss: 0.04099
Episode: 1988, Reward: 200.0, avg loss: -0.00574
Episode: 1989, Reward: 200.0, avg loss: 0.01317
Episode: 1990, Reward: 200.0, avg loss: 0.00885
Episode: 1991, Reward: 200.0, avg loss: 0.02338
Episode: 1992, Reward: 200.0, avg loss: 0.00069
Episode: 1993, Reward: 200.0, avg loss: 0.01195
Episode: 1994, Reward: 200.0, avg loss: 0.02862
Episode: 1995, Reward: 200.0, avg loss: -0.00214
Episode: 1996, Reward: 200.0, avg loss: 0.01396
Episode: 1997, Reward: 200.0, avg loss: -0.01529
Episode: 1998, Reward: 200.0, avg loss: 0.01859
Episode: 1999, Reward: 200.0, avg loss: 0.02944