strange behavior of reward signal
Hi,
I'm seeing some very strange behavior that I can't explain when running your code, and I'd be interested to know whether you can reproduce it or help me understand it. I wanted to see whether I could compute a better reward, and along the way I tested with fixed values. That is, I replaced the implementation of rollout.py:get_reward() with:
    rewards = np.zeros((64, 20))
    rewards.fill(2)
    return rewards
Surprisingly, the generator converged faster, and to a lower test error (see the attached log). I got pretty much the same behavior when I used rewards sampled uniformly from [0, 1]. I'm not sure what to make of it.
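For concreteness, the uniform-reward variant was along these lines (just a sketch of the replaced body, wrapped in a standalone function here; the (64, 20) shape is the batch size and sequence length I ran with):

    import numpy as np

    def get_reward_uniform(batch_size=64, seq_length=20):
        # Stand-in for rollout.py:get_reward(): ignore the rollouts and the
        # discriminator entirely and return rewards drawn uniformly from [0, 1].
        rewards = np.random.uniform(0.0, 1.0, size=(batch_size, seq_length))
        return rewards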
Also, a question: why does the rollout network lag behind the generator (the default value is 0.8)? In theory, don't we want to sample from the latest generator?
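(My understanding is that the rollout parameters track the generator's as an exponential moving average with that 0.8 rate, roughly like the sketch below; this is just how I read it, not the actual code.)

    def update_rollout_params(rollout_params, generator_params, update_rate=0.8):
        # Each rollout parameter trails the corresponding generator parameter
        # as an exponential moving average, so the rollout network lags behind
        # the latest generator by a factor controlled by update_rate.
        return [update_rate * r + (1.0 - update_rate) * g
                for r, g in zip(rollout_params, generator_params)]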
Very interesting experiments!
Interesting. Did you figure out why?