strange behavior of reward signal
Hi,
I'm seeing some very strange behavior that I can't explain when running your code, and I'd be interested to know whether you can reproduce it or help me understand it. I wanted to see whether I could compute a better reward, and along the way I tested with fixed values. That is, I replaced the implementation of rollout.py:get_reward() with:
    rewards = np.zeros((64, 20))
    rewards.fill(2)
    return rewards
Surprisingly, the generator converged faster, and to a lower test error (see the attached log). I got pretty much the same behavior when I used rewards sampled uniformly from [0, 1]. I'm not sure what to make of it.
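For concreteness, the uniform-reward variant was along these lines (just a sketch of the replaced body, wrapped in a standalone function here; the (64, 20) shape is the batch size and sequence length I ran with):

    import numpy as np

    def get_reward_uniform(batch_size=64, seq_length=20):
        # Stand-in for rollout.py:get_reward(): ignore the rollouts and the
        # discriminator entirely and return rewards drawn uniformly from [0, 1].
        rewards = np.random.uniform(0.0, 1.0, size=(batch_size, seq_length))
        return rewards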
Also, a question: why does the rollout network lag behind the generator (the default value is 0.8)? In theory, don't we want to sample from the latest generator?
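(My understanding is that the rollout parameters track the generator's as an exponential moving average with that 0.8 rate, roughly like the sketch below; this is just how I read it, not the actual code.)

    def update_rollout_params(rollout_params, generator_params, update_rate=0.8):
        # Each rollout parameter trails the corresponding generator parameter
        # as an exponential moving average, so the rollout network lags behind
        # the latest generator by a factor controlled by update_rate.
        return [update_rate * r + (1.0 - update_rate) * g
                for r, g in zip(rollout_params, generator_params)]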
Very interesting experiments!
Interesting. Did you figure out why?