In the doc “RECURRENT DQN: TRAINING RECURRENT POLICIES”, I managed to run the code, but during the process I realized that the maximum number of steps for each batch is only 50:
steps: 50, loss_val: 0.1930, action_spread: tensor([26, 24], device='cuda:0'): 18%|█▊ | 181450/1000000 [1:54:35<9:08:14, 24.88it/s]
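For context, the collector in my run is set up roughly like this (a simplified sketch; env, stoch_policy and device are the objects built earlier in the tutorial), which is where the 50 steps per batch come from:

from torchrl.collectors import SyncDataCollector

# Each batch yielded by the collector holds frames_per_batch transitions,
# so every "data" I print below has exactly 50 entries.
collector = SyncDataCollector(
    env,                     # the transformed CartPole env from the tutorial
    stoch_policy,            # the exploration policy
    frames_per_batch=50,     # matches the 50 steps per batch I see
    total_frames=1_000_000,  # matches the 1000000 total in the progress bar
    device=device,
)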
I tried printing it:
print(data["step_count"])
tensor([[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[10],
[11],
[12],
[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9]], device='cuda:0')
The next output is:
tensor([[10],
[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[10],
[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 0]], device='cuda:0')
I've tried this many times and it's always the same pattern; that is to say, the step count starts again after each batch. I don't know why.
Using the 'is_init' key, I found that the env is always reset on the second step of a batch.
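This is roughly the loop I used to check it (a sketch; 'step_count' and 'is_init' are the keys added by the tutorial's StepCounter and InitTracker transforms):

# Print step_count next to is_init for the first collected batch only.
for data in collector:
    steps = data["step_count"].squeeze(-1).tolist()
    inits = data["is_init"].squeeze(-1).tolist()
    for t, (s, i) in enumerate(zip(steps, inits)):
        print(f"t={t:2d}  step_count={s:2d}  is_init={i}")
    break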