pytorch_a3c

expected a Variable arg but got numpy.ndarray error

Open dylanthomas opened this issue 7 years ago • 6 comments

I am new to PyTorch. I just cloned your code and ran it, but got an error. I hope you can point me in the right direction to fix this issue.

More specifics:

  1. Used a conda env with Python 3.6
  2. Ran 'run_a3c.py' with the default Breakout-v0 env to the end, then ran 'python test_a3c.py --render --monitor --env Breakout-v0'
  3. Got the error message below:

```
File "test_a3c.py", line 71, in test(policy, args)
File "test_a3c.py", line 25, in test
    p, v = policy(o)
...
File "/home/john/anaconda3/envs/th/lib/python3.6/site-packages/torch/nn/functional.py", line 37, in conv2d
    return f(input, weight, bias) if bias is not None else f(input, weight)
RuntimeError: expected a Variable argument, but got numpy.ndarray
```

Could you tell me what the issue(s) could be here?

Many thanks,

John

dylanthomas avatar Mar 06 '17 01:03 dylanthomas

A torch.nn.Module must take a torch.autograd.Variable as input, but policy (which is a subclass of torch.nn.Module) is being fed a numpy.ndarray, so we have to convert the numpy.ndarray to a Variable. I fixed this problem; see my commit 9e9fb687786a025061561c7260ba9b586e9ca4ce.
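For anyone hitting the same error, here is a minimal sketch of the kind of conversion the fix performs. The helper name and the (C, H, W) observation shape are illustrative assumptions, not taken from the commit:

```python
import numpy as np
import torch
from torch.autograd import Variable

def to_variable(obs):
    # Wrap a numpy observation so an nn.Module can consume it.
    # Assumes obs is an array shaped (C, H, W); a batch dimension is added here.
    tensor = torch.from_numpy(np.ascontiguousarray(obs, dtype=np.float32))
    return Variable(tensor.unsqueeze(0))

# Illustrative usage at the call site that raised the error:
# p, v = policy(to_variable(o))
```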

rarilurelo avatar Mar 06 '17 02:03 rarilurelo

Many thanks.

On another note, when I ran Breakout-v0, the reward I got after 10M steps was only around 30~40. But shouldn't this be around 400 according to DeepMind's paper? I wonder where the difference is coming from... Any thoughts/insight on this?

dylanthomas avatar Mar 06 '17 07:03 dylanthomas

There are some differences between my code and DeepMind's paper. My code has:

  1. no LSTM
  2. no gradient clipping (sketched below)
  3. no hyperparameter tuning (I couldn't find the learning rate in the paper)

That's why the result was not good enough, I think.
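A minimal sketch of what adding gradient clipping would look like, written against current PyTorch (where the helper is spelled clip_grad_norm_). The tiny model, optimizer, and max-norm value are illustrative stand-ins, not taken from this repo:

```python
import torch
import torch.nn as nn

# Tiny stand-in model and optimizer just to show where clipping fits in a step;
# the real policy network and loss are defined elsewhere.
model = nn.Linear(4, 2)
optimizer = torch.optim.RMSprop(model.parameters(), lr=7e-4)

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global L2 norm stays under the chosen threshold
# (40.0 is an example value, not taken from this repo).
nn.utils.clip_grad_norm_(model.parameters(), max_norm=40.0)
optimizer.step()
```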

rarilurelo avatar Mar 06 '17 11:03 rarilurelo

Thank you for your reply. Two points --

On the param setting, are you aware of this wiki ( https://github.com/muupan/async-rl/wiki ) ?

On the performance issue of the TensorFlow implementation, have you seen this discussion: https://github.com/dennybritz/reinforcement-learning/issues/30 ? It's about DQN, but the same issues are supposed to be the root cause on the A3C side as well.

There, cgel suggests the following are key:

Important stuff:

  1. Normalize the input to [0, 1]
  2. Clip rewards to [0, 1]
  3. Don't tf.reduce_mean the losses in the batch; use tf.reduce_max
  4. Initialize the network properly with Xavier init
  5. Use the optimizer the paper uses; it is not the same RMSProp as in TF

Has your code incorporated all of the points above?
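For reference, the input-normalization, reward-clipping, and Xavier-init points translate to PyTorch roughly as follows. This is a sketch under the thread's suggestions; the function names are illustrative, and the loss-reduction and shared-RMSProp points are TF-specific and not covered here:

```python
import numpy as np
import torch.nn as nn

def preprocess_frame(frame):
    # Normalize raw Atari pixels (uint8 in [0, 255]) into [0, 1].
    return np.asarray(frame, dtype=np.float32) / 255.0

def clip_reward(r):
    # Clip rewards into the [0, 1] range suggested above.
    return float(np.clip(r, 0.0, 1.0))

def init_xavier(module):
    # Xavier-initialize conv and linear layers; apply with model.apply(init_xavier).
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.constant_(module.bias, 0.0)
```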

dylanthomas avatar Mar 08 '17 01:03 dylanthomas

@dylanthomas did you try running Breakout-v0 for longer than 10M timesteps to see if avg reward eventually got to >400? For example, it took Muupan's A3C https://github.com/muupan/async-rl#a3c-ff 20M timesteps to start getting to >400.

ethancaballero avatar Mar 14 '17 07:03 ethancaballero

Not yet, but I will run this code for 20M timesteps to see if it goes up to 400. @ethancaballero

dylanthomas avatar Mar 15 '17 00:03 dylanthomas