parallel-trpo

Add Python 3 compatibility

Open · danijar opened this issue on Oct 02, 2016 · 7 comments

Work in progress. It's running but not training (seems to not update the policy).

danijar avatar Oct 02 '16 22:10 danijar

What results are you getting when you run it? It seems to be training for me when I simply run python main.py:

-------- Iteration 0 ----------
Total time: 0.01 mins
408
0.0221478324383
4.70614836553
KL between old and new distribution:      0.000954865
Timesteps:                                19992
Entropy:                                  2.79114
Surrogate loss:                           -0.00904811
Average sum of rewards per episode:       -107.9788759
-------- Iteration 1 ----------
Total time: 0.15 mins
408
0.0288314968348
5.36949688842
KL between old and new distribution:      0.000983955
Timesteps:                                19992
Entropy:                                  2.7493
Surrogate loss:                           -0.0102529
Average sum of rewards per episode:       -103.852836985
-------- Iteration 2 ----------
Total time: 0.29 mins
408
0.0303251892328
5.50683114257
KL between old and new distribution:      0.000986047
Timesteps:                                19992
Entropy:                                  2.70699
Surrogate loss:                           -0.0106957
Average sum of rewards per episode:       -97.9895186744
-------- Iteration 3 ----------
Total time: 0.42 mins
408
0.0265925377607
5.15679529948
KL between old and new distribution:      0.000955042
Timesteps:                                19992
Entropy:                                  2.66511
Surrogate loss:                           -0.00985545
Average sum of rewards per episode:       -94.976103207
-------- Iteration 4 ----------
Total time: 0.56 mins
408
0.0278401058167
5.27637241073
KL between old and new distribution:      0.000956588
Timesteps:                                19992
Entropy:                                  2.62335
Surrogate loss:                           -0.0100624
Average sum of rewards per episode:       -90.8704326089
-------- Iteration 5 ----------
Total time: 0.70 mins
408
0.0324200838804
5.69386370406
KL between old and new distribution:      0.000974943
Timesteps:                                19992
Entropy:                                  2.58171
Surrogate loss:                           -0.0109182
Average sum of rewards per episode:       -89.9082363716
-------- Iteration 6 ----------
Total time: 0.84 mins
408
0.0281992703676
5.31029851963
KL between old and new distribution:      0.000974425
Timesteps:                                19992
Entropy:                                  2.53968
Surrogate loss:                           -0.0102566
Average sum of rewards per episode:       -86.2517944341

kvfrans avatar Oct 03 '16 00:10 kvfrans

I'm still working in Python 2.7, but with your updated code. Most likely there is a portion of the code that doesn't run properly in Python 3.

In ilyasu's Python 3 implementation, the dictionary is created differently during rollouts (see the function rollouts_contin in https://github.com/ilyasu123/trpo/blob/master/utils.py). Maybe that could be a cause?

dict2(obs=np.concatenate(np.expand_dims(obs, 0)),
      action_dists_mu=np.concatenate(action_dists_mu),
      action_dists_logstd=np.concatenate(action_dists_logstd),
      rewards=np.array(rewards),
      actions=np.array(actions))

kvfrans avatar Oct 03 '16 00:10 kvfrans

Thanks, that's worth a try. Otherwise I think it's likely due to true division, which is the default even for integers in Python 3. My output is that the average score stays the same and the KL between the old and the new policy is always zero. I'll look into it in the next few days. If you have the time, you could add from __future__ import division to the beginning of all relevant files and see if that causes the same issue under Python 2.
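For reference, a minimal sketch of the division change in question (plain Python, not code from this repository):

# Python 2: 7 / 2 == 3   (floor division between integers)
# Python 3: 7 / 2 == 3.5 (true division is the default)
# With the import below, Python 2 behaves like Python 3, which is how the
# suspected issue could be reproduced (or ruled out) under Python 2.
from __future__ import division

print(7 / 2)    # 3.5 under true division
print(7 // 2)   # 3, the explicit floor-division operator in both versions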

danijar avatar Oct 03 '16 08:10 danijar

It appears that ilyasu/trpo has the same issue (it's only designed for Python 2 and gets zero KL divergence in Python 3).

I tried from __future__ import division, but couldn't replicate the issue in Python 2 that way. @danijar or @kvfrans, do either of you have other ideas?

nottombrown avatar Jun 19 '17 16:06 nottombrown

I guess the next step is figuring out whether the issue lies on the NumPy side or the TensorFlow side. Maybe try logging the observation/action pairs and seeing if there is any difference?
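As a hypothetical sketch of that comparison (the key names follow the rollout dictionary quoted above; dump_paths is not part of the repository):

import sys
import numpy as np

def dump_paths(paths, tag):
    # paths: list of rollout dicts with keys such as obs, actions, rewards.
    # Writes one .npz per interpreter so the Python 2 and Python 3 runs
    # can be diffed offline with np.load and np.allclose.
    np.savez("rollout_debug_%s_py%d.npz" % (tag, sys.version_info[0]),
             obs=np.concatenate([p["obs"] for p in paths]),
             actions=np.concatenate([p["actions"] for p in paths]),
             rewards=np.concatenate([p["rewards"] for p in paths]))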

kvfrans avatar Jun 19 '17 18:06 kvfrans

My guess is that the problem is with a pure-Python division, or maybe a NumPy division. TensorFlow always requires specified types to divide tensors. The only behavior change for TensorFlow I'm aware of is that in Python 3, dividing a tensor by a Python integer automatically converts the integer to a constant float tensor, whereas in Python 2 that caused an error. But that only adds behavior that wasn't available in Python 2. However, there might be other TensorFlow-related quirks that I'm not aware of.
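For concreteness, one documented difference in how TensorFlow 1.x overloads the / operator under the two interpreters (a sketch, not necessarily the exact case hit here):

import tensorflow as tf

x = tf.constant([1, 2, 3])   # int32 tensor
y = x / 2                    # Python 2: __div__     -> legacy div, integer result for int tensors
                             # Python 3: __truediv__ -> tf.truediv, casts to float ([0.5, 1.0, 1.5])

with tf.Session() as sess:
    print(sess.run(y))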

danijar avatar Jun 19 '17 19:06 danijar

For the record, the problem is with how map behaves in Python 3: it returns a lazy iterator rather than a list, so once the shapes are consumed to compute total_size, the loop below builds no assign ops and the policy parameters are never updated. SetFromFlat should become:

import numpy as np
import tensorflow as tf


class SetFromFlat(object):

    def __init__(self, session, var_list):
        self.session = session
        # var_shape is the existing helper in utils.py that returns a
        # variable's static shape as a list of ints.
        # The list(...) wrapper is the needed change: in Python 3, map()
        # returns an iterator, which the sum() below would exhaust, so the
        # for loop would create no assign ops at all.
        shapes = list(map(var_shape, var_list))
        total_size = sum(np.prod(shape) for shape in shapes)
        self.theta = tf.placeholder(tf.float32, [total_size])
        start = 0
        assigns = []
        for shape, v in zip(shapes, var_list):
            size = np.prod(shape)
            # Slice the corresponding chunk out of the flat vector and
            # reshape it back into the variable's shape.
            assigns.append(
                tf.assign(v, tf.reshape(self.theta[start:start + size], shape)))
            start += size
        self.op = tf.group(*assigns)

    def __call__(self, theta):
        self.session.run(self.op, feed_dict={self.theta: theta})
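
Independent of the TRPO code, a quick way to see why the list(...) wrapper matters under Python 3:

# In Python 3, map() returns a lazy iterator, so it can only be consumed once.
shapes = map(len, [[1], [1, 2], [1, 2, 3]])
total_size = sum(shapes)           # consumes the iterator -> 6
pairs = list(zip(shapes, "abc"))   # -> [], shapes is already exhausted

print(total_size, pairs)           # 6 []

# Wrapping map() in list() keeps shapes reusable, as in the fix above.
shapes = list(map(len, [[1], [1, 2], [1, 2, 3]]))
print(sum(shapes), list(zip(shapes, "abc")))   # 6 [(1, 'a'), (2, 'b'), (3, 'c')]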

Breakend avatar Jun 21 '17 00:06 Breakend