
Policies are not stochastic on arm64 architectures

Open kamilazdybal opened this issue 3 months ago • 0 comments

I'm implementing a continuous action space in TF-Agents, where I want the action to be a four-element array with elements:

$s \in [0,10]$, $dx \in [-1,1]$, $dy \in [-1,1]$, $dz \in [-1,1]$

I'm then using RandomTFPolicy to sample actions for a batch of 5 observations. This is the output I get on every run:

[[ 3.1179297  -0.37641406 -0.37641406 -0.37641406]
 [ 8.263412    0.65268254  0.65268254  0.65268254]
 [ 6.849456    0.36989117  0.36989117  0.36989117]
 [ 0.06709099 -0.9865818  -0.9865818  -0.9865818 ]
 [ 7.8749514   0.5749903   0.5749903   0.5749903 ]]

My questions are:

  1. How come $dx$, $dy$, and $dz$ are the same float within each row? Why aren't they sampled independently? (See the rescaling check below.)
  2. How come I get the exact same action values every time I run my code? I'm not setting random seeds anywhere.
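
Regarding question 1: if I map each element of the first arm64 row back to a unit sample using that element's own bounds, all four entries give the same number, which suggests a single uniform draw per row is being broadcast across the action dimensions:

# Rescale the first arm64 row back to unit samples via (x - lo) / (hi - lo)
row = [3.1179297, -0.37641406, -0.37641406, -0.37641406]
lo = [0.0, -1.0, -1.0, -1.0]
hi = [10.0, 1.0, 1.0, 1.0]
print([(x - l) / (h - l) for x, l, h in zip(row, lo, hi)])
# -> [0.31179297, 0.31179297, 0.31179297, 0.31179297]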

I'm using arm64 macOS with:

python==3.11.13
tf-agents==0.19.0
tensorflow==2.15.1
tensorflow-metal==1.1.0
tensorflow-probability==0.23.0
numpy==1.26.4

Interestingly, this does not happen on an x86 macOS, nor on a Windows machine with the same package versions! There, all numbers are random:

[[ 2.9280972e+00 -3.7891769e-01 -7.7160120e-02  8.4350657e-01]
 [ 2.3010242e+00 -1.9348240e-01 -6.7645931e-01  2.9825187e-01]
 [ 6.4993248e+00  4.0297508e-03 -5.8490920e-01 -5.0786805e-01]
 [ 9.6005363e+00 -2.8406858e-01 -7.8258038e-02  5.8963799e-01]
 [ 1.4861953e+00 -8.2189059e-01 -2.9714632e-01 -5.1117587e-01]]
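
I haven't isolated whether the tensorflow-metal plugin is responsible, but hiding the GPU before any TensorFlow ops run would be one way to check (a sketch, untested on my side):

import tensorflow as tf

# Hide the Metal GPU so sampling runs on the CPU kernels instead;
# this must happen before any other TensorFlow ops execute.
tf.config.set_visible_devices([], "GPU")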

My code:


import numpy as np
import tensorflow as tf
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts
from tf_agents.policies import random_tf_policy

# Image-like observation: 64x64 with 2 channels, values in [0, 1]
observation_spec = array_spec.BoundedArraySpec(
    shape=(64, 64, 2),
    dtype=np.float32,
    minimum=0.0,
    maximum=1.0,
    name="observation",
)

# Four-element continuous action: s in [0, 10], dx/dy/dz in [-1, 1]
action_spec = array_spec.BoundedArraySpec(
    shape=(4,),
    dtype=np.float32,
    minimum=np.array([0.0, -1.0, -1.0, -1.0], dtype=np.float32),
    maximum=np.array([10.0,  1.0,  1.0,  1.0], dtype=np.float32),
    name="action",
)

time_step_spec = ts.time_step_spec(observation_spec)

policy = random_tf_policy.RandomTFPolicy(time_step_spec=time_step_spec,
                                         action_spec=action_spec)

# Batch of 5 random observations; the policy samples one action per observation
obs = tf.random.uniform(shape=(5, 64, 64, 2), minval=0.0, maxval=1.0, dtype=tf.float32)

timestep = ts.restart(observation=obs, batch_size=5)
action_step = policy.action(timestep, seed=None)
actions = action_step.action

print(actions.numpy())
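
To narrow this down further, my reading of tf_agents.specs.tensor_spec is that RandomTFPolicy ultimately draws from tf.random.uniform with per-element minval/maxval (treat that as an assumption on my part). Sampling that way directly, and with a manual rescale that keeps the bounds out of the RNG call, should show whether broadcasting the bounds is what breaks on this backend:

# Assumption: RandomTFPolicy samples roughly like this, broadcasting
# per-element bounds over the last axis.
minval = tf.constant([0.0, -1.0, -1.0, -1.0])
maxval = tf.constant([10.0, 1.0, 1.0, 1.0])
direct = tf.random.uniform(shape=(5, 4), minval=minval, maxval=maxval)
print(direct.numpy())  # do the last three columns repeat on arm64?

# Possible workaround: draw unit uniforms first, then rescale manually,
# so the per-element bounds never enter the RNG call itself.
u = tf.random.uniform(shape=(5, 4))
print((minval + u * (maxval - minval)).numpy())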

Moreover, I'm also seeing that agent.collect_policy samples the same action for the same observation value, which is how agent.policy is supposed to behave. My understanding is that collect_policy should always be stochastic. Here's a sequence of 20 actions where the last 11 correspond to the agent seeing the exact same observation; note that the actions are deterministic at that point, but they should be stochastic (a sketch of how I collect them follows the tensor):

<tf.Tensor: shape=(1, 20, 4), dtype=float32, numpy=
array([[[ 4.543779  , -0.7549822 , -0.98628926, -0.6924889 ],
        [ 4.543779  , -0.75414133, -0.9862873 , -0.6924889 ],
        [ 4.543778  , -0.7378491 , -0.98595184, -0.6924772 ],
        [ 4.5437074 , -0.69559884, -0.98290896, -0.69219804],
        [ 4.5435147 , -0.66731834, -0.97919464, -0.6916727 ],
        [ 4.543341  , -0.6532383 , -0.9768447 , -0.6912718 ],
        [ 4.5433545 , -0.6541612 , -0.9770225 , -0.6912997 ],
        [ 4.543534  , -0.6692811 , -0.979528  , -0.69171715],
        [ 4.5436664 , -0.6870763 , -0.9819559 , -0.69207287],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ],
        [ 4.5437007 , -0.6941073 , -0.982763  , -0.6921773 ]]],
      dtype=float32)>
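
For reference, these actions come from repeatedly querying collect_policy on the held-fixed observation, roughly like this (simplified from my training loop; fixed_obs stands in for the repeated observation of shape (1, 64, 64, 2)):

# Simplified sketch of my collection loop; on arm64 every printed
# row is identical even though collect_policy should be stochastic.
time_step = ts.restart(observation=fixed_obs, batch_size=1)
for _ in range(5):
    action_step = agent.collect_policy.action(time_step)
    print(action_step.action.numpy())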

I've used a simple REINFORCE agent here:


agent = reinforce_agent.ReinforceAgent(time_step_spec=train_env.time_step_spec(),
                                       action_spec=train_env.action_spec(),
                                       actor_network=actor_net,
                                       optimizer=optimizer,
                                       train_step_counter=train_step_counter, 
                                       gamma=0.95, 
                                       normalize_returns=False, 
                                       entropy_regularization=None)

agent.initialize()
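
For completeness, actor_net is an ActorDistributionNetwork along these lines (the layer parameters here are illustrative, not my exact configuration):

from tf_agents.networks import actor_distribution_network

# Illustrative sketch only; my real conv/fc sizes differ.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    conv_layer_params=[(16, 3, 2)],  # (filters, kernel_size, stride)
    fc_layer_params=(64, 64),
)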
