
Does TF-Agents not support XLA?

Open connor-create opened this issue 2 years ago • 5 comments

I have built a dynamic step driver but cannot seem to get it to work with jit_compile=True.

from tf_agents.utils import common as tfa_common

driver = Driver()
# Setup driver #

driver.run = tfa_common.function(driver.run, jit_compile=True)

Leads to this error on execution of training:

W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at xla_ops.cc:287 : INVALID_ARGUMENT: Trying to access resource  (defined @ /home/connorjaynes/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tf_agents/replay_buffers/tf_uniform_replay_buffer.py:155) located in device /job:localhost/replica:0/task:0/device:CPU:0 from device /job:localhost/replica:0/task:0/device:GPU:0

I'd imagine this is because TF-Agents creates something with a dtype of int32, and that tensor is then placed on the CPU as a result (the first known issue listed here).
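
For reference, a minimal sketch of how one might confirm that suspicion with plain TensorFlow placement logging, enabled before the agent, replay buffer, and driver are built (nothing TF-Agents-specific is assumed here):

# Turn on placement logging so the log shows which variables and ops land on
# /device:CPU:0 versus /device:GPU:0.
import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# ... build the environment, agent, replay buffer, and driver as usual ...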

I have also explored the utility functions in xla.py, but they don't seem to work either: the dynamic step driver ends up passing objects into the compiled function as kwargs, which XLA doesn't allow.

from tf_agents.utils import xla

driver = Driver()
# Setup driver #

driver.run = xla.compile_in_graph_mode(driver.run)

Leads to this error on execution of the training:

kwargs are not supported for functions that are XLA-compiled, but saw kwargs: {'time_step': TimeStep(
{'discount': <tf.Tensor: shape=(32,), dtype=float32, numpy=
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
      dtype=float32)>,
 'observation': <tf.Tensor: shape=(32, 4), dtype=float32, numpy=
array([blah],
      dtype=float32)>,
 'reward': <tf.Tensor: shape=(32,), dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)>,
 'step_type': <tf.Tensor: shape=(32,), dtype=int32, numpy=
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>}), 'policy_state': ()}
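
For completeness, a minimal sketch of one way around the kwargs restriction specifically, assuming our own training loop is what calls driver.run with time_step/policy_state (this does not address the underlying device-placement problem):

from tf_agents.utils import xla

# Instead of compiling driver.run directly, compile a thin wrapper that only
# accepts positional arguments, so the XLA wrapper never sees kwargs.
def run_positional(time_step, policy_state):
    return driver.run(time_step=time_step, policy_state=policy_state)

compiled_run = xla.compile_in_graph_mode(run_positional)

# In the training loop, call it positionally:
# time_step, policy_state = compiled_run(time_step, policy_state)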

Is there any way to apply XLA to these functions using some sort of trickle-down configuration option, or was this never implemented or supported? I don't see anything about this usage in the documentation either.

Thanks.

connor-create avatar Jul 04 '22 21:07 connor-create

You can try overwriting the dtype of the step_type in the Policy given to the Driver.
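
A minimal sketch of one way to do that, assuming an existing TF environment named tf_env; the policy/agent would then be rebuilt from the patched spec, and the int64 override is illustrative rather than an official TF-Agents option:

import tensorflow as tf

# Rebuild the time_step_spec with step_type as int64 so the variables derived
# from it become eligible for GPU placement (TimeStep is a namedtuple, so
# _replace works).
spec = tf_env.time_step_spec()
patched_spec = spec._replace(
    step_type=tf.TensorSpec(shape=spec.step_type.shape,
                            dtype=tf.int64,
                            name='step_type'))

# Construct the policy/agent (and hence the replay buffer data spec) from
# patched_spec instead of the original spec.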

sguada avatar Jul 04 '22 21:07 sguada

@sguada Hello! I am also on @connor-create's team.

We tried this yesterday and found that, indeed, making step_type an int64 gets it correctly placed on the GPU device. However, XLA still gives similar errors, since a few other TF variables (their dtypes show up as "DType enum" values in the resource handles below) are still placed on the CPU. We will keep looking for a similar fix for those.

Is there some working example of an XLA-enabled RL training loop with tf_agents? That would be very helpful to us.

These are the remaining vars that seem to be giving problems:

-------------------------------------------------------------------------------------------------
tf.Tensor(<ResourceHandle(name="Resource-24-at-0x55eb9c29f270", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)
-------------------------------------------------------------------------------------------------
tf.Tensor(<ResourceHandle(name="Resource-18-at-0x55eb9c24feb0", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)
-------------------------------------------------------------------------------------------------
tf.Tensor(<ResourceHandle(name="Resource-19-at-0x55eb9c254f00", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 1, Shape: [32000,4] ]")>, shape=(), dtype=resource)
-------------------------------------------------------------------------------------------------
tf.Tensor(<ResourceHandle(name="Resource-20-at-0x55eb9c29a3e0", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)
-------------------------------------------------------------------------------------------------
tf.Tensor(<ResourceHandle(name="Resource-21-at-0x55eb9c29c1b0", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 9, Shape: [32000] ]")>, shape=(), dtype=resource)
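
For reference, a minimal sketch of how one might list everything that stayed on the CPU, assuming existing agent and replay_buffer objects (both are tf.Module subclasses in TF-Agents):

# Print any backing variables that were placed on the CPU.
for name, module in [("agent", agent), ("replay_buffer", replay_buffer)]:
    for v in module.variables:
        if "CPU" in v.device:
            print(name, v.name, v.dtype, v.shape, v.device)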

sikanrong avatar Jul 06 '22 08:07 sikanrong

I also have questions about using XLA with tf_agents.

Is there some working example of an XLA-enabled RL training loop with tf_agents?

This would be amazing. Anyone have any info here at all?

chazzmoney avatar Aug 04 '22 23:08 chazzmoney

Unfortunately, the DynamicDriver has dynamic shapes and doesn't allow jit compilation; you can compile the Network or the Policy, though.
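
For example, a minimal sketch of that approach, assuming an existing TF-Agents agent:

from tf_agents.utils import common

# Leave driver.run uncompiled and jit-compile only the training step.
agent.train = common.function(agent.train, jit_compile=True)

# Alternatively (also a sketch), jit-compile the policy's action function:
# agent.policy.action = common.function(agent.policy.action, jit_compile=True)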

sguada avatar Aug 08 '22 19:08 sguada

I'm getting a similar error while training through model.fit() with MirroredStrategy on multiple GPUs. Is there a solution or workaround to avoid XLA compilation? Also, what causes this kind of error? In my case, it looks like there is a communication issue between GPU:0 and the other three GPUs. Any help is much appreciated. Thank you!
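
A hedged sketch of one possible workaround, assuming a standard tf.keras model (the optimizer, loss, and dataset names below are placeholders, and jit_compile is the Model.compile argument available in recent TF releases):

# Explicitly disable XLA for the Keras training loop; `model` is assumed to be
# an existing tf.keras.Model built inside the MirroredStrategy scope.
model.compile(optimizer="adam", loss="mse", jit_compile=False)
model.fit(train_dataset, epochs=10)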

dalalkrish avatar Dec 01 '23 19:12 dalalkrish