seed_rl
How can I train it on multiple GPUs? I have tried changing OneDeviceStrategy to MirroredStrategy here:
https://github.com/google-research/seed_rl/blob/eff7aaa7ab5843547fbf383fcc747b7a8ca67632/common/utils.py#L58
But I get the ValueError below.
Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to`tf.distribute.ReduceOp` type

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 468, in _apply_op_helper
    preferred_dtype=default_dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1314, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1381, in _tensor_conversion_sync_on_read
    return var._dense_var_to_tensor(dtype=dtype, name=name, as_ref=as_ref)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1371, in _dense_var_to_tensor
    self.get(), dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 322, in get
    return self._get_cross_replica()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1346, in _get_cross_replica
    reduce_util.ReduceOp.from_variable_aggregation(self.aggregation),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/reduce_util.py", line 50, in from_variable_aggregation
    "`tf.distribute.ReduceOp` type" % aggregation)
ValueError: Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to`tf.distribute.ReduceOp` type
But when I apply clip_by_global_norm to temp_grads, the error disappears. I'm not sure whether this change introduces a synchronization risk.

def apply_gradients(_):
  clip_grads, _ = tf.clip_by_global_norm(temp_grads, 40)
  optimizer.apply_gradients(zip(clip_grads, agent.trainable_variables))
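My guess (an assumption on my side, not verified against the tf.distribute internals) is that clipping helps because tf.clip_by_global_norm returns plain tensors, so apply_gradients never has to convert the SyncOnRead temp_grads variables across replicas. A standalone check, unrelated to seed_rl itself:

import tensorflow as tf

v = tf.Variable([1.0, 2.0])                  # stand-in for one temp_grads entry
clipped, _ = tf.clip_by_global_norm([v], 40)
print(type(v).__name__, type(clipped[0]).__name__)  # Variable in, plain Tensor out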
One needs to do something similar to this part of the code to run with multiple GPUs. https://github.com/google-research/seed_rl/blob/eff7aaa7ab5843547fbf383fcc747b7a8ca67632/common/utils.py#L42-L52
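Very roughly, the idea for GPUs would be something like the sketch below (untested, not the actual change needed; it just mirrors the TPU-style split of devices into an inference set and a training strategy):

import tensorflow as tf

gpus = tf.config.experimental.list_logical_devices('GPU')
if len(gpus) >= 2:
  # First GPU serves inference, the rest are mirrored for training.
  inference_devices = [gpus[0].name]
  training_strategy = tf.distribute.MirroredStrategy(
      devices=[g.name for g in gpus[1:]])
else:
  # Fall back to a single device doing both inference and training.
  device = gpus[0].name if gpus else '/device:CPU:0'
  inference_devices = [device]
  training_strategy = tf.distribute.OneDeviceStrategy(device=device)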
Can you show me the exact change you made to the code?
Here is my change:
https://github.com/Da-Capo/seed_rl/commit/a05ca9bd2daf26bccb020d53f356833b48c78783
If I use temp_grads, I get a ValueError, but clip_grads works well, so I wonder whether clip_grads is synchronized correctly.

def apply_gradients(_):
  optimizer.apply_gradients(zip(temp_grads, agent.trainable_variables))
ValueError: Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to`tf.distribute.ReduceOp` type (same traceback as above)
Can you try:

def apply_gradients(_):
  optimizer.apply_gradients(zip([g + 0 for g in temp_grads], agent.trainable_variables))

or

def apply_gradients(_):
  optimizer.apply_gradients(zip([g.read_value() for g in temp_grads], agent.trainable_variables))
One needs to do something similar to this part of the code to run with multiple GPUs.
https://github.com/google-research/seed_rl/blob/eff7aaa7ab5843547fbf383fcc747b7a8ca67632/common/utils.py#L42-L52
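For reference, here is a self-contained toy version of where those reads would sit; it is my own sketch with stand-ins for agent, optimizer and the temp_grads buffers, not the seed_rl code:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  weights = [tf.Variable(tf.zeros([4]))]  # stand-in for agent.trainable_variables
  temp_grads = [tf.Variable(tf.zeros([4]), trainable=False,
                            synchronization=tf.VariableSynchronization.ON_READ)]
  optimizer = tf.keras.optimizers.SGD(0.1)

@tf.function
def update():
  def apply_gradients(_):
    # Reading the ON_READ buffers inside the replica context yields plain tensors,
    # so apply_gradients never converts the SyncOnRead variables cross-replica.
    grads = [g.read_value() for g in temp_grads]
    optimizer.apply_gradients(zip(grads, weights))
  strategy.experimental_run_v2(apply_gradients, args=(0,))

update()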
@lespeholt I tried a similar thing to train on multiple GPUs. It turns out that with MirroredStrategy we can't split the devices into inference devices and training devices as with TPU. Do you think that is the correct behaviour? By the way, it works when using a single strategy that uses all the devices.
temp_grads2 = [g + 0 for g in temp_grads]
Because temp_grads are tf.VariableSynchronization.ON_READ variables, this operation should trigger an on-read event, and temp_grads2 should be the aggregated value across the replicas on the different devices. But in fact temp_grads2 holds the same value as temp_grads when I check with tf.print. Why?
Thank you for your answer, @lespeholt.
Within experimental_run_v2 this shouldn't trigger a synchronization; only outside experimental_run_v2.
"we can't split the devices into inference devices and training devices as with TPU" — what do you mean by "we can't"? What problems are you running into?
This code https://github.com/google-research/seed_rl/blob/eff7aaa7ab5843547fbf383fcc747b7a8ca67632/agents/sac/learner.py#L540 seems not to work with MirroredStrategy with several GPUs if you split your GPUs into inference devices and training devices as with TPU.
Sorry, it's still not clear to me what you mean with not working. That line should work just fine with multiple GPUs.
Indeed, I was not clear. That line works fine with GPU. The issue I encountered is when I tried to do a configuration similar to the one defined here https://github.com/google-research/seed_rl/blob/135e5612c019cf5446dc5d57105088ce17c5357c/common/utils.py#L41, where the TPUs are grouped into inference and training. I began by declaring two different MirroredStrategy instances, but that does not seem to be the right way to do it. By the way, I have an example of multi-GPU training with one single MirroredStrategy that works well, and I hope I can share it soon.
@jrabary Is there any update on the multi-GPU inference + training strategies? Do you have an example of getting it to work correctly?
I'm running into a few issues myself and I'd love to see a working version of it.
@brieyla1 With the fix to the gradients from above applied:

  optimizer.apply_gradients(zip([g.read_value() for g in temp_grads], agent.trainable_variables))

the following should suffice to run something multi-GPU with inference on a separate device:
device_name = any_gpu[0].name if any_gpu else '/device:CPU:0'
num_gpus = len(any_gpu)
if num_gpus < 2:
  # A single GPU or the CPU does both inference and training.
  strategy = tf.distribute.OneDeviceStrategy(device=device_name)
elif num_gpus == 2:
  # One GPU for inference, one for training.
  # Perhaps not the wisest choice and better to mingle; benchmark if in doubt.
  strategy = tf.distribute.OneDeviceStrategy(device=any_gpu[1].name)
else:
  # One GPU for inference, the rest data-parallel for training.
  strategy = tf.distribute.MirroredStrategy(devices=[d.name for d in any_gpu[1:]])
As a replacement for: https://github.com/google-research/seed_rl/blob/5f07ba2a072c7a562070b5a0b3574b86cd72980f/common/utils.py#L102-L103
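A quick sanity check (my own addition, not part of the snippet above, assuming `strategy` and `device_name` from it are in scope): confirm which devices the chosen strategy actually covers before wiring it into the learner.

print('replicas in sync:', strategy.num_replicas_in_sync)
print('training devices:', strategy.extended.worker_devices)
print('inference (or only) device:', device_name)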
EDIT: Having looked through the code, I also believe one should rename num_training_tpus to num_training_devices and set it accordingly for GPUs as well, as it seems to play a role in other places, such as splitting the training batch between devices. A generic sketch of that kind of split follows below.
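To make the batch-splitting point concrete, here is the kind of per-device split such a num_training_devices value would control (my own illustration with made-up shapes, not the actual seed_rl code):

import tensorflow as tf

num_training_devices = 3                   # e.g. len(any_gpu) - 1 in the setup above
host_batch = tf.zeros([48, 84, 84, 4])     # hypothetical [batch, height, width, channels] batch
# Each training device gets batch_size / num_training_devices examples.
per_device_batches = tf.split(host_batch, num_training_devices, axis=0)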
@lespeholt Could you please explain why the problem with gradients occurs?
I want to run the code on multiple machines with multiple GPUs, and I would like to know how to modify the code for that.
My attempt: use tf.distribute.experimental.MultiWorkerMirroredStrategy().

On 10.17.8.112, ~/common/utils.py:

import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['10.17.8.112:6000', '10.17.8.109:6000']},
    'task': {'type': 'worker', 'index': 0}
})
multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)
..........
def init_learner_multi_host(num_training_tpus: int):
  .......
  else:
    strategy = multi_strategy
    enc = lambda x: x
    dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
    return MultiHostSettings(strategy, [('/cpu', [device_name])], strategy, enc, dec)

On 10.17.8.109, ~/common/utils.py: identical, except the task index is 1:

os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['10.17.8.112:6000', '10.17.8.109:6000']},
    'task': {'type': 'worker', 'index': 1}
})

Error: Unknown: Could not start gRPC server.
Version: multi-host, multiple GPUs, with the CPU for inference and the GPUs for training. Other details:

import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['10.17.8.109:6001', '10.17.8.112:6001']},
    'task': {'type': 'worker', 'index': 0}
})

def init_learner_multi_host(num_training_tpus: int):
  ...
  multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
  ...
  tf.device('/cpu').__enter__()
  device_name = '/device:CPU:0'
  strategy1 = tf.distribute.OneDeviceStrategy(device=device_name)
  strategy2 = multi_strategy
  enc = lambda x: x
  dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
  return MultiHostSettings(strategy1, [('/cpu', [device_name])], strategy2, enc, dec)

This works!
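For completeness: the only thing that changes on the other host (10.17.8.112) in this setup is the task index in TF_CONFIG, presumably index 1 as in the earlier attempt. A small sketch of parameterizing it, assuming a WORKER_INDEX environment variable that you set per machine (that variable name is my own convention, not part of seed_rl):

import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['10.17.8.109:6001', '10.17.8.112:6001']},
    # WORKER_INDEX is a hypothetical per-machine env var: 0 on 10.17.8.109, 1 on 10.17.8.112.
    'task': {'type': 'worker', 'index': int(os.environ.get('WORKER_INDEX', '0'))},
})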