
How can I train it on multiple GPUs?

Da-Capo opened this issue 4 years ago · 13 comments

I have tried changing OneDeviceStrategy to MirroredStrategy. https://github.com/google-research/seed_rl/blob/eff7aaa7ab5843547fbf383fcc747b7a8ca67632/common/utils.py#L58

But I get the ValueError below.

Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to `tf.distribute.ReduceOp` type
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 468, in _apply_op_helper
    preferred_dtype=default_dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1314, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1381, in _tensor_conversion_sync_on_read
    return var._dense_var_to_tensor(dtype=dtype, name=name, as_ref=as_ref)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1371, in _dense_var_to_tensor
    self.get(), dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 322, in get
    return self._get_cross_replica()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1346, in _get_cross_replica
    reduce_util.ReduceOp.from_variable_aggregation(self.aggregation),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/reduce_util.py", line 50, in from_variable_aggregation
    "`tf.distribute.ReduceOp` type" % aggregation)
ValueError: Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to `tf.distribute.ReduceOp` type

But when I apply clip_norm to temp_grads, the error disappears. I'm not sure whether this change introduces a synchronization risk.

def apply_gradients(_):
  clip_grads, _ = tf.clip_by_global_norm(temp_grads, 40)
  optimizer.apply_gradients(zip(clip_grads, agent.trainable_variables))

Da-Capo · Apr 17 '20

One needs to do something similar to this part of the code to run with multiple GPUs. https://github.com/google-research/seed_rl/blob/eff7aaa7ab5843547fbf383fcc747b7a8ca67632/common/utils.py#L42-L52
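For illustration, a rough sketch of what that swap might look like (a hedged sketch, not the repo's actual multi-GPU code; how it plugs into the surrounding init_learner wiring is omitted):

    gpus = tf.config.experimental.list_logical_devices('GPU')
    if len(gpus) > 1:
      # Mirror the training step across all visible GPUs.
      training_strategy = tf.distribute.MirroredStrategy(
          devices=[d.name for d in gpus])
    else:
      # Fall back to a single device (GPU if present, otherwise CPU).
      device_name = gpus[0].name if gpus else '/device:CPU:0'
      training_strategy = tf.distribute.OneDeviceStrategy(device=device_name)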

Can you show me the exact change you did to the code?

lespeholt · Apr 30 '20

Here is my change: https://github.com/Da-Capo/seed_rl/commit/a05ca9bd2daf26bccb020d53f356833b48c78783

If I use temp_grads directly, I get a ValueError, but clip_grads works well, so I wonder whether clip_grads is synchronized correctly.

def apply_gradients(_):
  optimizer.apply_gradients(zip(temp_grads, agent.trainable_variables))
ValueError: Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to `tf.distribute.ReduceOp` type

(same traceback as in my first comment above)

Da-Capo · Apr 30 '20

Can you try:

def apply_gradients(_):
  optimizer.apply_gradients(zip([g + 0 for g in temp_grads], agent.trainable_variables))

or

def apply_gradients(_):
  optimizer.apply_gradients(zip([g.read_value() for g in temp_grads], agent.trainable_variables))

lespeholt · Apr 30 '20

One needs to do something similar to this part of the code to run with multiple GPUs.

https://github.com/google-research/seed_rl/blob/eff7aaa7ab5843547fbf383fcc747b7a8ca67632/common/utils.py#L42-L52

@lespeholt I tried a similar thing to train on multiple GPUs. It turns out that with MirroredStrategy we can't split the devices into inference devices and training devices as with TPUs. Do you think this is the correct behaviour? By the way, it works when using a single strategy that uses all the devices.

jrabary · May 04 '20

temp_grads2 = [g + 0 for g in temp_grads]: because temp_grads use tf.VariableSynchronization.ON_READ, this operation should trigger the on-read aggregation, so temp_grads2 should be the aggregated value across all replicas on different devices. But in fact temp_grads2 has the same value as temp_grads when I inspect it with tf.print. Why? Thank you for your answer @lespeholt

1576012404 · May 04 '20

Within experimental_run_v2 this shouldn't trigger a synchronization; it only does so outside experimental_run_v2.
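A minimal standalone sketch of that behaviour (illustrative TF 2.x code, not from seed_rl; with aggregation NONE the final cross-replica read is exactly what raises the ValueError above):

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
      # Analogue of temp_grads: a SyncOnRead variable, here with SUM aggregation.
      v = tf.Variable(0.,
                      synchronization=tf.VariableSynchronization.ON_READ,
                      aggregation=tf.VariableAggregation.SUM)

    @tf.function
    def step():
      def replica_fn():
        v.assign_add(1.)
        return v + 0  # inside experimental_run_v2: per-replica read, no aggregation
      return strategy.experimental_run_v2(replica_fn)

    step()
    # Outside experimental_run_v2 (cross-replica context) the read aggregates the
    # variable across replicas; with VariableAggregation.NONE this read fails.
    print(v.read_value())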

"we can't split the devices into inference devices and training devices as with TPU" what do you mean with "we can't"? what problems are you running into?

lespeholt · May 05 '20

Within experimental_run_v2 this shouldn't trigger a synchronization; it only does so outside experimental_run_v2.

"we can't split the devices into inference devices and training devices as with TPU": what do you mean by "we can't"? What problems are you running into?

This code https://github.com/google-research/seed_rl/blob/eff7aaa7ab5843547fbf383fcc747b7a8ca67632/agents/sac/learner.py#L540 seems not to work with MirroredStrategy on several GPUs if you split your GPUs into inference devices and training devices as with TPUs.

jrabary · Jun 02 '20

Sorry, it's still not clear to me what you mean by "not working". That line should work just fine with multiple GPUs.

lespeholt · Jun 03 '20

Sorry, it's still not clear to me what you mean by "not working". That line should work just fine with multiple GPUs.

Indeed, I was not clear. That line works fine with a GPU. The issue I encountered is when I tried a configuration similar to the one defined here https://github.com/google-research/seed_rl/blob/135e5612c019cf5446dc5d57105088ce17c5357c/common/utils.py#L41, where the TPUs are split into inference and training groups. I began by declaring two different MirroredStrategy instances, but that does not seem to be the right way to do it. By the way, I have an example of multi-GPU training with a single MirroredStrategy that works well, and I hope I can share it soon.

jrabary · Jun 25 '20

@jrabary is there any update on the multi-GPU inference + training strategies? Do you have any example of getting it to work correctly?

I'm running into a few issues myself and I'd love to see a working version of it.

brieyla1 · Oct 08 '20

@brieyla1 After applying the fix to the gradients from above:

optimizer.apply_gradients(zip([g.read_value() for g in temp_grads], agent.trainable_variables))

The following should suffice to run multi-GPU with inference on a separate device:

    device_name = any_gpu[0].name if any_gpu else '/device:CPU:0'

    num_gpus = len(any_gpu)
    if num_gpus < 2:
      # A single GPU (or the CPU) handles both inference and training.
      strategy = tf.distribute.OneDeviceStrategy(device=device_name)
    elif num_gpus == 2:
      # One GPU for inference, one for training.
      # Perhaps not the wisest choice and better to mingle; benchmark if in doubt.
      strategy = tf.distribute.OneDeviceStrategy(device=any_gpu[1].name)
    else:
      # One GPU for inference, the rest mirrored (data-parallel) for training.
      strategy = tf.distribute.MirroredStrategy(devices=[d.name for d in any_gpu[1:]])

As a replacement for: https://github.com/google-research/seed_rl/blob/5f07ba2a072c7a562070b5a0b3574b86cd72980f/common/utils.py#L102-L103

EDIT: having looked through the code, I also believe one should rename num_training_tpus to num_training_devices and set it accordingly for GPUs as well, since it seems to play a role in other places, such as splitting the training batch between devices.
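For illustration only, the kind of dependency I mean (a hedged sketch with made-up names, not the repo's actual code):

    # Hypothetical: the per-device batch size is derived from the training
    # device count, so the count must cover GPUs too, not only TPU cores.
    num_training_devices = training_strategy.num_replicas_in_sync
    per_device_batch_size = batch_size // num_training_devices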

@lespeholt Could you please explain why the problem with gradients occurs?

Antymon · Dec 01 '20

I want to run the code on multiple machines with multiple GPUs, and I want to know how to modify the code for this. My try: use tf.distribute.experimental.MultiWorkerMirroredStrategy().

On 10.17.8.112, in ~/common/utils.py:

    import os, json
    os.environ['TF_CONFIG'] = json.dumps({
        'cluster': {'worker': ['10.17.8.112:6000', '10.17.8.109:6000']},
        'task': {'type': 'worker', 'index': 0},
    })

    multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
        tf.distribute.experimental.CollectiveCommunication.NCCL)

    ...

    def init_learner_multi_host(num_training_tpus: int):
      ...
      else:
        strategy = multi_strategy
        enc = lambda x: x
        dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
        return MultiHostSettings(
            strategy, [('/cpu', [device_name])], strategy, enc, dec)

On 10.17.8.109, in ~/common/utils.py:

    (same as on 10.17.8.112, except 'task': {'type': 'worker', 'index': 1} in TF_CONFIG)

Error: Unknown: Could not start gRPC server.

giantvision · Dec 29 '20

Following up on my previous attempt (quoted above):

Version: multi-host with multiple GPUs; CPU for inference, GPUs for training.

Other details:

    import os, json
    os.environ['TF_CONFIG'] = json.dumps({
        'cluster': {'worker': ['10.17.8.109:6001', '10.17.8.112:6001']},
        'task': {'type': 'worker', 'index': 0},
    })

    def init_learner_multi_host(num_training_tpus: int):
      ...
      multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
      ...
      tf.device('/cpu').__enter__()
      device_name = '/device:CPU:0'
      strategy1 = tf.distribute.OneDeviceStrategy(device=device_name)
      strategy2 = multi_strategy
      enc = lambda x: x
      dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
      return MultiHostSettings(strategy1, [('/cpu', [device_name])], strategy2, enc, dec)

This works!

giantvision · Jan 06 '21