Performance on GPUs and multiple GPU support

Open nict-wisdom opened this issue 4 years ago • 12 comments

We tried to run Mesh-TensorFlow to train T5 on GPUs following the instructions on T5's repository, but the training is extremely slow.

global_step/sec: 0.0467347 examples/sec: 0.186939

The training script successfully detected GPUs (showing "Adding visible gpu devices: ..."), but most of the computation seems to run on the CPU. With log_device_placement enabled, we can see many operators on both CPUs and GPUs. ProfilerHook showed that both are actually used, but I couldn't tell whether this behavior is expected.
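For anyone else trying to reproduce this, a minimal sketch of how the two diagnostics can be wired into a TF1-style Estimator run (the output directory name is a placeholder; the hook and config options are the standard ones from tf.compat.v1):

```python
import tensorflow.compat.v1 as tf

# Log which device (CPU/GPU) each op is placed on.
session_config = tf.ConfigProto(log_device_placement=True)
run_config = tf.estimator.RunConfig(session_config=session_config)

# Dump a Chrome-trace timeline every 100 steps; open the resulting
# .json files in chrome://tracing to see per-device op timing.
profiler_hook = tf.train.ProfilerHook(save_steps=100, output_dir="./profile")

# estimator = tf.estimator.Estimator(model_fn, config=run_config)
# estimator.train(input_fn, hooks=[profiler_hook])
```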

I am wondering whether Mesh-TensorFlow can run on GPUs at a practical speed. I found an issue mentioning a similar problem, but it was closed without an answer (#35).

I also could not find reliable documentation about training on multiple GPUs. An existing issue, #20, asks the same question, but no answer was given.

I would appreciate it if someone could give us any information regarding the questions above.

nict-wisdom avatar Mar 27 '20 01:03 nict-wisdom

Facing the same issue.

mcompute avatar Mar 27 '20 13:03 mcompute

Facing the same issue. Can someone share an answer for this?

LiweiPeng avatar May 15 '20 22:05 LiweiPeng

Also seeing this issue. Monitoring GPU usage shows that only one GPU is being utilized when running BERT.

knagrecha avatar May 18 '20 05:05 knagrecha

The current MNIST example uses just a single GPU on AMD/ROCm platforms.

xdgarrido avatar Jun 02 '20 17:06 xdgarrido

I can run the MNIST example on a GPU; it does not appear to be utilizing CPU resources. However, when using 4 GPUs, only the first device is actually utilized.

Hopefully we can get a developer response on this... ~I can't see what would need to be modified in mnist.py to make distributed GPU training work.~

EDIT: specifying your devices by name, ['gpu:0', 'gpu:1', 'gpu:2'], instead of [''] * mesh_size solves the problem for me
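For reference, a minimal sketch of that workaround (the device names are placeholders and must match the GPUs actually visible on your machine; mesh_size here is just the number of processors in the mesh):

```python
mesh_size = 4  # number of GPUs in the mesh

# Empty device names leave placement to TensorFlow, which tends to put
# everything on a single device:
default_devices = [""] * mesh_size

# Naming each GPU explicitly spreads the mesh across all of them:
gpu_devices = ["gpu:%d" % i for i in range(mesh_size)]
```

The resulting list is what gets passed as the `devices` argument when constructing the mesh implementation.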

PSZehnder avatar Jul 13 '20 18:07 PSZehnder

@PSZehnder Does Mesh-TensorFlow support multi-node training (i.e. each node has x GPUs attached to it)? I'm using 2 nodes, each with 8 GPUs, and would like to train on all (2 nodes * 8 GPUs) = 16 GPUs. How do I configure Mesh-TensorFlow to train in a multi-node setup?

assij avatar Oct 06 '20 11:10 assij

@nshazeer Does Mesh-TensorFlow support multi-node training (i.e. each node has x GPUs attached to it)? I'm using 2 nodes, each with 8 GPUs, and would like to train on all (2 nodes * 8 GPUs) = 16 GPUs. How do I configure Mesh-TensorFlow to train in a multi-node setup?

assij avatar Oct 07 '20 08:10 assij

Yes, that should be possible, though I haven't done it. The GPU code just relies on device placement, so if you can construct a TF graph that names all 16 GPUs as different devices, it should work...
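Following that suggestion, one way to name all 16 GPUs is to use fully qualified TF device strings spanning both nodes. A sketch (the job name "worker" is an assumption; it must match the cluster spec used when starting the TF server on each node):

```python
num_nodes = 2
gpus_per_node = 8

# One fully qualified TF device name per GPU across both nodes.
devices = [
    "/job:worker/task:%d/device:GPU:%d" % (task, gpu)
    for task in range(num_nodes)
    for gpu in range(gpus_per_node)
]
# Pass this 16-element list to the mesh implementation in place of the
# single-node ["gpu:0", "gpu:1", ...] list.
```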

nshazeer avatar Oct 07 '20 20:10 nshazeer

@nshazeer, thanks for your reply. If I can make the 16 GPUs visible, how will data loading be done in a 2-node * 8-GPU setup? Will the data be loaded through one CPU on node0 (where I run the script, so one CPU sends data to all 16 GPUs), or will data loading be done from both CPUs (node0 and node1), so that each CPU sends the data relevant to the 8 GPUs it is connected to?

assij avatar Oct 07 '20 21:10 assij

@nict-wisdom do you have a snippet showing how you used the ProfilerHook? I am struggling with it a bit at the moment.

zaccharieramzi avatar Dec 02 '20 10:12 zaccharieramzi

Met the same problem. Can anyone on this team reply to this issue?

weberxie avatar Mar 31 '21 08:03 weberxie

We are also facing the same issue. Any help in this context will be highly appreciated.

Conformist101 avatar May 06 '21 18:05 Conformist101