Using multiple GPUs for training on a single machine?
I am trying to train a simple 34-layer ResNet model on the ImageNet dataset on a machine with multiple GPU cards (all V100s). I am using Lingvo 0.6.2 synced from GitHub, with TF 2.1.0 on Ubuntu 18.04.
I initially tried this setting: "--mode=sync --worker_gpus=2 --worker_split_size=2". From nvidia-smi I can see that both GPUs are used by Lingvo, but I am getting exactly the same speed as on a single GPU. So it basically worked, but with no performance gain. I also checked the other thread and saw that the CPUs are fairly idle, so it is definitely not bottlenecked on CPU tasks such as reading examples.
Then I tried this setting: "--mode=async --worker_gpus=2 --worker_split_size=2", i.e. I switched to async training, and I got the following error at the end. Checking the source code, it seems that for async mode I would have to set up each job myself?
This is my first time using Lingvo, and I hope you can offer some help here.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1855, in
tl;dr you want worker_split_size=1
A split is one minibatch of inputs. With worker_split_size=2 you are saying that each minibatch should be given two GPUs, but in that case the model you are using needs to explicitly set the devices to take advantage of both GPUs.
An example is the bidirectional RNN, where the forward and backward RNNs are placed on different devices: https://github.com/tensorflow/lingvo/blob/ac6adce5ba868e45f115781bb74001db26ce0195/lingvo/core/rnn_layers.py#L476
The number of minibatches being processed concurrently will be worker_replicas * worker_gpus / worker_split_size. With worker_replicas=1, worker_gpus=2 and worker_split_size=2 that is one minibatch at a time, which is why you saw single-GPU speed; with worker_split_size=1 it becomes two.
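To illustrate what "explicitly set the devices" means, here is a minimal sketch in plain TF (not Lingvo's actual layer API): within one split, the model itself pins different sub-computations to different GPUs, for example:

import tensorflow as tf

def TwoGpuSplit(inputs, w_a, w_b):
  # Model parallelism within a single split: each sub-computation is
  # pinned to its own device, so one minibatch occupies both GPUs.
  with tf.device('/gpu:0'):
    part_a = tf.matmul(inputs, w_a)
  with tf.device('/gpu:1'):
    part_b = tf.matmul(inputs, w_b)
  return tf.concat([part_a, part_b], axis=-1)

Without such placement, a split assigned two GPUs still runs its whole graph on one of them, which would match the single-GPU speed you observed.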
Thanks for the response.
I tried the worker_split_size=1 trick.
To be specific, my whole command line is as follows:
python -m lingvo.trainer \
  --logdir=$LOG_DIR \
  --model=imagenet.Imagenet \
  --resnet_depth=34 \
  --run_locally=gpu \
  --tfrecord_pattern=$TFRECORD \
  --mode=async \
  --worker_gpus=2 \
  --worker_split_size=1
I ran this with both mode=sync and mode=async (with worker_split_size=1 in both cases).
When "mode=sync, worker_gpus=2, worker_split_size=1", I ran into this issue:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1619, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index 0 of dimension 0 out of bounds. for 'strided_slice' (op: 'StridedSlice') with input shapes: [0], [1], [1], [1] and with computed input tensors: input[1] = <0>, input[2] = <1>, input[3] = <1>.
When "mode=async, worker_gpus=2, worker_split_size=1", I ran into the same issue as before:
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1855, in
For the sync case, can you paste more of the error (e.g. where the slice operation is defined)?
For the async case, I think there is a bug here
https://github.com/tensorflow/lingvo/blob/1aba0c93ae9592af88e43460aad9d19fa5b87e5a/lingvo/trainer.py#L1677
It should be:
elif FLAGS.mode == 'async':
  FLAGS.job = 'controller,trainer'
else:
  FLAGS.job = 'controller,trainer_client'
You can try to fix this locally.
I will apply your async patch ASAP. Meanwhile, here is the full error log for "mode=sync":
I can confirm that after patching the mode=async code path, it ends up with the same error as mode=sync above.
It looks like it's trying to split the input batch onto the two devices and failing.
Can you make sure that everything in the input batch being returned from your input generator has a leading batch dimension?
One way to check is to add the following at https://github.com/tensorflow/lingvo/blob/1aba0c93ae9592af88e43460aad9d19fa5b87e5a/lingvo/core/base_input_generator.py#L328:
for k, v in batch.FlattenItems():
  print(k, v.shape)
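A stricter variant of this check (a hypothetical helper, not part of Lingvo) that fails loudly on the offending entry. Assuming the splitter infers the batch size as tf.shape(v)[0], a rank-0 entry is exactly what yields the "slice index 0 of dimension 0 out of bounds" error above, because tf.shape of a scalar is an empty vector:

import tensorflow as tf

def CheckLeadingBatchDim(batch):
  # Every entry of the input batch needs a leading batch dimension,
  # otherwise the trainer cannot slice it along dim 0 to spread it
  # across worker GPUs.
  for k, v in batch.FlattenItems():
    v = tf.convert_to_tensor(v)
    if v.shape.rank == 0:
      raise ValueError('%s is a scalar and has no batch dimension' % k)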
Thank you so much. After printing these tensors, I figured out that I have the following offender:
def InputBatch(self):
  batch = py_utils.NestedMap()
  batch.bucket_keys = 1  # commenting this out will fix it
  batch.rgb = self._rgb
  batch.label = self._label
  return batch
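For completeness, a sketch of a fixed version, assuming bucket_keys is actually needed downstream (self._batch_size is a hypothetical attribute holding the per-step batch size):

import tensorflow as tf
from lingvo.core import py_utils

def InputBatch(self):
  batch = py_utils.NestedMap()
  # Give bucket_keys a leading batch dimension so it can be split across
  # GPUs along dim 0 like every other entry in the batch.
  batch.bucket_keys = tf.fill([self._batch_size], 1)
  batch.rgb = self._rgb
  batch.label = self._label
  return batch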