
A problem with distributed training

maoxuepeng opened this issue 6 years ago · 5 comments

sync mode

Server info:

- controller: server0:2222, 8 GPUs
- trainer_client: server1:2222
- worker0: server2:2222, 8 GPUs
- worker1: server3:2222, 8 GPUs
- worker2: server4:2222, 8 GPUs

Following the settings from "run_distributed.py", it doesn't work:

server0# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=controller --task=0 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server1# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=trainer_client --task=0 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server2# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=worker --task=0 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server3# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=worker --task=1 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server4# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=worker --task=2 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log
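For reference, the --cluster_spec value above lists one entry per job, with jobs separated by '@' and the tasks within a job separated by ','. Here is an annotated restatement (the CLUSTER_SPEC variable name is only for illustration, it is not part of the trainer):

```sh
# Annotated breakdown of the --cluster_spec value used in the commands above:
# jobs are separated by '@'; tasks within a job are separated by ','.
CLUSTER_SPEC="controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222"
# controller     -> 1 task (task 0) on server0
# trainer_client -> 1 task (task 0) on server1
# worker         -> 3 tasks (tasks 0, 1, 2) on server2, server3, server4
```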

It seems the training job runs on worker0 only, and GPU utilization is very low. Could you please let me know how to set these arguments in my case?

- --controller_gpus: Number of controller GPUs. (default: '0') (an integer)
- --worker_gpus: Number of GPUs to use per replica. (default: '0') (an integer)
- --worker_replicas: Number of replicas. (default: '1') (an integer)
- --worker_split_size: Number of devices for one split. (default: '1') (an integer)

Thanks a billion!

maoxuepeng · Aug 01 '19 06:08

Sorry, we are aware that run_distributed has some problems, but we don't have the resources to fix it at the moment. If someone is able to create a pull request, that would be greatly appreciated; otherwise, we will take a look when we have time.

jonathanasdf · Aug 01 '19 23:08

> Sorry, we are aware that run_distributed has some problems, but we don't have the resources to fix it at the moment. If someone is able to create a pull request, that would be greatly appreciated; otherwise, we will take a look when we have time.

Thank you for your reply. Let's set run_distributed aside, then. This is actually a problem in "lingvo/trainer.py", right?
Could you please let me know how to set these arguments in my case?

- --controller_gpus: Number of controller GPUs. (default: '0') (an integer)
- --worker_gpus: Number of GPUs to use per replica. (default: '0') (an integer)
- --worker_replicas: Number of replicas. (default: '1') (an integer)
- --worker_split_size: Number of devices for one split. (default: '1') (an integer)

maoxuepeng · Aug 02 '19 01:08

If you have 3 workers with 8 GPUs each, you should set worker_gpus=8 and worker_replicas=3, and leave the rest at their defaults.
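To make that concrete, here is a rough sketch of the worker commands from the original post with those two flags appended (this is an assumption about how the flags combine with the existing command lines, not a tested recipe):

```sh
# Sketch only: 3 worker jobs, 8 GPUs each. Cluster spec, model, and logdir are
# taken from the commands in the original post.
CLUSTER_SPEC="controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222"
COMMON_FLAGS="--cluster_spec=${CLUSTER_SPEC} --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log --worker_gpus=8 --worker_replicas=3"

# Run one of these per machine (the task index distinguishes the workers):
bazel-bin/lingvo/trainer ${COMMON_FLAGS} --job=worker --task=0   # on server2
bazel-bin/lingvo/trainer ${COMMON_FLAGS} --job=worker --task=1   # on server3
bazel-bin/lingvo/trainer ${COMMON_FLAGS} --job=worker --task=2   # on server4
# The controller and trainer_client invocations would stay as in the original
# post, presumably with the same --worker_gpus/--worker_replicas flags added.
```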

jonathanasdf · Aug 02 '19 02:08

@jonathanasdf What is the meaning of worker_gpus and worker_replicas here? Can you share some references to understand this better?

manish-kumar-garg · Jan 09 '20 09:01

It needs to match your physical cluster setup: worker_replicas is the number of training worker jobs you are running, and worker_gpus is the number of GPUs each training worker job uses.
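Applied to the cluster described earlier in this thread (three worker jobs on server2, server3, and server4, with 8 GPUs each), that mapping would look roughly like this (illustrative arithmetic only):

```sh
# Illustrative mapping for the cluster described earlier in this thread.
WORKER_REPLICAS=3   # worker jobs: server2:2222, server3:2222, server4:2222
WORKER_GPUS=8       # GPUs used by each worker job
echo "GPUs used for training: $((WORKER_REPLICAS * WORKER_GPUS))"   # prints 24
```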

jonathanasdf · Jan 09 '20 21:01