SSD does not converge on multi-nodes using horovod

Open YangFei1990 opened this issue 6 years ago • 5 comments

I was running SSD on 4 nodes with 8 GPUs each. Here is the command I ran (I had trouble formatting the `mpirun` line):

mpirun \
-np 32 \
-hostfile hosts4 \
-bind-to none -map-by slot -x NCCL_MIN_NRINGS=8 \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-x PYTHONPATH="/home/ubuntu/src/cntk/bindings/python:/home/ubuntu/models:/home/ubuntu/models/research:/home/ubuntu/coco/cocoapi/PythonAPI" \
-mca pml ob1 -mca btl ^openib \
-mca btl_tcp_if_exclude lo,docker0 -x NCCL_SOCKET_IFNAME=ens3 \
-mca plm_rsh_no_tree_spawn 1 -x TF_CPP_MIN_LOG_LEVEL=0 \
 python $HOME/${TF_BENCHMARKS_DIR}/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model=ssd300 \
--data_name=coco \
--data_dir=/home/ubuntu/coco \
--backbone_model_path=${TF_BENCHMARKS_DIR}/scripts/tf_cnn_benchmarks/resnet34_checkpoint/model.ckpt-28152 \
--optimizer=momentum \
--momentum=0.9 \
--weight_decay=5e-4 \
--num_gpus=1 \
--batch_size=32 \
--use_fp16 \
--num_epochs=80 \
--num_eval_epochs=1.9 \
--num_warmup_batches=0 \
--eval_during_training_at_specified_steps='2500,3500,5000,7500,12500,12707,15000' \
--variable_update=horovod \
--gradient_repacking=2 \
--stop_at_top_1_accuracy=0.212 \
--ml_perf_compliance_logging \
--loss_type_to_report=base_loss  \
--single_l2_loss_op \
--compute_lr_on_cpu \
--xla=True  \
--datasets_num_private_threads=8 \
--fp16_loss_scale=256.

The final accuracy is less than 0.2, below the 0.212 target passed to --stop_at_top_1_accuracy. I can converge to the ml-perf standard successfully on a single node using both horovod and replicated. Please let me know if you need any further information.

YangFei1990 • Apr 10 '19 01:04
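
For reference, the arithmetic implied by the command in the opening post can be sketched as below. This is only an illustration, not a diagnosis: it assumes `--batch_size=32` applies per process (one GPU each, given `--num_gpus=1`), that the single-node run used 8 processes with the same per-process batch size, and that the step list was chosen for that single-node setup. None of these assumptions is confirmed in the thread, and the helper names are hypothetical.

```python
# Sketch: global batch size and step scaling implied by the mpirun command.
# Assumptions (not confirmed in the thread): batch_size is per process, and
# the single-node run used 8 processes with the same per-process batch size.

def global_batch_size(num_processes, per_process_batch):
    """Total number of samples consumed per training step across all processes."""
    return num_processes * per_process_batch

single_node = global_batch_size(8, 32)    # 8 GPUs on one node
multi_node = global_batch_size(32, 32)    # -np 32 across 4 nodes

# Step list taken verbatim from --eval_during_training_at_specified_steps.
eval_steps = [2500, 3500, 5000, 7500, 12500, 12707, 15000]

# Steps at which the multi-node run has seen the same number of samples
# as the single-node run at each listed step (rounded down).
scaled_steps = [s * single_node // multi_node for s in eval_steps]

print("single-node global batch:", single_node)
print("multi-node global batch:", multi_node)
print("sample-equivalent steps:", scaled_steps)
```

Under these assumptions each multi-node step consumes four times as many samples as a single-node step, so any step-based setting (evaluation points, and possibly learning-rate schedules) lines up with a different point in training than it does on one node.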

We did not use Horovod for MLPerf and do not test it internally. You would need to ask the Uber/Horovod team. @reedwm did have a fix for eval, but I doubt that impacts your situation.

tfboyd • Apr 10 '19 01:04

@tfboyd Thanks! Do you have a run script for multi-node training that does not use Horovod? Also, can you let me know what was fixed for eval? I am not sure whether my code includes that fix.

YangFei1990 • Apr 10 '19 01:04

We are not currently running it multi-node. Not a great answer but I did not want to leave you without some answer and wondering. @reedwm would have to let you know on the eval to be sure.

tfboyd • Apr 10 '19 16:04

@YangFei1990 can you share what options you are using to run mlperf with TF Distribution Strategy and Horovod?

I can converge to the ml-perf standard successfully on a single node using both horovod and replicated.

den-run-ai • Jul 31 '19 17:07

Hello, I want to use this script to run ssd-resnet34, but I can't find the resnet34 backbone model (the original URL is invalid). Can you share the backbone model with me?

wsl-inspur • Dec 22 '21 08:12