benchmarks
SSD does not converge on multiple nodes using Horovod
I was running SSD on 4 nodes, each with 8 GPUs. Here is the command I used:
```shell
mpirun \
  -np 32 \
  -hostfile hosts4 \
  -bind-to none -map-by slot -x NCCL_MIN_NRINGS=8 \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
  -x PYTHONPATH="/home/ubuntu/src/cntk/bindings/python:/home/ubuntu/models:/home/ubuntu/models/research:/home/ubuntu/coco/cocoapi/PythonAPI" \
  -mca pml ob1 -mca btl ^openib \
  -mca btl_tcp_if_exclude lo,docker0 -x NCCL_SOCKET_IFNAME=ens3 \
  -mca plm_rsh_no_tree_spawn 1 -x TF_CPP_MIN_LOG_LEVEL=0 \
  python $HOME/${TF_BENCHMARKS_DIR}/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --model=ssd300 \
    --data_name=coco \
    --data_dir=/home/ubuntu/coco \
    --backbone_model_path=${TF_BENCHMARKS_DIR}/scripts/tf_cnn_benchmarks/resnet34_checkpoint/model.ckpt-28152 \
    --optimizer=momentum \
    --momentum=0.9 \
    --weight_decay=5e-4 \
    --num_gpus=1 \
    --batch_size=32 \
    --use_fp16 \
    --num_epochs=80 \
    --num_eval_epochs=1.9 \
    --num_warmup_batches=0 \
    --eval_during_training_at_specified_steps='2500,3500,5000,7500,12500,12707,15000' \
    --variable_update=horovod \
    --gradient_repacking=2 \
    --stop_at_top_1_accuracy=0.212 \
    --ml_perf_compliance_logging \
    --loss_type_to_report=base_loss \
    --single_l2_loss_op \
    --compute_lr_on_cpu \
    --xla=True \
    --datasets_num_private_threads=8 \
    --fp16_loss_scale=256.
```
The final accuracy is less than 0.2. On a single node, I can converge to the MLPerf target successfully using both horovod and replicated. Please let me know if you need any further information.
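For reference, with `--batch_size=32` per GPU and 32 MPI ranks, the flags above imply a much larger global batch than the single-node run. A quick sanity check of the effective batch size, and of the linear learning-rate scaling rule that is commonly applied when the worker count grows (the base LR value below is illustrative, not taken from the benchmark config):

```python
# Sanity-check the effective global batch size and a linearly scaled
# learning rate when going from 1 node (8 GPUs) to 4 nodes (32 GPUs).

def global_batch_size(per_gpu_batch: int, num_workers: int) -> int:
    """Each Horovod rank processes its own per-GPU batch."""
    return per_gpu_batch * num_workers

def linear_scaled_lr(base_lr: float, base_workers: int, num_workers: int) -> float:
    """Linear scaling rule: grow the LR proportionally with worker count."""
    return base_lr * num_workers / base_workers

# 4 nodes x 8 GPUs, --batch_size=32 per GPU (from the command above)
print(global_batch_size(32, 32))        # 1024
print(linear_scaled_lr(2.5e-3, 8, 32))  # 0.01
```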
We did not use Horovod for MLPerf and do not test it internally; you would need to ask the Uber/Horovod team. @reedwm did have a fix for eval, but I doubt that affects your situation.
@tfboyd Thanks! Do you have a run script for multi-node without using Horovod? Also, can you tell me what was fixed for eval? I'm not sure whether my code includes that fix.
We are not currently running it multi-node. That is not a great answer, but I did not want to leave you wondering without any reply. @reedwm would have to confirm the details of the eval fix.
@YangFei1990 can you share what options you are using to run mlperf with TF Distribution Strategy and Horovod?
Hello, I want to use this script to run ssd-resnet34, but I can't find the resnet34 backbone model (the original URL is invalid). Could you share the backbone model with me?