insightface
Why is training with 1 machine (TITAN RTX) + 1 machine (RTX 3060) slower than training on either machine alone? The two nodes are launched as follows:
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="192.168.8.131" --master_port=12581 train.py configs/ms1mv2_mbf
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="192.168.8.131" --master_port=12581 train.py configs/ms1mv2_mbf
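As the logs below show, torch.distributed.launch prints a deprecation notice pointing at torch.distributed.run. If you want to avoid those warnings, the same two-node job can presumably be started with the replacement launcher using identical arguments; a sketch (untested here):

python -m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="192.168.8.131" --master_port=12581 train.py configs/ms1mv2_mbf
python -m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="192.168.8.131" --master_port=12581 train.py configs/ms1mv2_mbf

The output from node 0 (rank 0, host pc) follows; the output from node 1 (rank 1, host ubuntu-X10SRA) is further below.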
/home/pc/anaconda3/envs/face19/lib/python3.9/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK')
instead.
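The --use_env warning above only means the worker should read its local rank from the environment rather than from a --local_rank argument. A minimal sketch of that pattern (variable names are mine; recent versions of arcface_torch's train.py may already do this):

import os
import torch
import torch.distributed as dist

# torch.distributed.launch/run export these variables for every worker
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ.get("WORLD_SIZE", "1"))

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")  # MASTER_ADDR/PORT, RANK, WORLD_SIZE come from the launcher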
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 192.168.8.131:12581
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_4a5rychg/none__fkba0g3
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/pc/anaconda3/envs/face19/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=192.168.8.131 master_port=12581 group_rank=0 group_world_size=2 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[2] global_world_sizes=[2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_4a5rychg/none__fkba0g3/attempt_0/0/error.json
0
0
Training: 2022-09-15 11:06:52,012-rank_id: 0
Training: 2022-09-15 11:06:55,830-: margin_list [1.0, 0.5, 0.0]
Training: 2022-09-15 11:06:55,830-: network mbf
Training: 2022-09-15 11:06:55,834-: resume False
Training: 2022-09-15 11:06:55,834-: save_all_states False
Training: 2022-09-15 11:06:55,834-: output work_dirs/ms1mv2_mbf
Training: 2022-09-15 11:06:55,834-: embedding_size 512
Training: 2022-09-15 11:06:55,834-: sample_rate 1.0
Training: 2022-09-15 11:06:55,834-: interclass_filtering_threshold 0
Training: 2022-09-15 11:06:55,834-: fp16 True
Training: 2022-09-15 11:06:55,834-: batch_size 256
Training: 2022-09-15 11:06:55,834-: optimizer sgd
Training: 2022-09-15 11:06:55,834-: lr 0.1
Training: 2022-09-15 11:06:55,834-: momentum 0.9
Training: 2022-09-15 11:06:55,834-: weight_decay 0.0001
Training: 2022-09-15 11:06:55,834-: verbose 2000
Training: 2022-09-15 11:06:55,834-: frequent 10
Training: 2022-09-15 11:06:55,834-: dali False
Training: 2022-09-15 11:06:55,834-: gradient_acc 1
Training: 2022-09-15 11:06:55,834-: seed 2048
Training: 2022-09-15 11:06:55,834-: num_workers 4
Training: 2022-09-15 11:06:55,834-: rec /home/pc/faces_webface_112x112
Training: 2022-09-15 11:06:55,834-: num_classes 10572
Training: 2022-09-15 11:06:55,834-: num_image 494194
Training: 2022-09-15 11:06:55,834-: num_epoch 40
Training: 2022-09-15 11:06:55,835-: warmup_epoch 0
Training: 2022-09-15 11:06:55,835-: val_targets ['lfw', 'cfp_fp', 'agedb_30']
Training: 2022-09-15 11:06:55,835-: total_batch_size 512
Training: 2022-09-15 11:06:55,835-: warmup_step 0
Training: 2022-09-15 11:06:55,835-: total_step 38600
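For reference, the values above are what train.py logs after loading configs/ms1mv2_mbf plus the base defaults. A hedged sketch of such a config file (the edict-style layout is my assumption about the repo's config convention; the values are copied from the log above):

from easydict import EasyDict as edict

config = edict()
config.margin_list = (1.0, 0.5, 0.0)   # ArcFace margin settings
config.network = "mbf"                 # MobileFaceNet backbone
config.resume = False
config.output = "work_dirs/ms1mv2_mbf"
config.embedding_size = 512
config.sample_rate = 1.0               # no partial-FC sampling
config.fp16 = True
config.batch_size = 256                # per GPU; total_batch_size = 512 with 2 workers
config.optimizer = "sgd"
config.lr = 0.1
config.momentum = 0.9
config.weight_decay = 1e-4
config.verbose = 2000
config.dali = False
config.num_workers = 4
config.rec = "/home/pc/faces_webface_112x112"
config.num_classes = 10572
config.num_image = 494194
config.num_epoch = 40
config.warmup_epoch = 0
config.val_targets = ["lfw", "cfp_fp", "agedb_30"]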
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
torch.Size([12000, 3, 112, 112])
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
torch.Size([14000, 3, 112, 112])
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
torch.Size([12000, 3, 112, 112])
/home/pc/fc/face/insightface/recognition/arcface_torch/train.py:163: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
torch.nn.utils.clip_grad_norm_(backbone.parameters(), 5)
/home/pc/anaconda3/envs/face19/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Training: 2022-09-15 11:07:37,277-Reducer buckets have been rebuilt in this iteration.
Training: 2022-09-15 11:07:55,067-Speed 518.42 samples/sec Loss 44.2595 LearningRate 0.099902 Epoch: 0 Global Step: 20 Fp16 Grad Scale: 8192 Required: 13 hours
Training: 2022-09-15 11:08:04,952-Speed 517.94 samples/sec Loss 45.0456 LearningRate 0.099850 Epoch: 0 Global Step: 30 Fp16 Grad Scale: 8192 Required: 12 hours
Training: 2022-09-15 11:08:14,893-Speed 515.12 samples/sec Loss 45.5388 LearningRate 0.099798 Epoch: 0 Global Step: 40 Fp16 Grad Scale: 8192 Required: 12 hours
Training: 2022-09-15 11:08:24,767-Speed 518.53 samples/sec Loss 45.7875 LearningRate 0.099746 Epoch: 0 Global Step: 50 Fp16 Grad Scale: 8192 Required: 12 hours
Training: 2022-09-15 11:08:34,667-Speed 517.22 samples/sec Loss 45.5845 LearningRate 0.099695 Epoch: 0 Global Step: 60 Fp16 Grad Scale: 8192 Required: 11 hours
Training: 2022-09-15 11:08:44,533-Speed 518.98 samples/sec Loss 45.6968 LearningRate 0.099643 Epoch: 0 Global Step: 70 Fp16 Grad Scale: 8192 Required: 11 hours
(face19) ubuntu@ubuntu-X10SRA:~/fc/face/insightface/recognition/arcface_torch$ python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="192.168.8.131" --master_port=12581 train.py configs/ms1mv2_mbf
/home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK')
instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 192.168.8.131:12581
rdzv_configs : {'rank': 1, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_bcc_b24k/none_nbf6ckxx
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=192.168.8.131 master_port=12581 group_rank=1 group_world_size=2 local_ranks=[0] role_ranks=[1] global_ranks=[1] role_world_sizes=[2] global_world_sizes=[2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_bcc_b24k/none_nbf6ckxx/attempt_0/0/error.json
sgd
/home/ubuntu/fc/face/insightface/recognition/arcface_torch/train.py:166: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
torch.nn.utils.clip_grad_norm_(backbone.parameters(), 5)
/home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)