insightface
Why is training with 1 machine (TITAN RTX) + 1 machine (RTX 3060) slower than training on either machine alone? The two nodes are launched as follows:
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="192.168.8.131" --master_port=12581 train.py configs/ms1mv2_mbf
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="192.168.8.131" --master_port=12581 train.py configs/ms1mv2_mbf
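As the logs below show, torch.distributed.launch prints a deprecation notice pointing at torch.distributed.run. If you want to avoid those warnings, the same two-node job can presumably be started with the replacement launcher using identical arguments; a sketch (untested here):

python -m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="192.168.8.131" --master_port=12581 train.py configs/ms1mv2_mbf
python -m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="192.168.8.131" --master_port=12581 train.py configs/ms1mv2_mbf

The output from node 0 (rank 0, host pc) follows; the output from node 1 (rank 1, host ubuntu-X10SRA) is further below.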
/home/pc/anaconda3/envs/face19/lib/python3.9/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK')
instead.
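The --use_env warning above only means the worker should read its local rank from the environment rather than from a --local_rank argument. A minimal sketch of that pattern (variable names are mine; recent versions of arcface_torch's train.py may already do this):

import os
import torch
import torch.distributed as dist

# torch.distributed.launch/run export these variables for every worker
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ.get("WORLD_SIZE", "1"))

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")  # MASTER_ADDR/PORT, RANK, WORLD_SIZE come from the launcher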
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 192.168.8.131:12581
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_4a5rychg/none__fkba0g3
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/pc/anaconda3/envs/face19/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=192.168.8.131 master_port=12581 group_rank=0 group_world_size=2 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[2] global_world_sizes=[2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_4a5rychg/none__fkba0g3/attempt_0/0/error.json
0
0
Training: 2022-09-15 11:06:52,012-rank_id: 0
Training: 2022-09-15 11:06:55,830-: margin_list [1.0, 0.5, 0.0]
Training: 2022-09-15 11:06:55,830-: network mbf
Training: 2022-09-15 11:06:55,834-: resume False
Training: 2022-09-15 11:06:55,834-: save_all_states False
Training: 2022-09-15 11:06:55,834-: output work_dirs/ms1mv2_mbf
Training: 2022-09-15 11:06:55,834-: embedding_size 512
Training: 2022-09-15 11:06:55,834-: sample_rate 1.0
Training: 2022-09-15 11:06:55,834-: interclass_filtering_threshold 0
Training: 2022-09-15 11:06:55,834-: fp16 True
Training: 2022-09-15 11:06:55,834-: batch_size 256
Training: 2022-09-15 11:06:55,834-: optimizer sgd
Training: 2022-09-15 11:06:55,834-: lr 0.1
Training: 2022-09-15 11:06:55,834-: momentum 0.9
Training: 2022-09-15 11:06:55,834-: weight_decay 0.0001
Training: 2022-09-15 11:06:55,834-: verbose 2000
Training: 2022-09-15 11:06:55,834-: frequent 10
Training: 2022-09-15 11:06:55,834-: dali False
Training: 2022-09-15 11:06:55,834-: gradient_acc 1
Training: 2022-09-15 11:06:55,834-: seed 2048
Training: 2022-09-15 11:06:55,834-: num_workers 4
Training: 2022-09-15 11:06:55,834-: rec /home/pc/faces_webface_112x112
Training: 2022-09-15 11:06:55,834-: num_classes 10572
Training: 2022-09-15 11:06:55,834-: num_image 494194
Training: 2022-09-15 11:06:55,834-: num_epoch 40
Training: 2022-09-15 11:06:55,835-: warmup_epoch 0
Training: 2022-09-15 11:06:55,835-: val_targets ['lfw', 'cfp_fp', 'agedb_30']
Training: 2022-09-15 11:06:55,835-: total_batch_size 512
Training: 2022-09-15 11:06:55,835-: warmup_step 0
Training: 2022-09-15 11:06:55,835-: total_step 38600
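For reference, the values above are what train.py logs after loading configs/ms1mv2_mbf plus the base defaults. A hedged sketch of such a config file (the edict-style layout is my assumption about the repo's config convention; the values are copied from the log above):

from easydict import EasyDict as edict

config = edict()
config.margin_list = (1.0, 0.5, 0.0)   # ArcFace margin settings
config.network = "mbf"                 # MobileFaceNet backbone
config.resume = False
config.output = "work_dirs/ms1mv2_mbf"
config.embedding_size = 512
config.sample_rate = 1.0               # no partial-FC sampling
config.fp16 = True
config.batch_size = 256                # per GPU; total_batch_size = 512 with 2 workers
config.optimizer = "sgd"
config.lr = 0.1
config.momentum = 0.9
config.weight_decay = 1e-4
config.verbose = 2000
config.dali = False
config.num_workers = 4
config.rec = "/home/pc/faces_webface_112x112"
config.num_classes = 10572
config.num_image = 494194
config.num_epoch = 40
config.warmup_epoch = 0
config.val_targets = ["lfw", "cfp_fp", "agedb_30"]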
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
torch.Size([12000, 3, 112, 112])
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
torch.Size([14000, 3, 112, 112])
loading bin 0
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
torch.Size([12000, 3, 112, 112])
/home/pc/fc/face/insightface/recognition/arcface_torch/train.py:163: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
torch.nn.utils.clip_grad_norm_(backbone.parameters(), 5)
/home/pc/anaconda3/envs/face19/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Training: 2022-09-15 11:07:37,277-Reducer buckets have been rebuilt in this iteration.
Training: 2022-09-15 11:07:55,067-Speed 518.42 samples/sec Loss 44.2595 LearningRate 0.099902 Epoch: 0 Global Step: 20 Fp16 Grad Scale: 8192 Required: 13 hours
Training: 2022-09-15 11:08:04,952-Speed 517.94 samples/sec Loss 45.0456 LearningRate 0.099850 Epoch: 0 Global Step: 30 Fp16 Grad Scale: 8192 Required: 12 hours
Training: 2022-09-15 11:08:14,893-Speed 515.12 samples/sec Loss 45.5388 LearningRate 0.099798 Epoch: 0 Global Step: 40 Fp16 Grad Scale: 8192 Required: 12 hours
Training: 2022-09-15 11:08:24,767-Speed 518.53 samples/sec Loss 45.7875 LearningRate 0.099746 Epoch: 0 Global Step: 50 Fp16 Grad Scale: 8192 Required: 12 hours
Training: 2022-09-15 11:08:34,667-Speed 517.22 samples/sec Loss 45.5845 LearningRate 0.099695 Epoch: 0 Global Step: 60 Fp16 Grad Scale: 8192 Required: 11 hours
Training: 2022-09-15 11:08:44,533-Speed 518.98 samples/sec Loss 45.6968 LearningRate 0.099643 Epoch: 0 Global Step: 70 Fp16 Grad Scale: 8192 Required: 11 hours
(face19) ubuntu@ubuntu-X10SRA:~/fc/face/insightface/recognition/arcface_torch$ python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="192.168.8.131" --master_port=12581 train.py configs/ms1mv2_mbf
/home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK')
instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 192.168.8.131:12581
rdzv_configs : {'rank': 1, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_bcc_b24k/none_nbf6ckxx
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=192.168.8.131 master_port=12581 group_rank=1 group_world_size=2 local_ranks=[0] role_ranks=[1] global_ranks=[1] role_world_sizes=[2] global_world_sizes=[2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_bcc_b24k/none_nbf6ckxx/attempt_0/0/error.json
sgd
/home/ubuntu/fc/face/insightface/recognition/arcface_torch/train.py:166: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
torch.nn.utils.clip_grad_norm_(backbone.parameters(), 5)
/home/ubuntu/anaconda3/envs/face19/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)