DSA2F
Argument Values for Pretraining Script
I am trying to replicate the experiment by running the pretraining script. Here is what I have done so far:
- Downloaded the ILSVRC 2017 dataset from the ImageNet website and extracted it.
- Ran the pretraining script after changing the dataset path in the file and setting `-n 2 -g 2`.
This setting gives me a timeout error when initializing the PyTorch distributed process group. Could you share the parameters you used for training?
Thank you.
Error:
```
Traceback (most recent call last):
  File "imagenet_pretrain.py", line 424, in <module>
    main()
  File "imagenet_pretrain.py", line 421, in main
    mp.spawn(main_worker, nprocs=args.gpus, args=(args,))
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/shubhanshu/DSA2F/imagenet_pretrain.py", line 256, in main_worker
    rank=args.rank)
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 258, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=4, worker_count=2, timeout=0:30:00)
```
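One detail that may be relevant: the error reports world_size=4 but only worker_count=2 workers joined before the barrier timed out. Assuming the script follows the usual `mp.spawn` pattern where the world size is computed as nodes × GPUs per node (the function name below is my own, just to illustrate the arithmetic):

```python
# Hypothetical sketch of how -n (nodes) and -g (gpus per node) typically
# combine into the world_size that init_process_group waits for.
# Assumption: the script uses the standard world_size = nodes * gpus formula.
def expected_world_size(nodes: int, gpus_per_node: int) -> int:
    # Each spawned GPU process counts as one worker in the process group.
    return nodes * gpus_per_node

# -n 2 -g 2 -> the store-based barrier waits for 4 workers.
print(expected_world_size(2, 2))  # 4

# A single machine with 2 GPUs only ever launches 2 of them,
# so the other 2 never register and the barrier times out.
print(expected_world_size(1, 2))  # 2
```

If that assumption holds, `-n 2 -g 2` asks the process group to wait for 4 workers across 2 nodes, which would explain the timeout when only one 2-GPU machine is actually running.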