Argument Values for Pretraining Script

Open shubhanshu02 opened this issue 2 years ago • 0 comments

I am trying to replicate the experiment by running the pretraining script. This is what I have done so far:

  • Downloaded the ILSVRC 2017 dataset from the ImageNet website and extracted it.
  • Ran the pretraining script after changing the dataset path in the file and setting -n 2 -g 2.

With these settings I get a timeout error when the PyTorch distributed process group is initialized. Could you share the parameters you used for training?

Thank you

Error:

Traceback (most recent call last):
  File "imagenet_pretrain.py", line 424, in <module>
    main()
  File "imagenet_pretrain.py", line 421, in main
    mp.spawn(main_worker, nprocs=args.gpus, args=(args,))
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/shubhanshu/DSA2F/imagenet_pretrain.py", line 256, in main_worker
    rank=args.rank)
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 258, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=4, worker_count=2, timeout=0:30:00)
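
One possible reading of the final error line: the barrier expected 4 workers (world_size=4) but only 2 ever joined (worker_count=2). The sketch below parses those two numbers from the message and compares them with a world size computed as nodes × GPUs per node. That formula, and the reading of -n as the node count and -g as the GPUs per node, are assumptions and have not been checked against imagenet_pretrain.py; the helper names are hypothetical.

```python
import re

# Final line of the traceback above, copied verbatim.
ERROR_LINE = (
    "RuntimeError: Timed out initializing process group in store based barrier "
    "on rank: 1, for key: store_based_barrier_key:1 "
    "(world_size=4, worker_count=2, timeout=0:30:00)"
)


def parse_barrier_error(line):
    """Return (world_size, worker_count) from a store-based-barrier timeout message."""
    match = re.search(r"world_size=(\d+), worker_count=(\d+)", line)
    if match is None:
        raise ValueError("not a store based barrier timeout message")
    return int(match.group(1)), int(match.group(2))


def assumed_world_size(nodes, gpus_per_node):
    # ASSUMPTION: the script derives world_size as nodes * gpus_per_node.
    # This is a common pattern, not something confirmed from the script itself.
    return nodes * gpus_per_node


world_size, worker_count = parse_barrier_error(ERROR_LINE)
missing = world_size - worker_count

print(f"barrier expected {world_size} workers, {worker_count} joined, {missing} missing")
print(f"assumed world size for -n 2 -g 2: {assumed_world_size(2, 2)}")
print(f"assumed world size for -n 1 -g 2: {assumed_world_size(1, 2)}")
```

If the assumption holds, -n 2 -g 2 on a single machine would spawn 2 workers while the barrier waits for 4, which would match the timeout; -n 1 -g 2 would make the two numbers agree. Whether that corresponds to the settings used for the original training is still an open question.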

Posted by shubhanshu02 on Nov 14 '22 at 18:11