Can't reduce the batch size
My setup has 8 Titan X GPUs. When I try to set --ref 32, I get this error:
/var/spool/slurm/slurmd/job86812/slurm_script: line 50: $benchmarch_logs: ambiguous redirect
Traceback (most recent call last):
  File "/home/mu480317/ODISE/./tools/train_net.py", line 392, in <module>
-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
    main_func(*args)
  File "/home/mu480317/ODISE/tools/train_net.py", line 319, in main
    cfg = auto_scale_workers(cfg, comm.get_world_size())
  File "/home/mu480317/ODISE/odise/config/utils.py", line 65, in auto_scale_workers
    assert cfg.dataloader.train.total_batch_size % old_world_size == 0, (
AssertionError: Invalid reference_world_size in config! 8 % 32 != 0
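For context, the failing check in auto_scale_workers (names taken from the traceback) requires the config's total_batch_size (8 here) to be divisible by the reference world size passed via --ref (32 here), so 8 % 32 != 0 trips the assertion. Below is a minimal standalone sketch of that divisibility check; the scaling step is a simplified assumption for illustration, not ODISE's exact implementation:

```python
def auto_scale_workers_check(total_batch_size: int,
                             old_world_size: int,
                             new_world_size: int) -> int:
    """Sketch of the check in odise/config/utils.py::auto_scale_workers.

    old_world_size is the reference world size supplied via --ref;
    the scaling formula below is an illustrative assumption.
    """
    # The total batch size must split evenly across the reference workers.
    assert total_batch_size % old_world_size == 0, (
        f"Invalid reference_world_size in config! "
        f"{total_batch_size} % {old_world_size} != 0"
    )
    # Rescale the batch size to the actual number of GPUs (assumed formula).
    return total_batch_size // old_world_size * new_world_size

# With the values from the error above: batch size 8, --ref 32 on 8 GPUs
# raises the same AssertionError, since 8 % 32 != 0.
```

So with a total batch size of 8, any --ref value that does not divide 8 evenly will fail this assertion before training starts.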
With --ref 8, the assertion passes, but then the GPUs run out of memory.
Please help me solve this. Thank you