Can't reduce the batch size
My setup has 8 Titan X GPUs. When I try to set --ref 32, I get this error:
/var/spool/slurm/slurmd/job86812/slurm_script: line 50: $benchmarch_logs: ambiguous redirect
Traceback (most recent call last):
  File "/home/mu480317/ODISE/./tools/train_net.py", line 392, in <module>
-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
    main_func(*args)
  File "/home/mu480317/ODISE/tools/train_net.py", line 319, in main
    cfg = auto_scale_workers(cfg, comm.get_world_size())
  File "/home/mu480317/ODISE/odise/config/utils.py", line 65, in auto_scale_workers
    assert cfg.dataloader.train.total_batch_size % old_world_size == 0, (
AssertionError: Invalid reference_world_size in config! 8 % 32 != 0
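For context, the failing check in auto_scale_workers (names taken from the traceback) requires the config's total_batch_size (8 here) to be divisible by the reference world size passed via --ref (32 here), so 8 % 32 != 0 trips the assertion. Below is a minimal standalone sketch of that divisibility check; the scaling step is a simplified assumption for illustration, not ODISE's exact implementation:

```python
def auto_scale_workers_check(total_batch_size: int,
                             old_world_size: int,
                             new_world_size: int) -> int:
    """Sketch of the check in odise/config/utils.py::auto_scale_workers.

    old_world_size is the reference world size supplied via --ref;
    the scaling formula below is an illustrative assumption.
    """
    # The total batch size must split evenly across the reference workers.
    assert total_batch_size % old_world_size == 0, (
        f"Invalid reference_world_size in config! "
        f"{total_batch_size} % {old_world_size} != 0"
    )
    # Rescale the batch size to the actual number of GPUs (assumed formula).
    return total_batch_size // old_world_size * new_world_size

# With the values from the error above: batch size 8, --ref 32 on 8 GPUs
# raises the same AssertionError, since 8 % 32 != 0.
```

So with a total batch size of 8, any --ref value that does not divide 8 evenly will fail this assertion before training starts.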
With --ref 8, the assertion passes, but then the GPUs run out of memory.
Please help me solve this. Thank you