
Host memory occupancy is too high

MingXiangL opened this issue 3 years ago · 1 comment

When training with 8 × 2080Ti (12 GB) GPUs and 126 GB of host memory, training fails with a MemoryError:

Traceback (most recent call last):
  File "tools/train.py", line 198, in <module>
    main()
  File "tools/train.py", line 193, in main
    meta=meta,
  File "/home/ubuntu/Workspace/SoftTeacher-main/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/miniconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 117, in run
    iter_loaders = [IterLoader(x) for x in data_loaders]
  File "/home/miniconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 117, in <listcomp>
    iter_loaders = [IterLoader(x) for x in data_loaders]
  File "/home/miniconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 23, in __init__
    self.iter_loader = iter(self._dataloader)
  File "/home/miniconda3/envs/mmdet/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "/home/miniconda3/envs/mmdet/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/miniconda3/envs/mmdet/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 918, in __init__
    w.start()
  File "/home/miniconda3/envs/mmdet/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/miniconda3/envs/mmdet/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/miniconda3/envs/mmdet/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/miniconda3/envs/mmdet/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/miniconda3/envs/mmdet/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/miniconda3/envs/mmdet/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/miniconda3/envs/mmdet/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
MemoryError

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/miniconda3/envs/mmdet/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/home/miniconda3/envs/mmdet/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
[the same UnpicklingError traceback is printed three more times by other spawned worker processes]

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 4 (pid: 4682) of binary: /home/miniconda3/envs/mmdet/bin/python
······

We checked the memory usage and found that 126 GB of host memory is not enough: [screenshot of memory usage]
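For anyone reproducing this, host-memory usage can be watched while the dataloader workers spawn with a small script along these lines (a sketch only; `psutil` is an assumed extra dependency, not something SoftTeacher ships):

```python
# Watch host-memory usage while training starts up (psutil is an assumed dependency).
import time
import psutil

while True:
    mem = psutil.virtual_memory()
    print(f"host memory: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB ({mem.percent:.0f}%)")
    time.sleep(5)
```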

If we reduce the number of dataloader workers (5 -> 3) or the number of GPUs (8 -> 4), the problem goes away: [screenshot of memory usage after the change]
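For reference, a minimal sketch of the worker reduction, assuming the mmdetection-style `data` dict that SoftTeacher's configs follow (the field names and starting values here are assumptions for illustration, not the repository's exact defaults):

```python
# Sketch of the workaround above (assumed mmdetection-style config fields).
data = dict(
    samples_per_gpu=5,   # batch size per GPU, left unchanged here
    workers_per_gpu=3,   # dataloader workers reduced from 5 -> 3 to cut host-memory use
)
```

Running on fewer GPUs is then just a matter of launching the distributed training with 4 processes instead of 8.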

However, this workaround leaves only 4 of the server's 8 GPUs usable (otherwise host memory runs out). Is there any way to solve this?

MingXiangL · Jan 16 '22

1) Shrink the resize scale in the augmentation pipeline (e.g. change the strong augmentation to a smaller resize, or to resize + random crop, and change the weak augmentation to a fixed size; this does not hurt performance);
2) Switch to a different augmentation implementation, e.g. NVIDIA DALI;
3) Reduce the batch size, and correspondingly lower the learning rate and increase the number of iterations (a rough sketch of 1) and 3) is given below).
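A rough illustration of suggestions 1) and 3), assuming mmdetection-style config fields; the concrete scales, batch size, learning rate, and iteration count below are example values, not the repository's actual settings:

```python
# Illustrative sketch only (assumed mmdetection-style config fields; values are examples).

# 1) Smaller resize target for the augmentation pipelines: smaller images mean smaller
#    arrays shipped between dataloader worker processes, which is what fills host memory.
strong_resize = dict(type='Resize', img_scale=(1024, 640), keep_ratio=True)  # assumed smaller scale
strong_crop = dict(type='RandomCrop', crop_size=(512, 512))                  # optional resize + random crop
weak_resize = dict(type='Resize', img_scale=(1024, 640), keep_ratio=True)    # fixed, smaller weak-aug size

# 3) Smaller batch per GPU, with lr scaled down and iterations scaled up accordingly.
data = dict(samples_per_gpu=3, workers_per_gpu=3)                                # batch reduced from an assumed 5
optimizer = dict(type='SGD', lr=0.01 * 3 / 5, momentum=0.9, weight_decay=0.0001)  # lr scaled with batch size (assumed base lr 0.01)
runner = dict(type='IterBasedRunner', max_iters=int(180000 * 5 / 3))              # iterations scaled up (assumed base 180k)
```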

MendelXu · Jan 17 '22