Errors in multi-gpu training

Open XJTU-Haolin opened this issue 1 year ago • 1 comments

When I ran multi-GPU training of MicKey on 4×3090 GPUs, I got the following errors. I never hit such problems when using one GPU. It seems that something is wrong with the JPEG images, but the Map-free dataset was downloaded without any processing.

./train.sh: line 1: 23 Killed python3 train.py
[rank: 3] Child process with PID 27 terminated with code -9. Forcefully terminating all other processes to avoid zombies 🧟
RuntimeError: DataLoader worker (pid 2655) is killed by signal: Killed.
_error_if_any_worker_fails()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
transform = torch.eye(3)
File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/utils.py", line 92, in correct_intrinsic_scale
K = correct_intrinsic_scale(K, resize[0] / W, resize[1] / H)
File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/mapfree.py", line 47, in read_intrinsics
self.K, self.K_ori = self.read_intrinsics(self.scene_root, resize)
File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/mapfree.py", line 26, in __init__
MapFreeScene(
File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/mapfree.py", line 191, in
data_srcs = [
File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/mapfree.py", line 190, in __init__
dataset = self.dataset_type(self.cfg, 'val')
File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/lib/datasets/datamodules.py", line 107, in val_dataloader
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 179, in _call_lightning_datamodule_hook
return call._call_lightning_datamodule_hook(self.instance.trainer, self.name)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 309, in dataloader
return data_source.dataloader()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 342, in _request_dataloader
dataloaders = _request_dataloader(source)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 166, in setup_data
self.epoch_loop.val_loop.setup_data()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py", line 324, in on_run_start
self.on_run_start()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
self.fit_loop.run()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
results = self._run_stage()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
return function(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
trainer.fit(model, datamodule_end, ckpt_path=ckpt_path)
File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/train.py", line 89, in train_model
train_model(args)
File "/opt/data/private/zhanghaolin_project/local_feature/mickey-main/train.py", line 99, in
Traceback (most recent call last):
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Training with 0.00/1.00 image overlap
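One way to follow up on the "Premature end of JPEG file" warnings in the log above is to scan the dataset for files that look truncated. The sketch below is only a heuristic and is not part of the MicKey code: it assumes a complete JPEG starts with the SOI marker (FF D8) and has the EOI marker (FF D9) within its last few bytes. The function names, the `tail_bytes` parameter, and the example path are hypothetical.

```python
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def is_probably_truncated(path, tail_bytes=64):
    """Heuristic check: True if the file lacks SOI at the start or EOI near the end."""
    size = path.stat().st_size
    if size < 4:
        return True
    with path.open("rb") as f:
        head = f.read(2)
        # Only read the tail of the file, so large datasets scan quickly.
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    return head != b"\xff\xd8" or b"\xff\xd9" not in tail


def find_truncated_jpegs(root):
    """Return a sorted list of JPEG paths under `root` that look truncated."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file()
        and p.suffix.lower() in JPEG_SUFFIXES
        and is_probably_truncated(p)
    )


# Example (hypothetical path):
# for p in find_truncated_jpegs("/path/to/mapfree/val"):
#     print(p)
```

A file flagged here may still decode with a warning, and a file that passes may be damaged elsewhere, so this only narrows down which images are worth downloading again.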

Could you give me any instructions?

Thanks for your time!

Haolin

XJTU-Haolin avatar Jul 17 '24 07:07 XJTU-Haolin

Hello, sorry for the late response. Is this problem solved? I couldn't replicate it.

It seems that the problem is in the validation data, not the training data. Have you verified that the paths to the validation images and intrinsics are correct?
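A quick way to run this check is sketched below. It is a rough sketch under stated assumptions, not code from this repository: it assumes the layout `<data_root>/<split>/<scene>/`, that each scene folder holds an intrinsics file (the name `intrinsics.txt` is an assumption passed as a parameter), and that the images are `.jpg` files somewhere below the scene folder. The function name `check_split` is hypothetical.

```python
from pathlib import Path


def check_split(data_root, split="val", intrinsics_name="intrinsics.txt"):
    """Return {scene_name: [issues]} for scenes that look incomplete.

    The directory layout and `intrinsics_name` are assumptions; adjust them
    to match how the dataset is stored locally.
    """
    split_dir = Path(data_root) / split
    if not split_dir.is_dir():
        return {str(split_dir): ["split directory not found"]}

    problems = {}
    for scene in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        issues = []
        intrinsics = scene / intrinsics_name
        if not intrinsics.is_file():
            issues.append(f"missing {intrinsics_name}")
        elif intrinsics.stat().st_size == 0:
            issues.append(f"empty {intrinsics_name}")
        # any() stops at the first match, so this stays cheap on large scenes.
        if not any(scene.rglob("*.jpg")):
            issues.append("no .jpg images found")
        if issues:
            problems[scene.name] = issues
    return problems


# Example (hypothetical path):
# for scene, issues in check_split("/path/to/mapfree").items():
#     print(scene, issues)
```

An empty result only means the expected files exist; it does not confirm that the images decode or that the intrinsics parse correctly.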

axelBarroso avatar Aug 13 '24 13:08 axelBarroso

I will check it again. Thanks for your reply!

XJTU-Haolin avatar Aug 28 '24 09:08 XJTU-Haolin

Closing this issue since it has been inactive for a while. Please reopen it if you run into any other problems. Thanks!

axelBarroso avatar Sep 18 '24 08:09 axelBarroso