
Issue with multi-GPU training

[Open] do1nothing opened this issue 6 months ago · 0 comments

My server has four NVIDIA RTX 4090 GPUs. Single-card training runs without errors, but if I change only the batch size to 2 (still single-card), it throws an error after completing just one epoch; no other parameters were changed. I then wanted to try multi-GPU training, but it keeps failing. I searched online for solutions, but none of them resolved the issue. The error message is as follows:

```
Traceback (most recent call last):
  File "./train.py", line 186, in <module>
    trainer.train()
  File "../../tasks/semantic/modules/trainer.py", line 280, in train
    show_scans=self.ARCH["train"]["show_scans"])
  File "../../tasks/semantic/modules/trainer.py", line 391, in train_epoch
    output = model(in_vol)
  File "/root/anaconda3/envs/salsanext/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/salsanext/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
```
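This RuntimeError typically means the model's parameters were not all on `cuda:0` (the first entry of `device_ids`) before it was wrapped in `nn.DataParallel`. Below is a minimal sketch of the setup DataParallel expects; the `model` here is an illustrative stand-in, not the actual SalsaNext network from the repo:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in module; any nn.Module behaves the same way
# under DataParallel.
model = nn.Sequential(nn.Conv2d(5, 32, 3, padding=1), nn.ReLU())

# DataParallel requires ALL parameters and buffers to live on
# device_ids[0] (cuda:0 by default) BEFORE wrapping. Moving the model
# to a different GPU, or leaving part of it on another device, raises
# the RuntimeError shown in the traceback above.
model = model.to("cuda:0")
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])

# Inputs go to cuda:0; DataParallel scatters them across the GPUs
# and gathers the outputs back onto cuda:0.
in_vol = torch.randn(8, 5, 64, 2048, device="cuda:0")
output = model(in_vol)
```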

do1nothing · Dec 31 '23