Pointnet_Pointnet2_pytorch

CUDA out of memory while doing part segmentation training

Open blueeaglex opened this issue 1 year ago • 2 comments

I tried to run the command below:
python train_partseg.py --model pointnet2_part_seg_msg --normal --log_dir pointnet2_part_seg_msg
but I got a CUDA out-of-memory error. Here are the details:

PARAMETER ...
Namespace(model='pointnet2_part_seg_msg', batch_size=16, epoch=251, learning_rate=0.001, gpu='0', optimizer='Adam', log_dir='pointnet2_part_seg_msg', decay_rate=0.0001, npoint=2048, normal=True, step_size=20, lr_decay=0.5)
The number of training data is: 13998
The number of test data is: 2874
Use pretrain model
Epoch 1 (106/251):
Learning rate:0.000031
BN momentum updated to: 0.010000
  0%|          | 0/874 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/lidar/aloglab/Pointnet_Pointnet2_pytorch/train_partseg.py", line 305, in <module>
    main(args)
  File "/home/lidar/aloglab/Pointnet_Pointnet2_pytorch/train_partseg.py", line 193, in main
    seg_pred, trans_feat = classifier(points, to_categorical(label, num_classes))
  File "/home/lidar/anaconda3/envs/PointPillars/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lidar/aloglab/Pointnet_Pointnet2_pytorch/models/pointnet2_part_seg_msg.py", line 36, in forward
    l2_xyz, l2_points = self.sa2(l1_xyz, l1_points)
  File "/home/lidar/anaconda3/envs/PointPillars/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lidar/aloglab/Pointnet_Pointnet2_pytorch/models/pointnet2_utils.py", line 248, in forward
    grouped_points = torch.cat([grouped_points, grouped_xyz], dim=-1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 324.00 MiB (GPU 0; 5.80 GiB total capacity; 4.62 GiB already allocated; 137.06 MiB free; 5.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I didn't find an option named 'max_split_size_mb' in the code. Any ideas would be much appreciated.

blueeaglex avatar Jan 11 '23 09:01 blueeaglex
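For context, max_split_size_mb is not a command-line argument of train_partseg.py; it is an option of PyTorch's CUDA caching allocator, configured through the PYTORCH_CUDA_ALLOC_CONF environment variable (as the error message hints). A minimal sketch of how it could be set is below; the 128 MiB value is only an illustrative assumption, not a recommendation.

```python
# Sketch: configure PyTorch's CUDA caching allocator via PYTORCH_CUDA_ALLOC_CONF.
# Set the variable before importing torch (or export it in the shell before
# launching train_partseg.py) so the allocator picks it up when CUDA initializes.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # 128 is an example value

import torch  # imported after the variable is set

print(torch.cuda.is_available())
```

The equivalent shell form would be to prefix the training command, e.g. PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train_partseg.py ... Note that this only mitigates fragmentation; if the GPU is simply too small for the batch, reducing --batch_size is still required.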

Well, after I added the option '--batch_size=10', it ran normally. Problem solved. But I suggest that the maximum batch size be computed by the program.

blueeaglex avatar Jan 12 '23 01:01 blueeaglex
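A rough sketch of that suggestion: start from the requested batch size and halve it whenever a trial step runs out of GPU memory. The helpers build_loader (builds a DataLoader for a given batch size) and run_one_step (one forward/backward pass of the model) are hypothetical placeholders, not functions from this repository.

```python
import torch

def find_max_batch_size(build_loader, run_one_step, start=16, minimum=1):
    """Probe for the largest batch size that fits in GPU memory."""
    batch_size = start
    while batch_size >= minimum:
        try:
            loader = build_loader(batch_size)
            run_one_step(next(iter(loader)))   # one trial forward/backward pass
            return batch_size                  # this size fits in GPU memory
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()           # release the failed allocation
            batch_size //= 2                   # try a smaller batch
    raise RuntimeError("Even the minimum batch size does not fit on this GPU")
```

The returned value could then be passed as --batch_size instead of hard-coding a number like 10.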


Hello, I'm also working on part segmentation. Could we exchange contact details? WeChat: 18018594107

wangzhen5201314 avatar Feb 15 '23 07:02 wangzhen5201314