SA-ConvONet
GPU memory issue when running the demo
Hi,
I am running into a memory issue where CUDA complains about insufficient VRAM. The issue appears to stem from line 245 of src/conv_onet/training.py, where the model extracts 3D features from the input point cloud.
My GPU has 8 GiB of VRAM, but it seems the demo needs more than that. Could you tell me what the minimum required VRAM is, and whether there is a way to reduce the VRAM requirement? Thanks
Regards
Hi,
The GPUs I used for these experiments had at least 11 GB of RAM.
If you are using an 8 GB GPU, you can reduce the batch size from 16 to 12 or 8 to resolve the memory issue.
You can also reduce the number of input points used for point feature learning.
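For example, the relevant fields in the config would look something like this (the exact file and nesting depend on which config you are running, so treat this as a sketch rather than the actual defaults):

```yaml
# Sketch of the memory-related settings (surrounding structure assumed)
training:
  batch_size: 8        # reduced from 16 to fit an 8 GB GPU
data:
  pointcloud_n: 4096   # example of a smaller number of input points
```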
I hope this helps! Please let me know whether it resolves the problem!
Best, Jiapeng
Thanks @tangjiapeng,
I can see the batch size in generate_optim_largescene.py is set to 1, so I don't think a large batch size is the issue. I did try downsampling the point cloud as you suggested: I set both pointcloud_n and pointcloud_subsample in demo_matterport.yaml to 4096, but I am still getting OOM errors. Is this the correct way of downsampling the input points? To help you diagnose the issue, I have included a sketch of my config changes and the full error message below.
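The relevant lines in demo_matterport.yaml now look roughly like this (the surrounding structure is from memory, so treat it as a sketch rather than a verbatim copy of the file):

```yaml
# Rough sketch of my edits to demo_matterport.yaml (nesting assumed)
data:
  pointcloud_n: 4096          # reduced number of input points
  pointcloud_subsample: 4096  # also reduced, as described above
```

And here is the full error output: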
Warning: generator does not support pointcloud generation.
0%| | 0/2 [00:00<?, ?it/s]Process scenes in a sliding-window manner
ft only encoder True
only optimize encoder | 693/693 [01:44<00:00, 6.02it/s]
Traceback (most recent call last):
File "generate_optim_largescene.py", line 235, in <module>
loss = trainer.sign_agnostic_optim_cropscene_step(crop_data, state_dict)
File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/conv_onet/training.py", line 216, in sign_agnostic_optim_cropscene_step
loss = self.compute_sign_agnostic_cropscene_loss(data)
File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/conv_onet/training.py", line 244, in compute_sign_agnostic_cropscene_loss
c = self.model.encode_inputs(inputs)
File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/conv_onet/models/__init__.py", line 60, in encode_inputs
c = self.encoder(inputs)
File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/pointnet.py", line 307, in forward
fea['grid'] = self.generate_grid_features(index['grid'], c)
File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/pointnet.py", line 262, in generate_grid_features
fea_grid = self.unet3d(fea_grid)
File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/unet3d.py", line 465, in forward
x = decoder(encoder_features, x)
File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/unet3d.py", line 284, in forward
x = self.joining(encoder_features, x)
File "/home/xingrui/Workspace/3dmatch/third_party/SA-ConvONet/src/encoder/unet3d.py", line 291, in _joining
return torch.cat((encoder_features, x), dim=1)
RuntimeError: CUDA out of memory. Tried to allocate 750.00 MiB (GPU 0; 7.79 GiB total capacity; 5.98 GiB already allocated; 446.81 MiB free; 6.03 GiB reserved in total by PyTorch)
Exception ignored in: <bound method tqdm.__del__ of 0%| | 0/2 [01:50<?, ?it/s]>
Traceback (most recent call last):
File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_tqdm.py", line 931, in __del__
self.close()
File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1133, in close
self._decr_instances(self)
File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_tqdm.py", line 496, in _decr_instances
cls.monitor.exit()
File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/site-packages/tqdm/_monitor.py", line 52, in exit
self.join()
File "/home/xingrui/miniconda3/envs/sa_conet/lib/python3.6/threading.py", line 1053, in join
raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread
The batch_size in demo_matterport.yaml was 2; you can set it to 1.
An even better option is to use a GPU with more RAM.
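For example, the change would look something like this (the exact section batch_size sits under may differ, so this is only a sketch):

```yaml
# Sketch: lower the batch size to reduce peak GPU memory
training:
  batch_size: 1   # was 2
```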
Hi xingrui, have you resolved the memory issue?