MS-SVConv
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
- System: Ubuntu 18.04
- PyTorch 1.9.0 + CUDA 11.1
- GPU: A100 with 40 GB memory
- Hydra 1.0.5
Hello, I ran the command below and got the following output:
command:
```
poetry run python train.py task=registration models=registration/ms_svconv_base model_name=MS_SVCONV_B2cm_X2_3head data=registration/fragment3dmatch training=sparse_fragment_reg tracker_options.make_submission=True training.epochs=200 eval_frequency=10
```
output:
```
Error executing job with overrides: ['task=registration', 'models=registration/ms_svconv_base', 'model_name=MS_SVCONV_B2cm_X2_3head', 'data=registration/fragment3dmatch_sparse', 'training=sparse_fragment_reg', 'tracker_options.make_submission=True', 'training.epochs=200', 'eval_frequency=10']
Traceback (most recent call last):
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/train.py", line 13, in main
    trainer = Trainer(cfg)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/trainer.py", line 49, in __init__
    self._initialize_trainer()
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/trainer.py", line 96, in _initialize_trainer
    self._dataset: BaseDataset = instantiate_dataset(self._cfg.data)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/dataset_factory.py", line 46, in instantiate_dataset
    dataset = dataset_cls(dataset_config)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/general3dmatch.py", line 355, in __init__
    self.train_dataset = Fragment3DMatch(
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/general3dmatch.py", line 260, in __init__
    Base3DMatch.__init__(
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/base3dmatch.py", line 122, in __init__
    super(Base3DMatch, self).__init__(root,
  File "/LOCAL2/ramdrop/apps/poetry/cache/virtualenvs/torch-points3d-s_H0q_C5-py3.9/lib/python3.9/site-packages/torch_geometric/data/dataset.py", line 87, in __init__
    self._process()
  File "/LOCAL2/ramdrop/apps/poetry/cache/virtualenvs/torch-points3d-s_H0q_C5-py3.9/lib/python3.9/site-packages/torch_geometric/data/dataset.py", line 170, in _process
    self.process()
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/general3dmatch.py", line 300, in process
    super().process()
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/base3dmatch.py", line 329, in process
    self._create_fragment(self.mode)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/base3dmatch.py", line 202, in _create_fragment
    rgbd2fragment_fine(list_path_frames,
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/utils.py", line 271, in rgbd2fragment_fine
    tsdf_vol = fusion.TSDFVolume(vol_bnds, voxel_size=voxel_size)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/fusion.py", line 61, in __init__
    self._weight_vol_gpu = cuda.mem_alloc(self._weight_vol_cpu.nbytes)
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```
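For what it's worth, the failing allocation is made directly through pycuda, so it is separate from any memory PyTorch has cached. A minimal diagnostic sketch (not part of the repository code) to see how much device memory is actually free at that point:

```python
# Minimal diagnostic sketch: report free vs. total device memory via pycuda.
# Standalone; not torch-points3d code. Shows what is left for cuMemAlloc.
import pycuda.autoinit  # noqa: F401  (creates a CUDA context on device 0)
import pycuda.driver as cuda

free_bytes, total_bytes = cuda.mem_get_info()
print(f"free: {free_bytes / 1024**3:.2f} GiB / total: {total_bytes / 1024**3:.2f} GiB")
```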
I checked the GPU memory allocation recorded by wandb (I tried two different versions of pycuda, and both runs failed with the same error shown above).
Is it normal that the allocated GPU memory keeps increasing during data preprocessing? I thought an A100 with 40 GB of memory would be sufficient for this job. If it isn't, do you know the minimum memory requirement for preprocessing the 3DMatch dataset?
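For a rough sense of scale: if fusion.py allocates dense float32 voxel grids spanning vol_bnds, as the cuda.mem_alloc(self._weight_vol_cpu.nbytes) call in the traceback suggests, then the footprint grows cubically with scene extent over voxel size. A back-of-the-envelope sketch (the bounds and voxel size below are made-up numbers, not values from the dataset):

```python
# Back-of-the-envelope estimate of a TSDF volume's GPU footprint.
# Assumption (not verified against fusion.py): the volume is a dense float32
# grid spanning vol_bnds at the given voxel_size, and at least two such grids
# (TSDF values + integration weights) are allocated, as the traceback suggests.
import numpy as np

vol_bnds = np.array([[-4.0, 4.0],   # hypothetical scene bounds in meters (x)
                     [-4.0, 4.0],   # (y)
                     [-2.0, 2.0]])  # (z)
voxel_size = 0.01  # hypothetical 1 cm voxels

dims = np.ceil((vol_bnds[:, 1] - vol_bnds[:, 0]) / voxel_size).astype(int)
bytes_per_grid = int(np.prod(dims)) * 4   # float32
total = 2 * bytes_per_grid                # TSDF + weight volumes
print(dims, f"-> {total / 1024**3:.2f} GiB for two dense grids")
```

Under these assumptions, a single fragment with badly scaled bounds (e.g. from a depth outlier) could demand many gigabytes on its own, and if successive volumes are not freed between fragments, that would also match the steadily climbing memory curve.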
That is weird, because every experiment was run on a 2080 Ti or a 1080 Ti. You can find the training set I generated here: https://cloud.mines-paristech.fr/index.php/s/mXN2RuebKjVMhLz
Thanks for the generated dataset. I have not managed to solve the issue itself, but I found a workaround: split the raw directory list and run the preprocessing multiple times, once per split, roughly as sketched below.
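A minimal sketch of what I mean, with illustrative paths and chunk count (the actual raw layout may differ):

```python
# Sketch of the workaround (illustrative paths; not repository code):
# split the list of raw scene directories into chunks and preprocess one
# chunk per run, so the accumulated GPU allocations stay bounded.
from pathlib import Path

RAW = Path("data/3dmatch/raw")  # hypothetical location of the raw scenes
N_CHUNKS = 4

scenes = sorted(p.name for p in RAW.iterdir() if p.is_dir())
chunks = [scenes[i::N_CHUNKS] for i in range(N_CHUNKS)]
for i, chunk in enumerate(chunks):
    # Write each chunk to a scene-list file, then point one preprocessing
    # run at each file (or move the other scene folders aside per run).
    Path(f"scenes_split_{i}.txt").write_text("\n".join(chunk))
    print(f"split {i}: {len(chunk)} scenes")
```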
Sorry to bother you again, but my training results are extremely weird: almost zero feature_matching_recall on both the val and test sets after 50 epochs. I suspect this could be caused by my data preprocessing. So, beyond the training set you already provided, would you mind sharing your full 3DMatch dataset as follows?