MS-SVConv
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
- System: Ubuntu 18.04
- PyTorch 1.9.0 + CUDA 11.1
- GPU: A100 with 40 GB memory
- Hydra 1.0.5
Hello, I ran the command below and got the following output:
command:
```
poetry run python train.py task=registration models=registration/ms_svconv_base model_name=MS_SVCONV_B2cm_X2_3head data=registration/fragment3dmatch training=sparse_fragment_reg tracker_options.make_submission=True training.epochs=200 eval_frequency=10
```
output:
```
Error executing job with overrides: ['task=registration', 'models=registration/ms_svconv_base', 'model_name=MS_SVCONV_B2cm_X2_3head', 'data=registration/fragment3dmatch_sparse', 'training=sparse_fragment_reg', 'tracker_options.make_submission=True', 'training.epochs=200', 'eval_frequency=10']
Traceback (most recent call last):
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/train.py", line 13, in main
    trainer = Trainer(cfg)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/trainer.py", line 49, in __init__
    self._initialize_trainer()
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/trainer.py", line 96, in _initialize_trainer
    self._dataset: BaseDataset = instantiate_dataset(self._cfg.data)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/dataset_factory.py", line 46, in instantiate_dataset
    dataset = dataset_cls(dataset_config)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/general3dmatch.py", line 355, in __init__
    self.train_dataset = Fragment3DMatch(
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/general3dmatch.py", line 260, in __init__
    Base3DMatch.__init__(
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/base3dmatch.py", line 122, in __init__
    super(Base3DMatch, self).__init__(root,
  File "/LOCAL2/ramdrop/apps/poetry/cache/virtualenvs/torch-points3d-s_H0q_C5-py3.9/lib/python3.9/site-packages/torch_geometric/data/dataset.py", line 87, in __init__
    self._process()
  File "/LOCAL2/ramdrop/apps/poetry/cache/virtualenvs/torch-points3d-s_H0q_C5-py3.9/lib/python3.9/site-packages/torch_geometric/data/dataset.py", line 170, in _process
    self.process()
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/general3dmatch.py", line 300, in process
    super().process()
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/base3dmatch.py", line 329, in process
    self._create_fragment(self.mode)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/base3dmatch.py", line 202, in _create_fragment
    rgbd2fragment_fine(list_path_frames,
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/utils.py", line 271, in rgbd2fragment_fine
    tsdf_vol = fusion.TSDFVolume(vol_bnds, voxel_size=voxel_size)
  File "/LOCAL2/ramdrop/github/point_registration/torch-points3d/torch_points3d/datasets/registration/fusion.py", line 61, in __init__
    self._weight_vol_gpu = cuda.mem_alloc(self._weight_vol_cpu.nbytes)
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```
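For what it's worth, the failing allocation is made directly through pycuda, so it is separate from any memory PyTorch has cached. A minimal diagnostic sketch (not part of the repository code) to see how much device memory is actually free at that point:

```python
# Minimal diagnostic sketch: report free vs. total device memory via pycuda.
# Standalone; not torch-points3d code. Shows what is left for cuMemAlloc.
import pycuda.autoinit  # noqa: F401  (creates a CUDA context on device 0)
import pycuda.driver as cuda

free_bytes, total_bytes = cuda.mem_get_info()
print(f"free: {free_bytes / 1024**3:.2f} GiB / total: {total_bytes / 1024**3:.2f} GiB")
```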
I checked the GPU memory allocation recorded by wandb (I tried two different versions of pycuda, and both runs failed with the same error shown above).
Is it normal that the allocated GPU memory keeps increasing during data preprocessing? I thought an A100 with 40 GB of memory would be sufficient for this job. If it isn't, do you know the minimum memory requirement for preprocessing the 3DMatch dataset?
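For a rough sense of scale: if fusion.py allocates dense float32 voxel grids spanning vol_bnds, as the cuda.mem_alloc(self._weight_vol_cpu.nbytes) call in the traceback suggests, then the footprint grows cubically with scene extent over voxel size. A back-of-the-envelope sketch (the bounds and voxel size below are made-up numbers, not values from the dataset):

```python
# Back-of-the-envelope estimate of a TSDF volume's GPU footprint.
# Assumption (not verified against fusion.py): the volume is a dense float32
# grid spanning vol_bnds at the given voxel_size, and at least two such grids
# (TSDF values + integration weights) are allocated, as the traceback suggests.
import numpy as np

vol_bnds = np.array([[-4.0, 4.0],   # hypothetical scene bounds in meters (x)
                     [-4.0, 4.0],   # (y)
                     [-2.0, 2.0]])  # (z)
voxel_size = 0.01  # hypothetical 1 cm voxels

dims = np.ceil((vol_bnds[:, 1] - vol_bnds[:, 0]) / voxel_size).astype(int)
bytes_per_grid = int(np.prod(dims)) * 4   # float32
total = 2 * bytes_per_grid                # TSDF + weight volumes
print(dims, f"-> {total / 1024**3:.2f} GiB for two dense grids")
```

Under these assumptions, a single fragment with badly scaled bounds (e.g. from a depth outlier) could demand many gigabytes on its own, and if successive volumes are not freed between fragments, that would also match the steadily climbing memory curve.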
That is weird, because every experiment was run on a 2080 Ti or a 1080 Ti. You can find the training set I generated here: https://cloud.mines-paristech.fr/index.php/s/mXN2RuebKjVMhLz
Thanks for the generated dataset. I have not managed to solve the issue itself, but I found a workaround: split the raw directory list and run the preprocessing multiple times, once per split, roughly as sketched below.
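A minimal sketch of what I mean, with illustrative paths and chunk count (the actual raw layout may differ):

```python
# Sketch of the workaround (illustrative paths; not repository code):
# split the list of raw scene directories into chunks and preprocess one
# chunk per run, so the accumulated GPU allocations stay bounded.
from pathlib import Path

RAW = Path("data/3dmatch/raw")  # hypothetical location of the raw scenes
N_CHUNKS = 4

scenes = sorted(p.name for p in RAW.iterdir() if p.is_dir())
chunks = [scenes[i::N_CHUNKS] for i in range(N_CHUNKS)]
for i, chunk in enumerate(chunks):
    # Write each chunk to a scene-list file, then point one preprocessing
    # run at each file (or move the other scene folders aside per run).
    Path(f"scenes_split_{i}.txt").write_text("\n".join(chunk))
    print(f"split {i}: {len(chunk)} scenes")
```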
Sorry to bother you again, but my training results are extremely weird: almost zero feature_matching_recall on both the val and test sets after 50 epochs. I suspect this could be caused by my data preprocessing. So, beyond the training set you already provided, would you mind sharing your full 3DMatch dataset as follows?