
CUDA out of memory

Open cooking43 opened this issue 2 years ago • 8 comments

Hello, I've been training for a while, but an error is reported partway through. Is there any way to solve this problem without changing the graphics card?

scene data use female smpl
/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1640811806235/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
camera ang threshold is 0.010285
box: [-0.7080196142196655, -1.2795634269714355, -0.3215314447879791] [0.7120546102523804, 0.7051210403442383, 0.3668109178543091]
/home/xds/project/SelfReconCode/MCAcc/seg3d_lossless.py:246: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  stride = (self.resolutions[-1] - 1) // (resolution - 1)
/home/xds/project/SelfReconCode/MCAcc/seg3d_lossless.py:261: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  coords_accum = coords // stride
/home/xds/project/SelfReconCode/MCAcc/seg3d_lossless.py:341: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  voxels = coords // stride
/home/xds/project/SelfReconCode/MCAcc/seg3d_lossless.py:381: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  point_coords = coords // stride
/home/xds/project/SelfReconCode/MCAcc/seg3d_lossless.py:417: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  voxels = coords // stride
Traceback (most recent call last):
  File "train.py", line 167, in <module>
    loss=optNet(outs,sample_pix_num,ratio,frame_ids,debug_root)
  File "/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xds/project/SelfReconCode/model/network.py", line 502, in forward
    total_loss=self.computeTmpPcLoss(defMeshes,[d_cond,[poses,trans]],masks,mgtMs,ratio)
  File "/home/xds/project/SelfReconCode/model/network.py", line 687, in computeTmpPcLoss
    loss.backward()
  File "/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/home/xds/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/pytorch3d-0.4.0-py3.8-linux-x86_64.egg/pytorch3d/renderer/compositing.py", line 56, in backward
    grad_features, grad_alphas = _C.accum_alphacomposite_backward(
RuntimeError: CUDA out of memory. Tried to allocate 668.00 MiB (GPU 0; 10.76 GiB total capacity; 8.00 GiB already allocated; 443.38 MiB free; 8.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

cooking43 avatar Jul 18 '22 11:07 cooking43
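
Editor's note: the numbers in the error message above can be read programmatically. The following is a minimal sketch, not part of SelfReconCode; `parse_oom` and the regular expression are hypothetical helpers written only for the message format quoted in this thread. It computes how much memory PyTorch has reserved but not allocated, which is the quantity the `max_split_size_mb` hint refers to.

```python
import re

# Matches one "<number> MiB|GiB" size as it appears in the quoted message.
_SIZE = r"([\d.]+) (MiB|GiB)"
_PATTERN = re.compile(
    rf"Tried to allocate {_SIZE} .*?{_SIZE} total capacity; "
    rf"{_SIZE} already allocated; {_SIZE} free; {_SIZE} reserved"
)


def _to_mib(value, unit):
    """Convert a size string and its unit to MiB."""
    return float(value) * (1024.0 if unit == "GiB" else 1.0)


def parse_oom(message):
    """Return the sizes from a CUDA OOM message in MiB (hypothetical helper)."""
    match = _PATTERN.search(message)
    if match is None:
        raise ValueError("not a recognised CUDA OOM message")
    groups = match.groups()
    names = ["requested", "total", "allocated", "free", "reserved"]
    stats = {n: _to_mib(groups[2 * i], groups[2 * i + 1]) for i, n in enumerate(names)}
    # Memory held by PyTorch's caching allocator but not used by tensors.
    stats["reserved_unallocated"] = stats["reserved"] - stats["allocated"]
    return stats


msg = ("RuntimeError: CUDA out of memory. Tried to allocate 668.00 MiB "
       "(GPU 0; 10.76 GiB total capacity; 8.00 GiB already allocated; "
       "443.38 MiB free; 8.18 GiB reserved in total by PyTorch)")
stats = parse_oom(msg)
print(stats)
```

For the message above, the reserved-but-unallocated gap is roughly 184 MiB, well below the 668 MiB request. That suggests (without proving it) that fragmentation is not the main problem here and that the model simply needs more memory than the card has.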

The default config needs a lot of memory, and an RTX 3090 is recommended. You can lower the marching cubes resolutions to reduce memory use, but the related optimization parameters then also need to be readjusted. This is a little tedious.

jby1993 avatar Jul 18 '22 13:07 jby1993
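
Editor's note: to give a feel for why grid resolution matters so much, here is a small arithmetic sketch. It is an illustration only. The grid size `(128+1, 224+1, 64+1)` is the one that appears in a traceback further down this thread, where it is passed to `initialLBSkinner`; whether the marching cubes resolutions in config.conf have a similar size is not stated here. `grid_points` and `scaled` are hypothetical helpers.

```python
def grid_points(resolution):
    """Number of sample points in a 3D grid with the given per-axis resolution."""
    nx, ny, nz = resolution
    return nx * ny * nz


def scaled(resolution, factor):
    """Scale the cell count per axis by `factor`, keeping the +1 for grid corners."""
    return tuple(int((n - 1) * factor) + 1 for n in resolution)


base = (128 + 1, 224 + 1, 64 + 1)  # grid size seen in a traceback in this thread
half = scaled(base, 0.5)           # half as many cells along every axis

ratio = grid_points(base) / grid_points(half)
print(base, grid_points(base))
print(half, grid_points(half))
print(f"reduction: {ratio:.1f}x")
```

Halving the resolution along each axis cuts the number of grid points by a factor of almost eight, so even a modest reduction can free a lot of memory for any per-point tensors. The cost, as noted above, is that other optimization parameters may need retuning.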

Thank you! I will try adjusting the parameters and hope it works.

cooking43 avatar Jul 18 '22 14:07 cooking43

Do you know how much memory is needed?

cooking43 avatar Jul 18 '22 14:07 cooking43

Almost 24 GB.

jby1993 avatar Jul 18 '22 17:07 jby1993

I'm using a GeForce RTX 3070 Laptop GPU and got a similar out-of-memory error, shown below. I edited config.conf a bit, reducing "sample_pix_num", "num_workers", and "batch_size", but every attempt failed. Which parameters should I edit to avoid the CUDA out of memory error?

error message

$ CUDA_VISIBLE_DEVICES=0 python train.py --gpu-ids 0 --conf config.conf --data $ROOT/female-3-casual --save-folder result
scene data use female smpl
/home/mas/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1640811806235/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Traceback (most recent call last):
  File "train.py", line 98, in <module>
    optNet,sdf_initialized=getOptNet(dataset,batch_size,bmins,bmaxs,resolutions['coarse'],device,config,use_initial_sdf)
  File "/home/mas/proj/study/computer_vision/SelfReconCode/model/network.py", line 850, in getOptNet
    skinner,tmpBodyVs,tmpBodyFs=initialLBSkinner(dataset.gender,dataset.shape.to(device),initPose,(128+1, 224+1, 64+1),bmins,bmaxs)
  File "/home/mas/proj/study/computer_vision/SelfReconCode/model/Deformer.py", line 294, in initialLBSkinner
    ws=compute_lbswField(bmins,bmaxs,resolution,verts.view(6890,3),smpl.weight.view(6890,24),align_corners=False,mean_neighbor=30,smooth_times=30)
  File "/home/mas/proj/study/computer_vision/SelfReconCode/model/Deformer.py", line 269, in compute_lbswField
    dists,indices=(tmp[:,None,:]-smpl_verts[None,:,:]).norm(dim=-1).topk(mean_neighbor,dim=-1,largest=False)
  File "/home/mas/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/_tensor.py", line 442, in norm
    return torch.norm(self, p, dim, keepdim, dtype=dtype)
  File "/home/mas/anaconda3/envs/SelfRecon/lib/python3.8/site-packages/torch/functional.py", line 1442, in norm
    return _VF.frobenius_norm(input, _dim, keepdim=keepdim)
RuntimeError: CUDA out of memory. Tried to allocate 1.29 GiB (GPU 0; 7.80 GiB total capacity; 5.22 GiB already allocated; 724.12 MiB free; 5.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

MasahiroOgawa avatar Oct 11 '22 00:10 MasahiroOgawa
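
Editor's note: the failing line in this traceback builds a broadcast difference `tmp[:,None,:]-smpl_verts[None,:,:]`, whose size grows with the number of query points times the 6890 SMPL vertices. The sketch below estimates that intermediate's size and how many query points fit in a given budget. It assumes float32 tensors, and `pairwise_diff_bytes` and `max_chunk` are hypothetical helpers. Whether and how Deformer.py already splits the grid into pieces is not visible in this thread.

```python
SMPL_VERTS = 6890       # from verts.view(6890,3) in the traceback
BYTES_PER_FLOAT32 = 4   # assumption: the tensors are float32
GIB = 1024 ** 3


def pairwise_diff_bytes(n_query, n_verts=SMPL_VERTS, dim=3):
    """Bytes for the (n_query, n_verts, dim) tensor created by
    tmp[:, None, :] - smpl_verts[None, :, :]."""
    return n_query * n_verts * dim * BYTES_PER_FLOAT32


def max_chunk(budget_bytes, n_verts=SMPL_VERTS, dim=3):
    """Largest number of query points whose difference tensor fits in the budget."""
    return budget_bytes // (n_verts * dim * BYTES_PER_FLOAT32)


full_grid = (128 + 1) * (224 + 1) * (64 + 1)   # grid size from the traceback
print(pairwise_diff_bytes(full_grid) / GIB)    # GiB if all grid points were used at once
print(max_chunk(GIB // 2))                     # query points per chunk for a 0.5 GiB budget
```

Because this allocation happens during initialization, before any training batch is processed, it is plausible that "sample_pix_num", "num_workers", and "batch_size" have little effect on it. Processing fewer query points per step, or using a coarser grid, would be the more direct lever, at the cost of editing the code.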

zhihu
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

SMY19999 avatar Dec 30 '22 12:12 SMY19999
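
Editor's note: the suggestion above only takes effect if the variable is in the environment before PyTorch's CUDA caching allocator is initialised. A minimal sketch of a conservative placement follows. `alloc_conf` is a hypothetical helper used only to check what was set, and the `import torch` line is left commented out so the snippet runs without a GPU.

```python
import os

# Set the option first; doing this before `import torch` is the conservative choice,
# since the allocator reads the variable when it is initialised.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"


def alloc_conf():
    """Parse PYTORCH_CUDA_ALLOC_CONF ("key:value,key:value") into a dict."""
    raw = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    return dict(item.split(":", 1) for item in raw.split(",") if item)


print(alloc_conf())

# import torch  # import only after the variable has been set
```

The same setting can be passed on the command line by prefixing the training command with `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128`. Note that this option only limits fragmentation in the caching allocator; it does not reduce the total memory the model needs.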

Thank you, will try.

MasahiroOgawa avatar Dec 30 '22 23:12 MasahiroOgawa

I put

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

at line 27 in train.py and ran

CUDA_VISIBLE_DEVICES=0 python train.py --gpu-ids 0 --conf config.conf --data $ROOT/female-3-casual --save-folder result

But it failed with "Segmentation fault (core dumped)" ...

MasahiroOgawa avatar Dec 31 '22 00:12 MasahiroOgawa