
CUDA out of memory

Open · Michaelwhite34 opened this issue · 6 comments

```
train_scene.sh drv/rabbit
Hello Wooden
Load data: Begin
Not using masks
image shape, mask shape: torch.Size([324, 768, 1024, 3]) torch.Size([324, 768, 1024, 3])
image pixel range: 0.0 1.0
Load data: End
  0%|          | 0/100001 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "render_volume.py", line 449, in <module>
    runner.train()
  File "render_volume.py", line 127, in train
    render_out = self.renderer.render(
  File "/home/michael/iron/models/renderer.py", line 374, in render
    ret_fine = self.render_core(
  File "/home/michael/iron/models/renderer.py", line 233, in render_core
    gradients = sdf_network.gradient(pts)
  File "/home/michael/iron/models/fields.py", line 110, in gradient
    gradients = torch.autograd.grad(
  File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/autograd/__init__.py", line 275, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 5.80 GiB total capacity; 4.03 GiB already allocated; 118.56 MiB free; 4.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Wrote config file to ./exp_iron_stage2/drv/rabbit/args.txt
render_surface.py:256: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning dissapear) use import imageio.v2 as imageio or call imageio.v2.imread directly.
  im = imageio.imread(fpath).astype(np.float32) / 255.0
ic| fill_holes: False
    handle_edges: True
    is_training: True
    args.inv_gamma_gt: False
  0%|          | 0/50001 [00:00<?, ?it/s]
ic| args.out_dir: './exp_iron_stage2/drv/rabbit'
    global_step: 0
    loss.item(): 0.00573146715760231
    img_loss.item(): 0.0
    img_l2_loss.item(): 0.0
    img_ssim_loss.item(): 0.0
    eik_loss.item(): 0.00573146715760231
    roughrange_loss.item(): 0.0
    color_network_dict["point_light_network"].get_light().item(): 5.6220927238464355
  1%|▎         | 499/50001 [01:35<3:20:37, 4.11it/s]
ic| args.out_dir: './exp_iron_stage2/drv/rabbit'
    global_step: 500
    loss.item(): 0.014144735410809517
    img_loss.item(): 0.0
    img_l2_loss.item(): 0.0
    img_ssim_loss.item(): 0.0
    eik_loss.item(): 0.014144735410809517
    roughrange_loss.item(): 0.0
    color_network_dict["point_light_network"].get_light().item(): 5.224419593811035
```
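The RuntimeError text itself points at `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of setting it is below; the 128 MB split size is an arbitrary example, and this only mitigates fragmentation when reserved memory is much larger than allocated memory, so it cannot make a model fit that genuinely needs more than the ~5.8 GiB this GPU has:

```python
# Sketch: the allocator option must be in the environment before the first CUDA
# allocation, so it is usually exported in the shell or set at the very top of the script.
# "max_split_size_mb:128" is an illustrative guess, not a tuned value.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # first CUDA use must happen after the variable is set
print(torch.cuda.is_available())
```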

Michaelwhite34 · Aug 22 '22 06:08

Another out-of-memory error appears when I stop the process:

```
^Z
[1]+  Stopped                 python render_surface.py --data_dir ./data_flashlight/${SCENE}/train --out_dir ./exp_iron_stage2/${SCENE} --neus_ckpt_fpath ./exp_iron_stage1/${SCENE}/checkpoints/ckpt_100000.pth --num_iters 50001 --gamma_pred
ic| args: Namespace(data_dir='./data_flashlight/drv/rabbit/test', eik_weight=0.1, export_all=False, gamma_pred=True, init_light_scale=8.0, inv_gamma_gt=False, is_metal=False, neus_ckpt_fpath='./exp_iron_stage1/drv/rabbit/checkpoints/ckpt_100000.pth', no_edgesample=False, num_iters=50001, out_dir='./exp_iron_stage2/drv/rabbit', patch_size=128, plot_image_name=None, render_all=True, roughrange_weight=0.1, ssim_weight=1.0)
Wrote config file to ./exp_iron_stage2/drv/rabbit/args.txt
Traceback (most recent call last):
  File "render_surface.py", line 136, in <module>
    sdf_network = SDFNetwork(
  File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 688, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
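For context: `^Z` only suspends the previous run, so its CUDA context and allocated memory stay on the GPU, and the next launch then fails at the very first `.cuda()` call. One way to confirm is to look at the process table that `nvidia-smi` prints; a minimal sketch, assuming `nvidia-smi` is installed (killing the stopped job, or resuming it with `fg` and exiting cleanly, releases the memory):

```python
# Sketch: print nvidia-smi's output to see which PIDs still hold GPU memory.
# Assumes the nvidia-smi binary is installed and on the PATH.
import subprocess

result = subprocess.run(["nvidia-smi"], capture_output=True, text=True, check=True)
# A job stopped with Ctrl-Z still appears in the "Processes" table at the bottom
# and keeps its memory reserved until it is killed or resumed and exited.
print(result.stdout)
```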

Michaelwhite34 · Aug 22 '22 07:08

Note that at least 12 GB of GPU memory is needed for the default settings. You can try decreasing the rendered patch size if you have less memory.
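For a rough sense of what that buys: `render_surface.py` exposes `--patch_size` (128 in the logs above), and the number of rays traced per optimization step, and hence most of the per-step memory, grows with the square of the patch size. A back-of-the-envelope sketch; the quadratic scaling is the point, while the exact per-ray memory cost depends on the networks and sample counts:

```python
# Sketch: per-step ray count (and roughly the activation memory of one optimization
# step) scales with the square of the rendered patch size.
DEFAULT = 128  # the --patch_size value shown in the args Namespace above

for patch_size in (128, 96, 64, 48):
    rays = patch_size * patch_size
    print(f"--patch_size {patch_size:3d}: {rays:6d} rays/step "
          f"({rays / DEFAULT**2:.2f}x the default)")
```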

Kai-46 · Aug 22 '22 22:08

> Note that at least 12 GB of GPU memory is needed for the default settings. You can try decreasing the rendered patch size if you have less memory.

I decreased batch_size and n_samples in womask_Iron, and it now only gives an error in the final mesh-and-UV export stage. Can you tell me exactly which parameter and file I should modify?

```
100%|███████████████████████████████████| 50001/50001 [5:30:39<00:00, 2.52it/s]
ic| f"Exporting mesh and materials to: {export_out_dir}": ('Exporting mesh and materials to: ' './exp_iron_stage2/drv/rabbit/mesh_and_materials_50000')
ic| 'Exporting mesh and uv...'
face_normals incorrect shape, ignoring!
/home/michael/iron/models/export_mesh.py:82: UserWarning: torch.eig is deprecated in favor of torch.linalg.eig and will be removed in a future PyTorch release. torch.linalg.eig returns complex tensors of dtype cfloat or cdouble rather than real tensors mimicking complex tensors.
L, _ = torch.eig(A)
should be replaced with
L_complex = torch.linalg.eigvals(A)
and
L, V = torch.eig(A, eigenvectors=True)
should be replaced with
L_complex, V_complex = torch.linalg.eig(A) (Triggered internally at ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2910.)
  vecs = torch.eig(s_cov, True)[1].transpose(0, 1)
Traceback (most recent call last):
  File "render_surface.py", line 549, in <module>
    export_mesh_and_materials(export_out_dir, sdf_network, color_network_dict)
  File "render_surface.py", line 325, in export_mesh_and_materials
    export_mesh(sdf_fn, os.path.join(export_out_dir, "mesh.obj"))
  File "/home/michael/iron/models/export_mesh.py", line 87, in export_mesh
    grid_aligned = get_grid(helper.cpu(), resolution)
  File "/home/michael/iron/models/export_mesh.py", line 41, in get_grid
    grid_points = torch.tensor(np.vstack([xx.ravel(), yy.ravel(), zz.ravel()]).T, dtype=torch.float).cuda()
RuntimeError: CUDA out of memory. Tried to allocate 4.52 GiB (GPU 0; 5.80 GiB total capacity; 68.10 MiB already allocated; 4.27 GiB free; 104.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ic| args: Namespace(data_dir='./data_flashlight/drv/rabbit/test', eik_weight=0.1, export_all=False, gamma_pred=True, init_light_scale=8.0, inv_gamma_gt=False, is_metal=False, neus_ckpt_fpath='./exp_iron_stage1/drv/rabbit/checkpoints/ckpt_100000.pth', no_edgesample=False, num_iters=50001, out_dir='./exp_iron_stage2/drv/rabbit', patch_size=128, plot_image_name=None, render_all=True, roughrange_weight=0.1, ssim_weight=1.0)
Wrote config file to ./exp_iron_stage2/drv/rabbit/args.txt
render_surface.py:256: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning dissapear) use import imageio.v2 as imageio or call imageio.v2.imread directly.
  im = imageio.imread(fpath).astype(np.float32) / 255.0
ic| len(image_fpaths): 82
    gt_images.shape: torch.Size([82, 768, 1024, 3])
    Ks.shape: torch.Size([82, 4, 4])
    W2Cs.shape: torch.Size([82, 4, 4])
    len(cameras): 82
ic| args.neus_ckpt_fpath: './exp_iron_stage1/drv/rabbit/checkpoints/ckpt_100000.pth'
ic| f"Loading from neus checkpoint: {args.neus_ckpt_fpath}": ('Loading from neus checkpoint: ' './exp_iron_stage1/drv/rabbit/checkpoints/ckpt_100000.pth')
ic| "Reloading from checkpoint: ": 'Reloading from checkpoint: '
    ckpt_fpath: './exp_iron_stage2/drv/rabbit/ckpt_50000.pth'
ic| dist: 0.8803050220012665
    color_network_dict["point_light_network"].light.data: tensor(1.7133, device='cuda:0')
ic| start_step: 50000
ic| f"Rendering images to: {render_out_dir}": 'Rendering images to: ./exp_iron_stage2/drv/rabbit/render_test_50000'
  2%|█         | 2/82 [00:23<15:21, 11.52s/it]
Traceback (most recent call last):
  File "render_surface.py", line 367, in <module>
    results = render_camera(
  File "/home/michael/iron/models/raytracer.py", line 834, in render_camera
    results = raytrace_camera(
  File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/michael/iron/models/raytracer.py", line 581, in raytrace_camera
    results = raytrace_pixels(sdf_network, raytracer, camera.get_uv(), camera, max_num_rays=max_num_rays)
  File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/michael/iron/models/raytracer.py", line 392, in raytrace_pixels
    results = raytracer(
  File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/michael/iron/models/raytracer.py", line 67, in forward
    (sampler_convergent_mask, sampler_points, sampler_sdf, sampler_dis,) = self.ray_sampler(
  File "/home/michael/iron/models/raytracer.py", line 154, in ray_sampler
    sdf_val.append(sdf(pnts))
  File "/home/michael/iron/models/raytracer.py", line 370, in <lambda>
    sdf = lambda x: sdf_network(x)[..., 0]
  File "/home/michael/anaconda3/envs/iron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/michael/iron/models/fields.py", line 92, in forward
    x = torch.cat([x, inputs], -1) / np.sqrt(2)
RuntimeError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 5.80 GiB total capacity; 3.94 GiB already allocated; 131.62 MiB free; 4.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
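Both failures share one pattern: a single huge batch of 3D points is pushed through the GPU at once; in the export case it is the full marching-cubes grid built in `get_grid` (models/export_mesh.py). A generic workaround is to keep the grid in CPU memory and move only small chunks to the GPU while querying the SDF. The sketch below illustrates that pattern only; it is not IRON's code, and `sdf_fn`, the [-1, 1] bounding box, the resolution, and the chunk size are all placeholders:

```python
# Sketch of chunked SDF evaluation over a dense grid; not IRON's actual export code.
import numpy as np
import torch

def eval_sdf_on_grid(sdf_fn, resolution=256, chunk_size=65536, device="cuda"):
    # Build the grid in CPU RAM only; never materialize all points on the GPU at once.
    axis = np.linspace(-1.0, 1.0, resolution, dtype=np.float32)
    xx, yy, zz = np.meshgrid(axis, axis, axis, indexing="ij")
    pts = np.stack([xx.ravel(), yy.ravel(), zz.ravel()], axis=-1)  # (resolution**3, 3)

    sdf_vals = np.empty(pts.shape[0], dtype=np.float32)
    with torch.no_grad():
        for start in range(0, pts.shape[0], chunk_size):
            chunk = torch.from_numpy(pts[start:start + chunk_size]).to(device)
            sdf_vals[start:start + chunk_size] = sdf_fn(chunk).cpu().numpy()
    # Reshape to a volume that marching cubes (e.g. skimage.measure.marching_cubes) can consume.
    return sdf_vals.reshape(resolution, resolution, resolution)
```

The same idea applies to the rendering failure: fewer rays or points per forward pass, at the cost of more iterations.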

Michaelwhite34 · Aug 23 '22 01:08

And another question: after preparing my own images, do I just need to run colmap_runner to get kai_cameras_normalized.json and rename it to cam_dict_norm.json?

Michaelwhite34 · Aug 23 '22 01:08

@Kai-46 Yeah, I met the same problem when training on the superman dataset. I found that in models/export_mesh.py, `grid_points = torch.tensor(np.vstack([xx.ravel(), yy.ravel(), zz.ravel()]).T, dtype=torch.float).cuda()` is the culprit; xx, yy, and zz are huge.
How do I change the default settings?

ForrestPi · Sep 03 '22 16:09

You can use a lower resolution to voxelize the neural SDF, at a potential sacrifice of final mesh accuracy: https://github.com/Kai-46/IRON/blob/8e9a7c172542afd52b8e6ef28bc96ad52b5ffd5a/models/export_mesh.py#L50
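To make the trade-off concrete: the dense grid holds resolution³ points of three float32 coordinates, so memory drops by 8x every time the resolution is halved, while the voxel size, and hence the finest surface detail the extracted mesh can resolve, doubles. A quick back-of-the-envelope sketch, assuming an illustrative unit-length bounding box (the actual default resolution and bounds are defined around the linked line):

```python
# Sketch: grid memory vs. voxel size as the marching-cubes resolution is lowered.
BOX_SIDE = 1.0  # illustrative only; IRON's real bounds come from the scene normalization

for resolution in (512, 384, 256, 128):
    points = resolution ** 3
    grid_gib = points * 3 * 4 / 2**30   # 3 float32 coordinates per grid point
    voxel = BOX_SIDE / resolution       # larger voxels -> coarser mesh detail
    print(f"res={resolution:3d}: {points:>11,d} points, "
          f"{grid_gib:4.2f} GiB grid tensor, voxel size {voxel:.4f}")
```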

Kai-46 · Oct 07 '22 05:10