neuralangelo
Mesh extraction issue with outdoor scene
Hi, after successfully visualizing the Lego example with a great-looking mesh, I decided to try an outdoor scene (SCENE_TYPE = outdoor) with more images. When running the mesh extraction command, I encountered an issue, and I'm not sure whether it's a GPU memory problem or not.
Here is the command I use:
torchrun --nproc_per_node=${GPUS} projects/neuralangelo/scripts/extract_mesh.py --config=${CONFIG} --checkpoint=${CHECKPOINT} --output_file=${OUTPUT_MESH} --resolution=${RESOLUTION} --block_res=${BLOCK_RES} --textured --keep_lcc
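For completeness, the variables were set along these lines (the paths and values below are illustrative placeholders, not my actual ones):

```shell
# Illustrative values only -- these paths are placeholders,
# not the ones from my actual run.
GPUS=1
CONFIG=logs/example_group/example_name/config.yaml
CHECKPOINT=logs/example_group/example_name/latest_checkpoint.pt
OUTPUT_MESH=logs/example_group/example_name/mesh.ply
RESOLUTION=2048
BLOCK_RES=128
```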
And here is the error log:
(Setting affinity with NVML failed, skipping...)
Running mesh extraction with 1 GPUs.
Setup trainer.
Using random seed 0
/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/tinycudann/modules.py:53: UserWarning: tinycudann was built for lower compute capability (86) than the system's (89). Performance may be suboptimal.
warnings.warn(f"tinycudann was built for lower compute capability ({cc}) than the system's ({system_compute_capability}). Performance may be suboptimal.")
model parameter count: 99,705,900
Initialize model weights using type: none, gain: None
Using random seed 0
Allow TensorFloat32 operations on supported devices
Loading checkpoint (local): logs/MBC_group/MBC50_R1/epoch_00311_iteration_000500000_checkpoint.pt
- Loading the model...
Done with loading the checkpoint.
Extracting surface at resolution 1536 931 1323
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 7223) of binary: /home/ryan_lin/miniconda3/envs/neuralangelo/bin/python
Traceback (most recent call last):
File "/home/ryan_lin/miniconda3/envs/neuralangelo/bin/torchrun", line 10, in <module>
sys.exit(main())
File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ryan_lin/miniconda3/envs/neuralangelo/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
projects/neuralangelo/scripts/extract_mesh.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-09-23_10:43:10
host : RyanLegionPro7i.
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 7223)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 7223
=====================================================
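For reference, exitcode -9 means the process received SIGKILL, which on Linux is usually the kernel OOM killer reclaiming exhausted host RAM (not GPU VRAM). A rough sketch of why the resolution in the log might blow past RAM (assuming one float32 per voxel; actual usage with marching-cubes buffers and texture data would be higher):

```python
# Back-of-the-envelope host-memory estimate for a dense float32 grid
# at the resolution printed in the log (1536 x 931 x 1323).
# Assumption: 4 bytes per voxel; this ignores all other overheads,
# so treat it as a lower bound.

def grid_gib(nx, ny, nz, bytes_per_voxel=4):
    """Memory in GiB for a dense grid of nx * ny * nz voxels."""
    return nx * ny * nz * bytes_per_voxel / 2**30

print(f"{grid_gib(1536, 931, 1323):.1f} GiB")  # grid from the failing run
print(f"{grid_gib(512, 512, 512):.1f} GiB")    # the setting that succeeded
```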
I tried adjusting some parameters, such as RESOLUTION and BLOCK_RES, in the command to see whether it makes any difference. The only parameter set that succeeds is RESOLUTION=512 and BLOCK_RES=32, where the quality is extremely bad (the output PLY file is 90 MB, while the Lego example PLY file is 172 MB). Is there any way I could successfully extract a mesh with better quality?
Hi @Ryan-ZL-Lin, you could set a higher RESOLUTION while keeping the same BLOCK_RES for the GPU memory budget.
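My understanding of why this works (a simplified model on my part, not the exact accounting in extract_mesh.py): the volume is evaluated in blocks of BLOCK_RES^3 voxels, so per-block memory stays fixed and a larger RESOLUTION only multiplies the number of blocks, i.e. trades memory for time:

```python
import math

def blocks_and_mem(resolution, block_res, bytes_per_voxel=4):
    """Number of blocks and per-block voxel memory (MiB) for a
    block-wise grid evaluation. Simplified model: float32 voxels,
    cubic volume, no per-block overlap."""
    n_blocks = math.ceil(resolution / block_res) ** 3
    mem_mib = block_res ** 3 * bytes_per_voxel / 2**20
    return n_blocks, mem_mib

for res in (512, 2048, 4096):
    n, mem = blocks_and_mem(res, 32)
    print(f"RESOLUTION={res}: {n} blocks, {mem:.3f} MiB of voxels each")
```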
Thanks @chenhsuanlin
Is there any recommended range for RESOLUTION? For example, any number from 2048 to 8192, as long as it's a power of 2?
@chenhsuanlin
I tried out your suggestion, setting RESOLUTION=4096 and BLOCK_RES=32 to extract the surface for a 40-second video.
Initially, the estimated time to complete was around 4 hours (~300 iterations per second), and it ran smoothly. However, after about 1 hour, the progress started to slow down quite a lot. Here are the screenshots for your reference.
Issue: although the surface extraction process didn't stop, the estimated time grew to 1120 hours.
I checked the GPU and VRAM utilization, and it turned out that they were not being utilized properly.
The progress then became even worse; the estimated time to complete changed to 17272 hours...
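In case it helps diagnose this, my guess (unconfirmed) is that the host gradually runs out of RAM and starts swapping, which would explain iterations collapsing from ~300/s to a crawl while the GPU sits idle. A quick Linux-only check to run while extraction is in progress:

```python
# Check host memory/swap pressure by reading /proc/meminfo (Linux-only).
# If MemAvailable is near zero and SwapFree keeps dropping during
# extraction, the slowdown is likely swap thrashing, not a GPU issue.

def meminfo():
    """Parse /proc/meminfo into a dict of field name -> value in kB."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are reported in kB
    return info

m = meminfo()
print("MemAvailable:", m["MemAvailable"] // 1024, "MiB")
print("SwapFree:", m.get("SwapFree", 0) // 1024, "MiB")
```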