
test_tcnn_instant_ngp_field() Segmentation Fault

Open tancik opened this issue 1 year ago • 11 comments

test_tcnn_instant_ngp_field() segfaults. This test is only run locally, since GitHub Actions doesn't have tcnn. Interestingly, the segfault does not occur if you run the test under the debugger.

tancik avatar Jul 30 '22 22:07 tancik

I think this may have emerged with the recent instant_ngp implementation updates. Maybe @liruilong940607 has some ideas?

tancik avatar Aug 02 '22 00:08 tancik

Somehow I can't reproduce the segfault locally, on the latest master branch, by running either

pytest tests/fields/test_fields.py

or

python tests/fields/test_fields.py

I even looped the test 1000 times and still can't see any errors.
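For anyone who wants to repeat this kind of loop, here is a minimal sketch (not the script actually used above; `count_segfaults` is a hypothetical helper). It runs every attempt in a fresh subprocess so a crash only kills the child, and it relies on the POSIX convention that `subprocess` reports a negative return code when the child dies from a signal.

```python
import signal
import subprocess
import sys


def count_segfaults(cmd, runs):
    """Run `cmd` `runs` times in fresh processes and count SIGSEGV deaths."""
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX, a return code of -N means the child was killed by signal N.
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

Calling it as `count_segfaults([sys.executable, "-m", "pytest", "tests/fields/test_fields.py"], runs=1000)` would approximate the loop described here. Ordinary test failures are not counted, only processes killed by SIGSEGV.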

Can you run the NGP training successfully?

liruilong940607 avatar Aug 02 '22 00:08 liruilong940607

Also, it's quite weird for this test to fail, because it only tests the field. It doesn't call any of the CUDA functions used by instant_ngp.

liruilong940607 avatar Aug 02 '22 00:08 liruilong940607

Would you be able to locate which commit might have caused this? I.e., which commit do you have to revert to for the test to pass?

liruilong940607 avatar Aug 02 '22 00:08 liruilong940607

NGP training works for me. Oddly, it only fails for me when I don't run with the debugger. I think Ethan has also seen this segfault. I'll try to figure out which PR introduced this.

tancik avatar Aug 02 '22 02:08 tancik

Hmmm, even more mysterious. It doesn't seem tied to a commit: I tried going back pretty far, and this only became an issue in the last week or so. Maybe some package version? @ethanweber are you also still having this issue?

tancik avatar Aug 02 '22 04:08 tancik

The fact that the VS Code debugger doesn't fail this test also makes me think there might be a difference between your local environment and the VS Code settings.
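One way to check for such a difference is to dump the same snapshot once from the terminal and once from the debugger, then diff the two outputs. This is a minimal sketch with hypothetical helper names; the default list of environment variables is only a guess at what might matter here.

```python
import importlib.metadata
import json
import os
import sys


def environment_snapshot(
    packages, env_vars=("PATH", "LD_LIBRARY_PATH", "CUDA_VISIBLE_DEVICES")
):
    """Collect interpreter, environment-variable, and package-version details."""
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this interpreter
    return {
        "executable": sys.executable,
        "python": sys.version,
        "env": {var: os.environ.get(var) for var in env_vars},
        "packages": versions,
    }


def dump_snapshot(packages):
    """Serialize the snapshot with sorted keys so two dumps diff cleanly."""
    return json.dumps(environment_snapshot(packages), indent=2, sort_keys=True)
```

For example, `print(dump_snapshot(["torch", "tinycudann"]))`, assuming those are the distribution names installed in your environment. This would also surface a package-version difference, if that turns out to be the cause.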

liruilong940607 avatar Aug 02 '22 07:08 liruilong940607

But you can run NGP training without issue... Hmmm, no idea what is going on.

liruilong940607 avatar Aug 02 '22 07:08 liruilong940607

I do still have the issue but training is fine. Pretty much the same experience as Matt.

ethanweber avatar Aug 02 '22 10:08 ethanweber

I've been having similar issues. I ran NGP with faulthandler enabled, and this is what I got:

[2022-08-14 15:42:17,618][root][INFO] - Continuing without viewer.
[2022-08-14 15:42:17,874][root][INFO] - No eval dataset specified so using train dataset for eval.
Fatal Python error: Segmentation fault

Thread 0x00007f6b05bc3700 (most recent call first):
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/threading.py", line 306 in wait
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/queue.py", line 179 in get
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 227 in run
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007f6b658880c0 (most recent call first):
  File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/fields/density_fields/density_grid.py", line 107 in __init__
  File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/utils/misc.py", line 90 in instantiate_from_dict_config
  File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/models/base.py", line 94 in populate_density_field
  File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/models/base.py", line 74 in __init__
  File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/models/instant_ngp.py", line 50 in __init__
  File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/utils/misc.py", line 90 in instantiate_from_dict_config
  File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/models/base.py", line 232 in setup_model
  File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/utils/profiler.py", line 38 in wrapper
  File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/pipelines/base.py", line 139 in setup_pipeline
  File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/utils/profiler.py", line 38 in wrapper
  File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/engine/trainer.py", line 81 in setup
  File "scripts/run_train.py", line 131 in _train
  File "scripts/run_train.py", line 165 in launch
  File "scripts/run_train.py", line 225 in main
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/core/utils.py", line 186 in run_job
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 119 in run
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/_internal/utils.py", line 453 in <lambda>
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/_internal/utils.py", line 213 in run_and_report
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/_internal/utils.py", line 452 in _run_app
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/_internal/utils.py", line 389 in _run_hydra
  File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/main.py", line 90 in decorated_main
  File "scripts/run_train.py", line 236 in <module>
/var/lib/slurm-llnl/slurmd/job08862/slurm_script: line 30: 50249 Segmentation fault      python scripts/run_train.py --config-name=graph_instant_ngp.yaml
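For reference, a trace like the one above can be obtained with the standard-library `faulthandler` module. The sketch below is a minimal example, not the exact setup used for this run; `enable_crash_traces` is a hypothetical helper name.

```python
import faulthandler
import sys


def enable_crash_traces(stream=None):
    """Dump the tracebacks of all threads on SIGSEGV and similar fatal signals."""
    if stream is None:
        stream = sys.stderr
    if not faulthandler.is_enabled():
        # The stream must expose fileno(); faulthandler writes to the raw descriptor.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

Calling `enable_crash_traces()` at the top of the entry script is enough. The same effect is available without code changes via `python -X faulthandler` or by setting `PYTHONFAULTHANDLER=1`.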

akristoffersen avatar Aug 14 '22 22:08 akristoffersen

@akristoffersen I think your issue is a different one. It seems to be caused by this line:

https://github.com/plenoptix/nerfactory/blob/387057cd61c9484e1457fccc1e8119b9149cd41b/nerfactory/fields/density_fields/density_grid.py#L106-L107

which requires a GPU ("cuda:0") to run.

Also, it seems like you are using Slurm to run it? I'm not familiar with Slurm, but my guess is that it does some special device management.
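A quick way to test that guess from inside the job is to look at `CUDA_VISIBLE_DEVICES`, which many schedulers use to restrict the GPUs a process can see. Whether this cluster's Slurm configuration does so is an assumption. The helpers below are hypothetical and purely heuristic: they only inspect the environment and do not confirm that a GPU actually exists.

```python
import os


def visible_cuda_devices(environ=None):
    """Parse CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (no restriction applied),
    otherwise the list of device identifiers the process is allowed to see.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [entry.strip() for entry in raw.split(",") if entry.strip()]


def cuda_zero_reachable(environ=None):
    """Heuristic: "cuda:0" cannot exist if the job was given no visible devices."""
    devices = visible_cuda_devices(environ)
    return devices is None or len(devices) > 0
```

If `cuda_zero_reachable()` returns False inside the Slurm job, the job was launched with no GPU visible. For the authoritative runtime check, use `torch.cuda.is_available()`.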

liruilong940607 avatar Aug 15 '22 16:08 liruilong940607