nerfstudio
test_tcnn_instant_ngp_field() Segmentation Fault
test_tcnn_instant_ngp_field() has a segfault. This test is only run locally, since the GitHub Actions runners don't have tcnn. Interestingly, the segfault does not occur if you run the test under the debugger.
I think this may have emerged with the recent instant_ngp
implementation updates. Maybe @liruilong940607 has some ideas?
Somehow I can't reproduce the segfault on my local machine with the latest master branch, running either
pytest tests/fields/test_fields.py
or
python tests/fields/test_fields.py
I even looped the test 1000 times and still can't see any errors.
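For reference, an in-process stress loop like the one described could look like this sketch (the `run_repeatedly` helper is hypothetical, and a no-op lambda stands in for the real test function from tests/fields/test_fields.py):

```python
# Sketch of an in-process stress loop: a segfault inside the test body
# would kill the interpreter immediately, so finishing every iteration
# counts as a pass.
def run_repeatedly(test_fn, n=1000):
    for _ in range(n):
        test_fn()
    return n

# A no-op lambda stands in for test_tcnn_instant_ngp_field here.
completed = run_repeatedly(lambda: None, n=1000)
print(completed)  # -> 1000
```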
Can you run the NGP training successfully?
Also, it is quite strange for this test to fail because it only tests the field. It doesn't call any of the CUDA functions used by instant_ngp.
Would you be able to locate which commit might have caused this? i.e., find the commit at which reverting makes the test pass.
NGP training works for me. Oddly, it only fails for me when I don't run it with the debugger. I think Ethan has also seen this segfault. I'll try to figure out which PR introduced it.
Hmmm, even more mysterious. It doesn't seem tied to a commit: I tried going back pretty far, and this only became an issue in the last week or so. Maybe some package version? @ethanweber are you also still having this issue?
The fact that the VS Code debugger doesn't fail this test also makes me feel there might be a difference between your local environment and the VS Code settings.
But you can run NGP training without issue... hmmm, no idea what is going on.
I do still have the issue but training is fine. Pretty much the same experience as Matt.
I've been having similar issues. I ran NGP with faulthandler enabled, and this is what I got:
[2022-08-14 15:42:17,618][root][INFO] - Continuing without viewer.
[2022-08-14 15:42:17,874][root][INFO] - No eval dataset specified so using train dataset for eval.
Fatal Python error: Segmentation fault
Thread 0x00007f6b05bc3700 (most recent call first):
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/threading.py", line 306 in wait
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/queue.py", line 179 in get
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 227 in run
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/threading.py", line 890 in _bootstrap
Current thread 0x00007f6b658880c0 (most recent call first):
File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/fields/density_fields/density_grid.py", line 107 in __init__
File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/utils/misc.py", line 90 in instantiate_from_dict_config
File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/models/base.py", line 94 in populate_density_field
File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/models/base.py", line 74 in __init__
File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/models/instant_ngp.py", line 50 in __init__
File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/utils/misc.py", line 90 in instantiate_from_dict_config
File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/models/base.py", line 232 in setup_model
File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/utils/profiler.py", line 38 in wrapper
File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/pipelines/base.py", line 139 in setup_pipeline
File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/utils/profiler.py", line 38 in wrapper
File "/home/eecs/akristoffersen/kair/pyrad/nerfactory/engine/trainer.py", line 81 in setup
File "scripts/run_train.py", line 131 in _train
File "scripts/run_train.py", line 165 in launch
File "scripts/run_train.py", line 225 in main
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/core/utils.py", line 186 in run_job
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 119 in run
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/_internal/utils.py", line 453 in <lambda>
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/_internal/utils.py", line 213 in run_and_report
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/_internal/utils.py", line 452 in _run_app
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/_internal/utils.py", line 389 in _run_hydra
File "/data/akristoffersen/anaconda/envs/nerfactory/lib/python3.8/site-packages/hydra/main.py", line 90 in decorated_main
File "scripts/run_train.py", line 236 in <module>
/var/lib/slurm-llnl/slurmd/job08862/slurm_script: line 30: 50249 Segmentation fault python scripts/run_train.py --config-name=graph_instant_ngp.yaml
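(For anyone else trying to reproduce this: faulthandler is in the Python standard library and can be enabled either with `python -X faulthandler` or programmatically, e.g.:)

```python
import faulthandler

# Enable the fault handler so a SIGSEGV dumps the Python traceback of
# every thread (like the log above) instead of the process dying silently.
faulthandler.enable()
print(faulthandler.is_enabled())  # -> True
```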
@akristoffersen I think your issue is a different one. It seems to be caused by this line:
https://github.com/plenoptix/nerfactory/blob/387057cd61c9484e1457fccc1e8119b9149cd41b/nerfactory/fields/density_fields/density_grid.py#L106-L107
which requires a GPU ("cuda:0") to run.
Also, it seems like you are using Slurm to run it? I'm not familiar with Slurm, but my guess is that it does some special device management.
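A minimal sketch of the kind of guard that would avoid hard-coding the device (the `pick_device` helper is hypothetical; in practice `cuda_available` would come from `torch.cuda.is_available()`):

```python
def pick_device(cuda_available: bool, preferred: str = "cuda:0") -> str:
    # Fall back to CPU when no GPU is visible (e.g. on a Slurm node that
    # restricts devices), instead of unconditionally requesting "cuda:0".
    if preferred.startswith("cuda") and not cuda_available:
        return "cpu"
    return preferred

print(pick_device(True))   # -> cuda:0
print(pick_device(False))  # -> cpu
```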
No longer an issue.