permuto_sdf

Segmentation fault when training with the viewer

Open cduguet opened this issue 1 year ago • 2 comments

Hello, I'm running on a remote EC2 instance with 24GB VRAM and 64GB RAM, which I access through a remote desktop client called NICE DCV (an enterprise competitor to VNC, free on EC2).

I can train without the viewer with no problems. However, when I run with the viewer, I get a segmentation fault: the app window opens, but it crashes before anything loads.

I have tried both the experimental and the normal Docker builds (Docker is the only setup I have tried). I have also checked out multiple versions of the repo (783c41f and e72ae5b) to see whether the problem was introduced recently. Nothing has worked so far. The output looks like this:

/workspace/permuto_sdf$ ./permuto_sdf_py/train_permuto_sdf.py --dataset dtu --scene dtu_scan24 --comp_name comp_3 --exp_info default 
args.with_mask False
args.low_res False
checkpoint_path /workspace/permuto_sdf/checkpoints
with_viewer True
has_apex True
[    D96CB740]DataLoaderDTU.cxx:173      1| loaded nr of scenes 1 for mode train
[    D96CB740]DataLoaderDTU.cxx:432      1| reading poses and intrinsics for scene "dtu_scan24"
[    D96CB740]DataLoaderDTU.cxx:173      1| loaded nr of scenes 1 for mode test
[    D96CB740]DataLoaderDTU.cxx:432      1| reading poses and intrinsics for scene "dtu_scan24"
[    D96CB740]    Mesh.cxx:3390     1| read obj with path /workspace/easy_pbr/data/sphere.obj
Segmentation fault (core dumped)

In contrast, when I train without a viewer, it looks like this:

/workspace/permuto_sdf$ ./permuto_sdf_py/train_permuto_sdf.py --dataset dtu --scene dtu_scan24 --comp_name comp_3 --exp_info default --no_viewer
args.with_mask False
args.low_res False
checkpoint_path /workspace/permuto_sdf/checkpoints
with_viewer False
has_apex True
[    2A5FF740]DataLoaderDTU.cxx:173      1| loaded nr of scenes 1 for mode train
[    2A5FF740]DataLoaderDTU.cxx:432      1| reading poses and intrinsics for scene "dtu_scan24"
[    2A5FF740]DataLoaderDTU.cxx:173      1| loaded nr of scenes 1 for mode test
[    2A5FF740]DataLoaderDTU.cxx:432      1| reading poses and intrinsics for scene "dtu_scan24"
phase.iter_nr 1000 loss  1.3530950546264648
phase.iter_nr 2000 loss  0.15609805285930634
phase.iter_nr 3000 loss  0.10311679542064667
...

How should I best troubleshoot this?
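
One generic first step I could try, not specific to permuto_sdf, is to run the script with Python's standard faulthandler enabled so the interpreter prints the Python-level stack when the process receives SIGSEGV. Below is a minimal sketch of that idea. The helper names `run_with_faulthandler` and `describe_exit` are hypothetical and only for illustration; the `-X faulthandler` interpreter option and the `subprocess` and `signal` modules are from the Python standard library.

```python
import signal
import subprocess
import sys


def run_with_faulthandler(script_args):
    """Run a Python script in a child process with faulthandler enabled.

    script_args is the script path followed by its own arguments.
    Returns (returncode, stdout, stderr) of the child process.
    """
    # -X faulthandler makes the child dump its Python traceback to stderr
    # on fatal signals such as SIGSEGV.
    cmd = [sys.executable, "-X", "faulthandler", *script_args]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    # On POSIX, a negative return code means the child was killed by
    # the signal with that number.
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code " + str(returncode)
```

For the failing run above, `script_args` would be the script path followed by the same flags, for example `["./permuto_sdf_py/train_permuto_sdf.py", "--dataset", "dtu", "--scene", "dtu_scan24", "--comp_name", "comp_3", "--exp_info", "default"]`. Note that faulthandler only reports Python frames, so if the crash happens inside native code it will show which Python call triggered it but not the native stack.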

cduguet, Jul 06 '23 20:07