Gaussian-SLAM
Tracking failure causes NaN
Thanks for the work and the very clean code. I'm running run_slam.py on my own RGBD dataset, and while it got off to a good start, at some point tracking seems to have failed: cam_quad_err and cam_trans_err became NaN.
This kills the script because a singular matrix cannot be inverted:
Traceback (most recent call last):
File "/home/abhishek/Code/Gaussian-SLAM/run_slam.py", line 111, in <module>
File "/home/abhishek/Code/Gaussian-SLAM/src/entities/gaussian_slam.py", line 155, in run
opt_dict = self.mapper.map(frame_id, estimate_c2w, gaussian_model, new_submap)
File "/home/abhishek/Code/Gaussian-SLAM/src/entities/mapper.py", line 221, in map
"render_settings": get_render_settings(
File "/home/abhishek/Code/Gaussian-SLAM/src/utils/utils.py", line 93, in get_render_settings
cam_center = torch.inverse(w2c)[:3, 3]
torch._C._LinAlgError: linalg.inv: The diagonal element 2 is zero, the inversion could not be completed because the input matrix is singular.
I also noticed that color_loss and depth_loss are 0.00000, not just for the RGBD frame at which the NaN occurred but for several frames before it, although those frames didn't have NaN cam_quad_err and cam_trans_err, which I guess is why gslam.run() kept going. In the frames leading up to the NaN, the tracking errors are extremely high (for the frame right before it, cam_quad_err: 0.53914, cam_trans_err: 3293494.00000), and looking at the RGBD renders in the mapping_vis folder, the scene has completely fallen apart.
Scrolling back up in the logs to where cam_trans_err still had saner values like 0.33, the deterioration in tracking seems to have started at a frame where the tracking iterations were doubled, presumably to cope with a "higher initial loss":
Higher initial loss, increasing num_iters to 400
From there on out, cam_trans_err kept growing: 0.58735, 0.98641, 1.39726, 2.21376, 3.07547, 3.95557, 4.83956, ...
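For reference, a defensive guard like the sketch below (hypothetical, not part of the repo; w2c is the 4x4 world-to-camera matrix that get_render_settings inverts) would at least surface the degenerate pose with a clearer error before torch.inverse fails:

import torch

def assert_valid_pose(w2c: torch.Tensor, frame_id: int) -> None:
    # Hypothetical guard: catch a NaN/degenerate pose before torch.inverse()
    # raises the opaque "singular matrix" error seen in the traceback above.
    if not torch.isfinite(w2c).all():
        raise ValueError(f"frame {frame_id}: w2c contains NaN/Inf values")
    R = w2c[:3, :3]
    if not torch.allclose(R @ R.T, torch.eye(3, device=w2c.device, dtype=w2c.dtype), atol=1e-4):
        raise ValueError(f"frame {frame_id}: rotation block of w2c is not orthonormal")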
Hi @abhishek47kashyap, thanks for your interest in our work!
May I ask what type of scene the data was recorded in (room-scale or larger), and how large the motions between frames are in general? Tracking can fail under large motions. If you haven't already, you could try the config we used for the ScanNet++ dataset and see if there's any difference.
Hi, I have come across the same issue when working with the Replica dataset. Apart from the scene office0, I encounter this problem with all other scenes. I have carefully set up my environment according to your environment.yml and made sure all Git repositories are on the specified branches. The problem does not seem to be caused by the environment, since the code does run. Thank you for providing assistance. Alternatively, could you please provide the reconstructed mesh files for the Replica dataset?
Tracking frame 178
iter: 0, color_loss: 13612.00684, depth_loss: 819.75806 , cam_quad_err: nan, cam_trans_err: nan
iter: 20, color_loss: 0.00000, depth_loss: 0.00000 , cam_quad_err: nan, cam_trans_err: nan
iter: 40, color_loss: 0.00000, depth_loss: 0.00000 , cam_quad_err: nan, cam_trans_err: nan
frame_id: 178, cam_quad_err: nan, cam_trans_err: nan , cam_quad_err: nan, cam_trans_err: nan
Traceback (most recent call last):
File "/home/t5820/yhy/Gaussian-SLAM/run_slam.py", line 109, in <module>
gslam.run()
File "/home/t5820/yhy/Gaussian-SLAM/src/entities/gaussian_slam.py", line 137, in run
estimated_c2w = self.tracker.track(
File "/home/t5820/yhy/Gaussian-SLAM/src/entities/tracker.py", line 138, in track
render_settings = get_render_settings(
File "/home/t5820/yhy/Gaussian-SLAM/src/utils/utils.py", line 93, in get_render_settings
cam_center = torch.inverse(w2c)[:3, 3]
torch._C._LinAlgError: linalg.inv: The diagonal element 2 is zero, the inversion could not be completed because the input matrix is singular.
Or it sometimes shows:
Tracking frame 110
iter: 0, color_loss: 12819.21484, depth_loss: 742.16980 , cam_quad_err: 0.00011, cam_trans_err: 0.00277
iter: 20, color_loss: 11323.96875, depth_loss: 547.08099 , cam_quad_err: 0.00012, cam_trans_err: 0.00110
iter: 40, color_loss: 10657.52344, depth_loss: 585.00720 , cam_quad_err: 0.00014, cam_trans_err: 0.00095
frame_id: 110, cam_quad_err: 0.00013, cam_trans_err: 0.00103 , cam_quad_err: 0.00013, cam_trans_err: 0.00103
Mapping frame 110
Number of added points: 29225
Gaussian model size 829976
Traceback (most recent call last):
File "/home/t5820/yhy/Gaussian-SLAM/run_slam.py", line 109, in <module>
gslam.run()
File "/home/t5820/yhy/Gaussian-SLAM/src/entities/gaussian_slam.py", line 152, in run
opt_dict = self.mapper.map(frame_id, estimate_c2w, gaussian_model, new_submap)
File "/home/t5820/yhy/Gaussian-SLAM/src/entities/mapper.py", line 236, in map
opt_dict = self.optimize_submap([(frame_id, keyframe)] + self.keyframes, gaussian_model, max_iterations)
File "/home/t5820/yhy/Gaussian-SLAM/src/entities/mapper.py", line 142, in optimize_submap
image[:, mask], gt_image[:, mask]) + self.opt.lambda_dssim * (1.0 - ssim(image, gt_image))
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with "TORCH_USE_CUDA_DSA" to enable device-side assertions.
Hey there, thanks for your interest in our work!
We are very confident that with the "normal" setup, our code is working on Replica and other reported datasets - we tested this many times.
By any chance, did you change any of the hyperparameters? Also, what are your open3d and Cuda versions?
I'm not sure we have the meshes somewhere, but we will try to look for them.
Thank you for your reply. I have successfully tested the code on the TUM and ScanNet datasets without modifying any of the hyperparameters. However, I encountered the two previously mentioned bugs alternately when running the code on the Replica dataset. Here is the environment:
(Gslam) t5820@t5820:~/yhy/Gaussian-SLAM$ conda list | grep cuda
cuda-cccl 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-command-line-tools 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-compiler 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-cudart 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-cudart-dev 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-cudart-static 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-cuobjdump 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-cupti 12.1.62 0 nvidia/label/cuda-12.1.0
cuda-cupti-static 12.1.62 0 nvidia/label/cuda-12.1.0
cuda-cuxxfilt 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-documentation 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-driver-dev 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-gdb 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-libraries 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-libraries-dev 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-libraries-static 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-nsight 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nsight-compute 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-nvcc 12.1.66 0 nvidia/label/cuda-12.1.0
cuda-nvdisasm 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvml-dev 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvprof 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvprune 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvrtc 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvrtc-dev 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvrtc-static 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvtx 12.1.66 0 nvidia/label/cuda-12.1.0
cuda-nvvp 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-opencl 12.1.56 0 nvidia/label/cuda-12.1.0
cuda-opencl-dev 12.1.56 0 nvidia/label/cuda-12.1.0
cuda-profiler-api 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-runtime 12.1.0 0 nvidia
cuda-sanitizer-api 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-toolkit 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-tools 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-visual-tools 12.1.0 0 nvidia/label/cuda-12.1.0
faiss-gpu 1.8.0 py3.10_h4c7d538_0_cuda12.1.1 pytorch
gds-tools 1.6.0.25 0 nvidia/label/cuda-12.1.0
libcublas 12.1.0.26 0 nvidia/label/cuda-12.1.0
libcublas-dev 12.1.0.26 0 nvidia/label/cuda-12.1.0
libcublas-static 12.1.0.26 0 nvidia/label/cuda-12.1.0
libcufft 11.0.2.4 0 nvidia/label/cuda-12.1.0
libcufft-dev 11.0.2.4 0 nvidia/label/cuda-12.1.0
libcufft-static 11.0.2.4 0 nvidia/label/cuda-12.1.0
libcufile 1.6.0.25 0 nvidia/label/cuda-12.1.0
libcufile-dev 1.6.0.25 0 nvidia/label/cuda-12.1.0
libcufile-static 1.6.0.25 0 nvidia/label/cuda-12.1.0
libcurand 10.3.2.56 0 nvidia/label/cuda-12.1.0
libcurand-dev 10.3.2.56 0 nvidia/label/cuda-12.1.0
libcurand-static 10.3.2.56 0 nvidia/label/cuda-12.1.0
libcusolver 11.4.4.55 0 nvidia/label/cuda-12.1.0
libcusolver-dev 11.4.4.55 0 nvidia/label/cuda-12.1.0
libcusolver-static 11.4.4.55 0 nvidia/label/cuda-12.1.0
libcusparse 12.0.2.55 0 nvidia/label/cuda-12.1.0
libcusparse-dev 12.0.2.55 0 nvidia/label/cuda-12.1.0
libcusparse-static 12.0.2.55 0 nvidia/label/cuda-12.1.0
libfaiss 1.8.0 h046e95b_0_cuda12.1.1 pytorch
libnpp 12.0.2.50 0 nvidia/label/cuda-12.1.0
libnpp-dev 12.0.2.50 0 nvidia/label/cuda-12.1.0
libnpp-static 12.0.2.50 0 nvidia/label/cuda-12.1.0
libnvjitlink-dev 12.1.55 0 nvidia/label/cuda-12.1.0
libnvjpeg 12.1.0.39 0 nvidia/label/cuda-12.1.0
libnvjpeg-dev 12.1.0.39 0 nvidia/label/cuda-12.1.0
libnvjpeg-static 12.1.0.39 0 nvidia/label/cuda-12.1.0
libnvvm-samples 12.1.55 0 nvidia/label/cuda-12.1.0
nsight-compute 2023.1.0.15 0 nvidia/label/cuda-12.1.0
pytorch 2.1.2 py3.10_cuda12.1_cudnn8.9.2_0 pytorch
pytorch-cuda 12.1 ha16c6d3_6 pytorch
pytorch-mutex 1.0 cuda pytorch
(Gslam) t5820@t5820:~/yhy/Gaussian-SLAM$ conda list | grep open3d
open3d 0.18.0 pypi_0 pypi
(Gslam) t5820@t5820:~/yhy/Gaussian-SLAM$ python --version
Python 3.10.16
(Gslam) t5820@t5820:~/yhy/Gaussian-SLAM$ pip list | grep numpy
numpy 1.26.4
Hmm, that's strange. Can you try running with gt_camera: True?
The issue still persists. Additionally, I noticed that gt_camera is not used anywhere in the code, nor is it handled in the update_config_with_args function, even though it is defined in the config and the args.
Tracking frame 275
iter: 0, color_loss: 13580.01953, depth_loss: 1055.05933 , cam_quad_err: 0.00010, cam_trans_err: 0.00278
iter: 20, color_loss: 0.00000, depth_loss: 0.00000 , cam_quad_err: nan, cam_trans_err: nan
iter: 40, color_loss: 0.00000, depth_loss: 0.00000 , cam_quad_err: nan, cam_trans_err: nan
frame_id: 275, cam_quad_err: nan, cam_trans_err: nan , cam_quad_err: nan, cam_trans_err: nan
Mapping frame 275
Traceback (most recent call last):
File "/home/t5820/yhy/Gaussian-SLAM/run_slam.py", line 109, in <module>
gslam.run()
File "/home/t5820/yhy/Gaussian-SLAM/src/entities/gaussian_slam.py", line 152, in run
opt_dict = self.mapper.map(frame_id, estimate_c2w, gaussian_model, new_submap)
File "/home/t5820/yhy/Gaussian-SLAM/src/entities/mapper.py", line 234, in map
"render_settings": get_render_settings(
File "/home/t5820/yhy/Gaussian-SLAM/src/utils/utils.py", line 93, in get_render_settings
cam_center = torch.inverse(w2c)[:3, 3]
torch._C._LinAlgError: linalg.inv: The diagonal element 2 is zero, the inversion could not be completed because the input matrix is singular.
Thanks for spotting the gt flag - it is not functional.
We've double-checked and could reproduce the results obtained on Replica without this error.
It is a bit hard to debug it this way. From the log you attached, I see that the colour and depth losses are very large and then go to NaN; they are only valid on the very first iteration. Can you double-check that the input RGBD maps for these frames are not corrupted? Also, can you check the colour and depth renders of the Gaussians at every iteration of the tracking step?
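In case it helps narrow this down, a quick sanity check on a single input frame could look like the sketch below (file paths and the depth scale are placeholders; take them from your dataset config):

import cv2
import numpy as np

def check_rgbd_frame(color_path: str, depth_path: str, depth_scale: float = 6553.5):
    # Rough sanity check for one RGBD pair; depth_scale is a placeholder
    # (use the value from your dataset config).
    color = cv2.imread(color_path, cv2.IMREAD_COLOR)
    depth_raw = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED)
    assert color is not None and depth_raw is not None, "failed to read frame"
    depth = depth_raw.astype(np.float32) / depth_scale
    print("color range:", color.min(), color.max())
    print("depth range:", float(depth.min()), float(depth.max()))
    print("invalid depth fraction:", float(np.mean(~np.isfinite(depth) | (depth <= 0))))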
I believe I have identified the issue and modified the code to address it. In the tracking function, you optimize the quaternion and then convert it to a rotation matrix with the build_rotation function. However, for some reason (such as numerical instability), the resulting matrix sometimes does not satisfy the properties of a rotation matrix. I therefore cached opt_cam_rot and opt_cam_trans so they can be rolled back, and added validation of the generated cur_rel_w2c.
Additionally, there is another issue: since the gradient is backpropagated and the optimizer steps before the pose is read out, the current total_loss corresponds to the w2c from the previous iteration. Consequently, the best_w2c selected here might be suboptimal.
https://github.com/VladimirYugay/Gaussian-SLAM/blob/eaec10d73ce7511563882b8856896e06d1f804e3/src/entities/tracker.py#L174-L197
@Howie-Ye Hello, I have also encountered the same problem as you. Could you share the code you used to solve it? I would be very grateful for your help.
# Cache the last known-good pose parameters so we can roll back if the
# optimized quaternion no longer yields a valid rotation matrix.
opt_rot_cache, opt_trans_cache = opt_cam_rot.clone(), opt_cam_trans.clone()
for iter in range(num_iters):
    color_loss, depth_loss, _, _, _ = self.compute_losses(
        gaussian_model, render_settings, opt_cam_rot, opt_cam_trans, gt_color, gt_depth, depth_mask)
    total_loss = (self.w_color_loss * color_loss + (1 - self.w_color_loss) * depth_loss)
    total_loss.backward()
    gaussian_model.optimizer.step()
    gaussian_model.optimizer.zero_grad(set_to_none=True)

    with torch.no_grad():
        cur_quat, cur_trans = F.normalize(opt_cam_rot[None].clone().detach()), opt_cam_trans.clone().detach()
        rot_tmp = build_rotation(cur_quat)[0]
        if not self.is_rotation_matrix(rot_tmp):
            # The updated quaternion produced an invalid rotation: roll back
            # to the cached parameters and rebuild the rotation from them.
            opt_cam_trans = torch.nn.Parameter(opt_trans_cache)
            opt_cam_rot = torch.nn.Parameter(opt_rot_cache)
            cur_quat, cur_trans = F.normalize(opt_cam_rot[None].clone().detach()), opt_cam_trans.clone().detach()
            rot_tmp = build_rotation(cur_quat)[0]
        else:
            # Update the cache with the latest valid parameters.
            opt_rot_cache, opt_trans_cache = opt_cam_rot.clone(), opt_cam_trans.clone()

        if total_loss.item() < current_min_loss:
            current_min_loss = total_loss.item()
            best_w2c = torch.eye(4)
            best_w2c[:3, :3] = build_rotation(F.normalize(opt_cam_rot[None].clone().detach().cpu()))[0]
            best_w2c[:3, 3] = opt_cam_trans.clone().detach().cpu()

        cur_rel_w2c = torch.eye(4)
        cur_rel_w2c[:3, :3] = rot_tmp
        cur_rel_w2c[:3, 3] = cur_trans

def is_rotation_matrix(self, R):
    # A valid rotation matrix is orthonormal (R^T R = I) with determinant +1.
    identity = torch.eye(3, device=R.device)
    RtR = torch.mm(R.t(), R)
    orthogonality_check = torch.allclose(RtR, identity, atol=1e-6)
    det_check = torch.isclose(torch.det(R), torch.tensor(1.0, device=R.device), atol=1e-6)
    return orthogonality_check and det_check
See if this solves the problem for you.
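On the second point from my earlier comment (total_loss is computed with the pose from before the optimizer step), a minimal sketch of keeping best_w2c paired with the pose the loss was actually evaluated with, reusing the tracker's variable names (a sketch, not the repository's implementation):

# Snapshot the pose *before* the optimizer step, so best_w2c matches the
# pose that produced total_loss (variable names as in the tracking loop above).
with torch.no_grad():
    quat_used = F.normalize(opt_cam_rot[None].clone().detach())
    trans_used = opt_cam_trans.clone().detach()

total_loss.backward()
gaussian_model.optimizer.step()
gaussian_model.optimizer.zero_grad(set_to_none=True)

if total_loss.item() < current_min_loss:
    current_min_loss = total_loss.item()
    best_w2c = torch.eye(4)
    best_w2c[:3, :3] = build_rotation(quat_used.cpu())[0]
    best_w2c[:3, 3] = trans_used.cpu()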
@Howie-Ye I implemented a similar method and found that while it can prevent the crashes caused by NaN, it also prevents the optimizer's gradients from propagating, which accumulates tracking errors. I tried resetting the optimizer, but it produced the same NaN as the previous iteration because the input is the same. I also tried adding random perturbations in the odometry, with the same result.
It is unfortunate you are seeing these issues - we didn't encounter them ourselves. If it is a numerical error, could you try using this library? It has standard functions for converting back and forth between quaternions, rotation matrices, etc.
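For example, a conversion helper could look like the sketch below (this assumes PyTorch3D as the conversion library, since the exact library isn't named in this excerpt):

import torch
import torch.nn.functional as F
from pytorch3d.transforms import quaternion_to_matrix  # assumption: PyTorch3D as the conversion library

def quat_to_rotmat(opt_cam_rot: torch.Tensor) -> torch.Tensor:
    # opt_cam_rot is assumed to be a (4,) real-first quaternion, matching build_rotation.
    quat = F.normalize(opt_cam_rot[None], dim=-1)
    return quaternion_to_matrix(quat)[0]  # (3, 3) rotation matrix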