
Tracking failure causes NaN

Open abhishek47kashyap opened this issue 1 year ago • 1 comments

Thanks for the work and the very clean code. I'm running run_slam.py on my own RGBD dataset, and while it got off to a good start, at one point tracking seems to have failed because cam_quad_err and cam_trans_err became NaN.

This kills the script because a singular matrix cannot be inverted:

Traceback (most recent call last):
  File "/home/abhishek/Code/Gaussian-SLAM/run_slam.py", line 111, in <module>
  File "/home/abhishek/Code/Gaussian-SLAM/src/entities/gaussian_slam.py", line 155, in run
    opt_dict = self.mapper.map(frame_id, estimate_c2w, gaussian_model, new_submap)
  File "/home/abhishek/Code/Gaussian-SLAM/src/entities/mapper.py", line 221, in map
    "render_settings": get_render_settings(
  File "/home/abhishek/Code/Gaussian-SLAM/src/utils/utils.py", line 93, in get_render_settings
    cam_center = torch.inverse(w2c)[:3, 3]
torch._C._LinAlgError: linalg.inv: The diagonal element 2 is zero, the inversion could not be completed because the input matrix is singular.
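
For reference, assuming w2c here is the usual 4x4 rigid-body world-to-camera transform, it has a closed-form inverse, and a finiteness check makes this failure mode explicit instead of crashing inside linalg.inv. A minimal sketch (the helper name and the error handling are my own, not part of the original get_render_settings):

import torch

def safe_cam_center(w2c: torch.Tensor) -> torch.Tensor:
    # For a rigid transform [R | t; 0 0 0 1], inverse(w2c)[:3, 3] equals -R^T @ t,
    # so the camera center can be computed without a general matrix inverse.
    if not torch.isfinite(w2c).all():
        # A NaN/Inf pose usually means tracking has already diverged.
        raise ValueError("w2c contains non-finite values; tracking likely diverged")
    R, t = w2c[:3, :3], w2c[:3, 3]
    return -R.T @ t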

I also noticed that color_loss and depth_loss were 0.00000, not just for the RGBD frame at which the NaN occurred but for several frames before it. Those earlier frames didn't have NaN in cam_quad_err and cam_trans_err, which I guess is what allowed gslam.run() to keep going. In the frames leading up to the NaN, the tracking errors are really high (for the frame right before it, cam_quad_err: 0.53914, cam_trans_err: 3293494.00000), and looking at the RGBD renders in the mapping_vis folder, the scene has completely fallen apart.

Scrolling back up to the logs where cam_trans_err still had saner values like 0.33, the deterioration in tracking seems to have started at a frame where the number of tracking iterations was doubled, presumably to cope with a "higher initial loss":

Higher initial loss, increasing num_iters to 400

From there on out, cam_trans_err kept growing: 0.58735, 0.98641, 1.39726, 2.21376, 3.07547, 3.95557, 4.83956, ...

abhishek47kashyap avatar Mar 27 '24 19:03 abhishek47kashyap

Hi @abhishek47kashyap, thanks for your interest in our work!

May I ask what type of scene the data was recorded in (room-scale or larger), and how large the motions between frames are in general? Tracking can fail under large motions. If you haven't already, you could try the config we used for the ScanNet++ dataset and see if that makes a difference.

unique1i avatar Mar 28 '24 10:03 unique1i

Hi, I have come across the same issue when working with the Replica dataset. Apart from the scene office0, I encounter this problem with all other scenes. I have carefully set up my environment according to your environment.yml and made sure that all Git repositories are on the specified branches. It seems that the problem is not caused by the environment, since the code does run. Thank you for any assistance. Alternatively, could you please provide the reconstructed mesh files for the Replica dataset?

Tracking frame 178
iter: 0, color_loss: 13612.00684, depth_loss: 819.75806 , cam_quad_err: nan, cam_trans_err: nan
iter: 20, color_loss: 0.00000, depth_loss: 0.00000 , cam_quad_err: nan, cam_trans_err: nan
iter: 40, color_loss: 0.00000, depth_loss: 0.00000 , cam_quad_err: nan, cam_trans_err: nan
frame_id: 178, cam_quad_err: nan, cam_trans_err: nan , cam_quad_err: nan, cam_trans_err: nan
Traceback (most recent call last):
  File "/home/t5820/yhy/Gaussian-SLAM/run_slam.py", line 109, in <module>
    gslam.run()
  File "/home/t5820/yhy/Gaussian-SLAM/src/entities/gaussian_slam.py", line 137, in run
    estimated_c2w = self.tracker.track(
  File "/home/t5820/yhy/Gaussian-SLAM/src/entities/tracker.py", line 138, in track
    render_settings = get_render_settings(
  File "/home/t5820/yhy/Gaussian-SLAM/src/utils/utils.py", line 93, in get_render_settings
    cam_center = torch.inverse(w2c)[:3, 3]
torch._C._LinAlgError: linalg.inv: The diagonal element 2 is zero, the inversion could not be completed because the input matrix is singular.

Or sometimes it shows:

Tracking frame 110
iter: 0, color_loss: 12819.21484, depth_loss: 742.16980 , cam_quad_err: 0.00011, cam_trans_err: 0.00277
iter: 20, color_loss: 11323.96875, depth_loss: 547.08099 , cam_quad_err: 0.00012, cam_trans_err: 0.00110
iter: 40, color_loss: 10657.52344, depth_loss: 585.00720 , cam_quad_err: 0.00014, cam_trans_err: 0.00095
frame_id: 110, cam_quad_err: 0.00013, cam_trans_err: 0.00103 , cam_quad_err: 0.00013, cam_trans_err: 0.00103

Mapping frame 110
Number of added points:  29225
Gaussian model size 829976
Traceback (most recent call last):
  File "/home/t5820/yhy/Gaussian-SLAM/run_slam.py", line 109, in <module>
    gslam.run()
  File "/home/t5820/yhy/Gaussian-SLAM/src/entities/gaussian_slam.py", line 152, in run
    opt_dict = self.mapper.map(frame_id, estimate_c2w, gaussian_model, new_submap)
  File "/home/t5820/yhy/Gaussian-SLAM/src/entities/mapper.py", line 236, in map
    opt_dict = self.optimize_submap([(frame_id, keyframe)] + self.keyframes, gaussian_model, max_iterations)
  File "/home/t5820/yhy/Gaussian-SLAM/src/entities/mapper.py", line 142, in optimize_submap
    image[:, mask], gt_image[:, mask]) + self.opt.lambda_dssim * (1.0 - ssim(image, gt_image))
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with "TORCH_USE_CUDA_DSA" to enable device-side assertions.

Howie-Ye avatar Jan 17 '25 04:01 Howie-Ye

Hey there, thanks for your interest in our work!

We are very confident that with the "normal" setup, our code is working on Replica and other reported datasets - we tested this many times.

By any chance, did you change any of the hyperparameters? Also, what are your open3d and Cuda versions?

I'm not sure we still have the meshes anywhere, but we will try to look for them.

VladimirYugay avatar Jan 19 '25 12:01 VladimirYugay

Thank you for your reply. I have successfully tested the code on the TUM and ScanNet datasets without modifying any of the hyperparameters. However, I encountered the two previously mentioned bugs alternately when running the code on the Replica dataset. Here is the environment:

(Gslam) t5820@t5820:~/yhy/Gaussian-SLAM$ conda list | grep cuda
cuda-cccl                 12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-command-line-tools   12.1.0                        0    nvidia/label/cuda-12.1.0
cuda-compiler             12.1.0                        0    nvidia/label/cuda-12.1.0
cuda-cudart               12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-cudart-dev           12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-cudart-static        12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-cuobjdump            12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-cupti                12.1.62                       0    nvidia/label/cuda-12.1.0
cuda-cupti-static         12.1.62                       0    nvidia/label/cuda-12.1.0
cuda-cuxxfilt             12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-documentation        12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-driver-dev           12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-gdb                  12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-libraries            12.1.0                        0    nvidia/label/cuda-12.1.0
cuda-libraries-dev        12.1.0                        0    nvidia/label/cuda-12.1.0
cuda-libraries-static     12.1.0                        0    nvidia/label/cuda-12.1.0
cuda-nsight               12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-nsight-compute       12.1.0                        0    nvidia/label/cuda-12.1.0
cuda-nvcc                 12.1.66                       0    nvidia/label/cuda-12.1.0
cuda-nvdisasm             12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-nvml-dev             12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-nvprof               12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-nvprune              12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-nvrtc                12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-nvrtc-dev            12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-nvrtc-static         12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-nvtx                 12.1.66                       0    nvidia/label/cuda-12.1.0
cuda-nvvp                 12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-opencl               12.1.56                       0    nvidia/label/cuda-12.1.0
cuda-opencl-dev           12.1.56                       0    nvidia/label/cuda-12.1.0
cuda-profiler-api         12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-runtime              12.1.0                        0    nvidia
cuda-sanitizer-api        12.1.55                       0    nvidia/label/cuda-12.1.0
cuda-toolkit              12.1.0                        0    nvidia/label/cuda-12.1.0
cuda-tools                12.1.0                        0    nvidia/label/cuda-12.1.0
cuda-visual-tools         12.1.0                        0    nvidia/label/cuda-12.1.0
faiss-gpu                 1.8.0           py3.10_h4c7d538_0_cuda12.1.1    pytorch
gds-tools                 1.6.0.25                      0    nvidia/label/cuda-12.1.0
libcublas                 12.1.0.26                     0    nvidia/label/cuda-12.1.0
libcublas-dev             12.1.0.26                     0    nvidia/label/cuda-12.1.0
libcublas-static          12.1.0.26                     0    nvidia/label/cuda-12.1.0
libcufft                  11.0.2.4                      0    nvidia/label/cuda-12.1.0
libcufft-dev              11.0.2.4                      0    nvidia/label/cuda-12.1.0
libcufft-static           11.0.2.4                      0    nvidia/label/cuda-12.1.0
libcufile                 1.6.0.25                      0    nvidia/label/cuda-12.1.0
libcufile-dev             1.6.0.25                      0    nvidia/label/cuda-12.1.0
libcufile-static          1.6.0.25                      0    nvidia/label/cuda-12.1.0
libcurand                 10.3.2.56                     0    nvidia/label/cuda-12.1.0
libcurand-dev             10.3.2.56                     0    nvidia/label/cuda-12.1.0
libcurand-static          10.3.2.56                     0    nvidia/label/cuda-12.1.0
libcusolver               11.4.4.55                     0    nvidia/label/cuda-12.1.0
libcusolver-dev           11.4.4.55                     0    nvidia/label/cuda-12.1.0
libcusolver-static        11.4.4.55                     0    nvidia/label/cuda-12.1.0
libcusparse               12.0.2.55                     0    nvidia/label/cuda-12.1.0
libcusparse-dev           12.0.2.55                     0    nvidia/label/cuda-12.1.0
libcusparse-static        12.0.2.55                     0    nvidia/label/cuda-12.1.0
libfaiss                  1.8.0           h046e95b_0_cuda12.1.1    pytorch
libnpp                    12.0.2.50                     0    nvidia/label/cuda-12.1.0
libnpp-dev                12.0.2.50                     0    nvidia/label/cuda-12.1.0
libnpp-static             12.0.2.50                     0    nvidia/label/cuda-12.1.0
libnvjitlink-dev          12.1.55                       0    nvidia/label/cuda-12.1.0
libnvjpeg                 12.1.0.39                     0    nvidia/label/cuda-12.1.0
libnvjpeg-dev             12.1.0.39                     0    nvidia/label/cuda-12.1.0
libnvjpeg-static          12.1.0.39                     0    nvidia/label/cuda-12.1.0
libnvvm-samples           12.1.55                       0    nvidia/label/cuda-12.1.0
nsight-compute            2023.1.0.15                   0    nvidia/label/cuda-12.1.0
pytorch                   2.1.2           py3.10_cuda12.1_cudnn8.9.2_0    pytorch
pytorch-cuda              12.1                 ha16c6d3_6    pytorch
pytorch-mutex             1.0                        cuda    pytorch

(Gslam) t5820@t5820:~/yhy/Gaussian-SLAM$ conda list | grep open3d
open3d                    0.18.0                   pypi_0    pypi

(Gslam) t5820@t5820:~/yhy/Gaussian-SLAM$ python --version
Python 3.10.16

(Gslam) t5820@t5820:~/yhy/Gaussian-SLAM$ pip list | grep numpy
numpy                      1.26.4

Howie-Ye avatar Jan 24 '25 09:01 Howie-Ye

Hmm, that's strange. Can you try running with gt_camera: True?

VladimirYugay avatar Jan 26 '25 16:01 VladimirYugay

The issue still persists. Additionally, I noticed that gt_camera is neither used in the code nor referenced in the update_config_with_args function, even though it is defined in the config and args.

Tracking frame 275
iter: 0, color_loss: 13580.01953, depth_loss: 1055.05933 , cam_quad_err: 0.00010, cam_trans_err: 0.00278
iter: 20, color_loss: 0.00000, depth_loss: 0.00000 , cam_quad_err: nan, cam_trans_err: nan
iter: 40, color_loss: 0.00000, depth_loss: 0.00000 , cam_quad_err: nan, cam_trans_err: nan
frame_id: 275, cam_quad_err: nan, cam_trans_err: nan , cam_quad_err: nan, cam_trans_err: nan

Mapping frame 275
Traceback (most recent call last):
  File "/home/t5820/yhy/Gaussian-SLAM/run_slam.py", line 109, in <module>
    gslam.run()
  File "/home/t5820/yhy/Gaussian-SLAM/src/entities/gaussian_slam.py", line 152, in run
    opt_dict = self.mapper.map(frame_id, estimate_c2w, gaussian_model, new_submap)
  File "/home/t5820/yhy/Gaussian-SLAM/src/entities/mapper.py", line 234, in map
    "render_settings": get_render_settings(
  File "/home/t5820/yhy/Gaussian-SLAM/src/utils/utils.py", line 93, in get_render_settings
    cam_center = torch.inverse(w2c)[:3, 3]
torch._C._LinAlgError: linalg.inv: The diagonal element 2 is zero, the inversion could not be completed because the input matrix is singular.

Howie-Ye avatar Jan 27 '25 03:01 Howie-Ye

Thanks for spotting the gt flag - it is not functional.

We've double-checked and could reproduce the results obtained on Replica without this error.

It is a bit hard to debug it this way. From the log you attached, I see that the colour and depth losses are very large and then go to NaN; they are only valid on the very first iteration. Can you double-check that the input RGBD maps for these frames are not corrupted? Also, can you check the colour and depth renders of the Gaussians at every iteration of the tracking step?
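
For the first check, something along these lines should be enough (a rough standalone sketch, assuming OpenCV is available and the depth maps are 16-bit PNGs; adjust the paths and depth_scale to how your copy of Replica is stored - 6553.5 is only the value commonly used for Replica renders):

import cv2
import numpy as np

def check_rgbd_frame(color_path, depth_path, depth_scale=6553.5):
    # Print basic sanity statistics for a single RGB-D frame.
    color = cv2.imread(color_path, cv2.IMREAD_COLOR)
    depth_raw = cv2.imread(depth_path, cv2.IMREAD_UNCHANGED)
    assert color is not None and depth_raw is not None, "frame failed to load"
    depth = depth_raw.astype(np.float32) / depth_scale
    print("color dtype/shape:", color.dtype, color.shape)
    print("depth min/max/mean:", depth.min(), depth.max(), depth.mean())
    print("zero-depth pixels:", np.count_nonzero(depth_raw == 0))
    print("non-finite depth pixels:", np.count_nonzero(~np.isfinite(depth)))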

VladimirYugay avatar Jan 31 '25 14:01 VladimirYugay

I believe I have identified the issue and modified the code to address it. In the tracking function, you optimize the quaternion and then convert it to a rotation matrix with the build_rotation function. However, for some reason (such as numerical instability), the resulting matrix does not always satisfy the properties of a rotation matrix. Therefore, I cached opt_cam_rot and opt_cam_trans to allow a rollback and added validation of the generated cur_rel_w2c.

Additionally, there is another issue: because the gradient is backpropagated and the optimizer steps first, the current total_loss actually corresponds to the w2c from the previous iteration. Consequently, the best_w2c selected here might be suboptimal.

https://github.com/VladimirYugay/Gaussian-SLAM/blob/eaec10d73ce7511563882b8856896e06d1f804e3/src/entities/tracker.py#L174-L197
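
To make the second point concrete: if the best pose is only recorded after optimizer.step(), the stored loss was computed with the pre-step parameters. Recording the candidate before stepping keeps them consistent; roughly like this (an illustration only, with placeholder names best_quat / best_trans, not a tested patch):

# evaluate, record, then step
total_loss = self.w_color_loss * color_loss + (1 - self.w_color_loss) * depth_loss
if total_loss.item() < current_min_loss:
    current_min_loss = total_loss.item()
    best_quat = F.normalize(opt_cam_rot[None].detach().clone())
    best_trans = opt_cam_trans.detach().clone()
total_loss.backward()
gaussian_model.optimizer.step()
gaussian_model.optimizer.zero_grad(set_to_none=True)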

Howie-Ye avatar Feb 06 '25 03:02 Howie-Ye

@Howie-Ye Hello, I have also encountered the same problem. Could you share the code you used to solve it? I would be very grateful for your help.

lee12332 avatar Feb 27 '25 14:02 lee12332

        # Snippet from the tracking loop in track(); current_min_loss, best_w2c,
        # render_settings, etc. come from the surrounding method. Cache the last
        # pose parameters that produced a valid rotation so we can roll back when
        # build_rotation returns a degenerate matrix.
        opt_rot_cache, opt_trans_cache = opt_cam_rot.clone(), opt_cam_trans.clone()

        for iter in range(num_iters):
            color_loss, depth_loss, _, _, _ = self.compute_losses(
                gaussian_model, render_settings, opt_cam_rot, opt_cam_trans, gt_color, gt_depth, depth_mask)

            total_loss = (self.w_color_loss * color_loss + (1 - self.w_color_loss) * depth_loss)
            total_loss.backward()
            gaussian_model.optimizer.step()
            gaussian_model.optimizer.zero_grad(set_to_none=True)

            with torch.no_grad():
                cur_quat, cur_trans = F.normalize(opt_cam_rot[None].clone().detach()), opt_cam_trans.clone().detach()
                rot_tmp = build_rotation(cur_quat)[0]
                if not self.is_rotation_matrix(rot_tmp):
                    # Roll back to the cached parameters. Note: rebinding the names to
                    # fresh nn.Parameter objects means the optimizer still holds the old
                    # tensors, so later optimizer steps will not update these new parameters.
                    opt_cam_trans = torch.nn.Parameter(opt_trans_cache)
                    opt_cam_rot = torch.nn.Parameter(opt_rot_cache)
                    cur_quat, cur_trans = F.normalize(opt_cam_rot[None].clone().detach()), opt_cam_trans.clone().detach()
                    rot_tmp = build_rotation(cur_quat)[0]
                else:
                    # Valid update: refresh the cache and keep the best pose so far.
                    opt_rot_cache, opt_trans_cache = opt_cam_rot.clone(), opt_cam_trans.clone()
                    if total_loss.item() < current_min_loss:
                        current_min_loss = total_loss.item()
                        best_w2c = torch.eye(4)
                        best_w2c[:3, :3] = build_rotation(F.normalize(opt_cam_rot[None].clone().detach().cpu()))[0]
                        best_w2c[:3, 3] = opt_cam_trans.clone().detach().cpu()

                # Relative world-to-camera transform for the current iteration.
                cur_rel_w2c = torch.eye(4)
                cur_rel_w2c[:3, :3] = rot_tmp
                cur_rel_w2c[:3, 3] = cur_trans

    def is_rotation_matrix(self, R):
        """Check that R is orthogonal with determinant +1, i.e. a proper rotation."""
        identity = torch.eye(3, device=R.device)
        RtR = torch.mm(R.t(), R)
        orthogonality_check = torch.allclose(RtR, identity, atol=1e-6)
        # Keep the reference tensor on R's device to avoid a CPU/CUDA mismatch.
        det_check = torch.isclose(torch.det(R), torch.tensor(1.0, device=R.device), atol=1e-6)
        return orthogonality_check and det_check

Try it and see if it solves your problem.

Howie-Ye avatar Mar 01 '25 08:03 Howie-Ye

@Howie-Ye I implemented a similar method and found that while it prevents the crashes caused by NaN, it also stops the optimizer's gradients from propagating, so tracking errors accumulate. I tried resetting the optimizer, but since the input is the same it produced the same NaN as the previous iteration. I also tried adding random perturbations to the odometry, with the same result.

lee12332 avatar Mar 01 '25 11:03 lee12332

It is unfortunate that you are seeing these issues - we didn't encounter them ourselves. If it is a numerical error, could you try using this library? It has standard functions to convert back and forth between quaternions, rotation matrices, etc.
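
As one example of such standard conversions (SciPy here; this may or may not be the library meant above, but the idea is the same):

import numpy as np
from scipy.spatial.transform import Rotation

quat = np.array([0.1, 0.2, 0.3, 0.9])          # (x, y, z, w) order; from_quat normalises it
R = Rotation.from_quat(quat).as_matrix()        # always a valid rotation matrix
quat_back = Rotation.from_matrix(R).as_quat()   # and back again

# same orthogonality / determinant check as is_rotation_matrix above
assert np.allclose(R.T @ R, np.eye(3), atol=1e-6)
assert np.isclose(np.linalg.det(R), 1.0)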

VladimirYugay avatar Mar 01 '25 15:03 VladimirYugay