Project a pixel accurately to 3D
Hi all,
I'm currently working on a project where I need to project a set of target pixels (those selected based on certain error metrics) into 3D space. The idea is to estimate their 3D locations and then retrieve nearby Gaussians within a certain radius. Since I have both camera intrinsics and extrinsics, my approach is to invert the standard projection pipeline — i.e., use the known pixel coordinates and depth values to back-project into 3D world coordinates.
Here's the function I'm using for this:
def _project_depth(self, u: int, v: int, depth: float, camera) -> np.ndarray:
    if depth <= 0:
        return None
    W, H = camera.image_width, camera.image_height
    # intrinsics
    fy = fov2focal(camera.FoVy, camera.image_height)
    fx = fov2focal(camera.FoVx, camera.image_width)
    cx, cy = W / 2.0, H / 2.0
    # camera space
    x_cam = (u - cx) * depth / fx
    y_cam = -(v - cy) * depth / fy
    z_cam = -depth
    # to homogeneous and world
    cam_coords = torch.tensor([x_cam, y_cam, z_cam, 1.0],
                              dtype=torch.float32,
                              device=camera.data_device)
    cam_to_world = camera.world_view_transform.inverse().to(torch.float32)
    world = cam_to_world @ cam_coords
    return world[:3].cpu().numpy()
However, I'm noticing that the 3D points produced by this projection (shown in gray in this figure) are significantly misaligned from the actual Gaussians (shown in blue). I expected them to land close to the surface of the scene (where Gaussians are located), but instead they're often far off.
After some investigation, I suspect the problem may lie in the depth values I'm using. I'm currently extracting them from the Gaussian renderer. Looking into the forward.cu code, I noticed the depth is computed as: expected_invdepth += depths[collected_id[j]] * alpha * T;
This is an alpha-blended inverse depth, which may not represent the true surface depth along the ray. I also experimented with using the depth of the first Gaussian hit with opacity > 0.1. That improved the projection a bit, but the result is still inaccurate.
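For context, here is roughly how I turn that inverse-depth map into per-pixel depth before back-projecting (a minimal sketch; invdepth_map stands for whatever tensor the rasterizer returns for inverse depth, and the output key name depends on the fork). Note that 1 / E[1/z] only approximates the surface depth wherever several Gaussians at different depths contribute to the same pixel, which may explain part of the error:

import torch

def invdepth_to_depth(invdepth_map: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Convert an alpha-blended inverse-depth map to metric depth.

    Pixels with (near-)zero accumulated inverse depth are set to 0 so the
    caller can skip them instead of back-projecting garbage.
    """
    valid = invdepth_map > eps
    depth = torch.zeros_like(invdepth_map)
    depth[valid] = 1.0 / invdepth_map[valid]
    return depth

# hypothetical usage; "depth" is whatever key your rasterizer fork exposes:
# depth_map = invdepth_to_depth(render_pkg["depth"])
# depth = float(depth_map[v, u])   # fed into _project_depth above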
At this point, I suspect that I may either be:
- Misinterpreting the depth values from the renderer,
- Missing a correction in the transformation chain, or
- Misaligning coordinate conventions (e.g., sign conventions, handedness, etc.).
I'd really appreciate any insights or suggestions. Have I misunderstood how to derive true surface depth from the renderer? Or is there an error in how I back-project pixels into world space?
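In case it is useful for debugging, one check I can do is to run the projection the other way: take a few Gaussian centers that should be visible, push them through the camera's own transforms, and see whether they land on sensible pixels. A rough sketch, assuming camera.full_proj_transform is the transposed world-to-clip matrix used by the rasterizer (so it acts on row vectors), and ignoring the half-pixel offset in the NDC-to-pixel mapping:

import torch

def project_points_to_pixels(xyz_world: torch.Tensor, camera) -> torch.Tensor:
    """Forward-project (N, 3) world-space points to (N, 2) pixel coordinates."""
    n = xyz_world.shape[0]
    ones = torch.ones(n, 1, dtype=xyz_world.dtype, device=xyz_world.device)
    xyz_hom = torch.cat([xyz_world, ones], dim=1)            # (N, 4)
    clip = xyz_hom @ camera.full_proj_transform              # row-vector convention
    ndc = clip[:, :3] / (clip[:, 3:4] + 1e-7)                # perspective divide
    u = (ndc[:, 0] + 1.0) * 0.5 * camera.image_width         # NDC [-1, 1] -> pixels
    v = (ndc[:, 1] + 1.0) * 0.5 * camera.image_height
    return torch.stack([u, v], dim=1)

# If these pixels do not match where the Gaussians actually appear in the
# render, the problem is in the transforms/conventions rather than the depth.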
Thanks in advance.
I think it's caused by a misaligned coordinate convention. The orientation of the coordinate axes in your implementation is as follows:
- x-axis: right
- y-axis: top
- z-axis: back
However, the local camera coordinate system described in the COLMAP documentation is slightly different. Therefore, if you are using data in the COLMAP format, the axes should be changed as follows:
- x-axis: right
- y-axis: bottom
- z-axis: front
x_cam = (u - cx) * depth / fx
y_cam = (v - cy) * depth / fy
z_cam = depth
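Put differently, the two camera-space formulations above differ only by flipping the y and z axes. A tiny numpy illustration with placeholder intrinsics and pixel values:

import numpy as np

# placeholder intrinsics / pixel / depth, just for illustration
fx, fy, cx, cy = 1000.0, 1000.0, 960.0, 540.0
u, v, depth = 1200, 300, 2.5

# camera space as in the original function (x right, y up, z backward)
p_gl = np.array([(u - cx) * depth / fx,
                 -(v - cy) * depth / fy,
                 -depth])

# COLMAP-style camera space (x right, y down, z forward)
p_colmap = np.array([(u - cx) * depth / fx,
                     (v - cy) * depth / fy,
                     depth])

# the two are related by flipping the y and z axes
flip_yz = np.diag([1.0, -1.0, -1.0])
assert np.allclose(flip_yz @ p_gl, p_colmap)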
@Jwonno Yes, that seems to be one of the issues as well. I changed my code as follows:
from utils.graphics_utils import fov2focal, geom_transform_points
import numpy as np
import torch

def _project_depth(self, u: int, v: int, depth: float, cam) -> np.ndarray:
    # intrinsics
    fx = fov2focal(cam.FoVx, cam.image_width)
    fy = fov2focal(cam.FoVy, cam.image_height)
    if hasattr(cam, 'cx') and hasattr(cam, 'cy'):
        # principal point stored at the original resolution; rescale it to the
        # resolution actually used for rendering
        original_w = 2044 * 2
        scale_factor = cam.image_width / original_w
        cx = cam.cx * scale_factor
        cy = cam.cy * scale_factor
    else:
        cx = cam.image_width * 0.5
        cy = cam.image_height * 0.5
    # to camera space (COLMAP convention: x right, y down, z forward)
    x_cam = (u - cx) * depth / fx
    y_cam = (v - cy) * depth / fy
    z_cam = depth  # positive depth
    # to world coordinates
    pts_cam = torch.tensor([[x_cam, y_cam, z_cam]],
                           dtype=torch.float32, device=cam.data_device)
    # camera-to-world transformation
    cam_to_world = cam.world_view_transform.inverse()
    pts_world = geom_transform_points(pts_cam, cam_to_world)
    return pts_world[0].detach().cpu().numpy()
After these changes, I get better results. As shown in this figure, the projected pixels now land very close to the Gaussians and in the same region. But with the same settings, I sometimes still get incorrect projections, as shown in this figure. I suspect this might be due to incorrect depths, since the depth estimates from the renderer are not perfect anyway. If you have any insights, that would be great.
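One mitigation I am considering (just a heuristic, not something from the repo): instead of reading a single pixel from the depth map, take the median over a small window of valid pixels, since the alpha-blended depth mixes foreground and background near silhouettes. A rough sketch:

import torch

def robust_pixel_depth(depth_map: torch.Tensor, u: int, v: int, win: int = 2) -> float:
    """Median depth in a small (2*win+1)^2 window around pixel (u, v).

    Only pixels with depth > 0 are considered; returns 0.0 if none are valid,
    so the caller can skip that pixel.
    """
    h, w = depth_map.shape[-2:]
    v0, v1 = max(0, v - win), min(h, v + win + 1)
    u0, u1 = max(0, u - win), min(w, u + win + 1)
    patch = depth_map[..., v0:v1, u0:u1].reshape(-1)
    patch = patch[patch > 0]
    if patch.numel() == 0:
        return 0.0
    return float(patch.median())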
Thanks for your help in advance!
As far as I know, cam.world_view_transform is stored as the transpose of the World-to-Camera transformation. Therefore, to perform a Camera-to-World transformation, you need to compute cam_to_world = cam.world_view_transform.T.inverse().
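To make the two conventions explicit, a small sketch (assuming geom_transform_points from utils.graphics_utils applies the matrix to row vectors, i.e. computes points_hom @ M):

import torch

def cam_to_world_points(pts_cam: torch.Tensor, world_view_transform: torch.Tensor) -> torch.Tensor:
    """Transform (N, 3) camera-space points to world space.

    Assumes world_view_transform is the transposed world-to-camera matrix,
    as stored on the reference Camera class.
    """
    C2W = world_view_transform.T.inverse()                  # column-vector camera-to-world
    ones = torch.ones(pts_cam.shape[0], 1, dtype=pts_cam.dtype, device=pts_cam.device)
    pts_hom = torch.cat([pts_cam, ones], dim=1)             # (N, 4)
    return (C2W @ pts_hom.T).T[:, :3]

# Equivalent row-vector form: the matrix to pass to geom_transform_points is
# C2W.T, which is simply world_view_transform.inverse(), so
# geom_transform_points(pts_cam, cam.world_view_transform.inverse())
# gives the same result.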