
3D to 2D bounding box projection

Alireza1044 opened this issue 1 year ago · 2 comments

Hi,

I wanted to project a 3D bounding box to 2D. However, when I visualize the bounding boxes in the image, they do not fit correctly. Here are the steps I took:

  1. Draw and visualize the bounding box in the nerfstudio viewer using this viser API: https://viser.studio/server/#viser.ViserServer.add_box
  2. Save the center, dimensions, and the Quaternion rotation of the bounding box
  3. Load the frame, the frame transform matrix (camera extrinsics), and the camera intrinsics using the default dataloader.
  4. To my understanding, the transform matrix has shape (3, 4) in camera-to-world convention, so I append [0, 0, 0, 1] as a last row and invert the result to get the world-to-camera transform.
  5. Compute the rotation matrix R (wrapped in a tensor so that R.T works below):
qw, qx, qy, qz = quaternion
R = torch.tensor([[1 - 2*qy**2 - 2*qz**2, 2*qx*qy - 2*qz*qw, 2*qx*qz + 2*qy*qw],
                  [2*qx*qy + 2*qz*qw, 1 - 2*qx**2 - 2*qz**2, 2*qy*qz - 2*qx*qw],
                  [2*qx*qz - 2*qy*qw, 2*qy*qz + 2*qx*qw, 1 - 2*qx**2 - 2*qy**2]])
  6. Compute the 8 corners of the bounding box:
w, h, d = dimensions / 2 
vertices = torch.tensor([
                [-w, -h, -d],
                [+w, -h, -d],
                [-w, +h, -d],
                [+w, +h, -d],
                [-w, -h, +d],
                [+w, -h, +d],
                [-w, +h, +d],
                [+w, +h, +d]])
  7. Compute the corners in world coordinates and project them to the image plane:
v_world = torch.matmul(vertices, R.T) + position
v_world_homogeneous = torch.cat([v_world, torch.ones(8, 1)], dim=1)
extrinsics = camera.camera_to_worlds
extrinsics = torch.cat((extrinsics, torch.tensor([[[0., 0., 0., 1.]]])), dim=1)  # now a (1, 4, 4) matrix
extrinsics = torch.linalg.inv(extrinsics)
K = camera.get_intrinsics_matrices()
K = torch.cat((K, torch.zeros(1, 3, 1)), dim=-1)  # pad with a zero column -> a (1, 3, 4) matrix
camera_parameters = K @ extrinsics
v_image = v_world_homogeneous @ camera_parameters[0].T  # drop the batch dimension before transposing
v_image = v_image[:, :2] / v_image[:, 2:3]
xmin, ymin = v_image[:,0].min(), v_image[:,1].min()
xmax, ymax = v_image[:,0].max(), v_image[:,1].max()

and xmin, ymin, xmax, ymax give the top-left and bottom-right corners of the projected bounding box. This is the result I get (see attached screenshot):

The projections are incorrect for all of the images in the dataset. Any idea what could be wrong?
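For reference, here are the steps above collected into one self-contained sketch. The function and variable names are mine, not the nerfstudio API, and the camera in the test values uses an OpenCV-style +Z-forward convention; with real nerfstudio `camera_to_worlds` matrices (OpenGL convention: +X right, +Y up, −Z forward) an additional axis flip may be needed before applying a pinhole K.

```python
import torch

def quat_to_rot(qw, qx, qy, qz):
    # Same quaternion-to-rotation formula as in step 5.
    return torch.tensor([
        [1 - 2*qy**2 - 2*qz**2, 2*qx*qy - 2*qz*qw, 2*qx*qz + 2*qy*qw],
        [2*qx*qy + 2*qz*qw, 1 - 2*qx**2 - 2*qz**2, 2*qy*qz - 2*qx*qw],
        [2*qx*qz - 2*qy*qw, 2*qy*qz + 2*qx*qw, 1 - 2*qx**2 - 2*qy**2]])

def project_box(position, dimensions, quat, c2w_3x4, K):
    # Half-extents and the 8 box corners in the box's local frame.
    w, h, d = (dimensions / 2).tolist()
    corners = torch.tensor([[sx * w, sy * h, sz * d]
                            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    R = quat_to_rot(*quat)
    v_world = corners @ R.T + position                       # (8, 3) world coords
    v_h = torch.cat([v_world, torch.ones(8, 1)], dim=1)      # (8, 4) homogeneous
    c2w = torch.cat([c2w_3x4, torch.tensor([[0., 0., 0., 1.]])], dim=0)
    w2c = torch.linalg.inv(c2w)                              # world-to-camera
    P = K @ w2c[:3, :]                                       # (3, 4) projection
    v = v_h @ P.T
    v = v[:, :2] / v[:, 2:3]                                 # perspective divide
    return v[:, 0].min(), v[:, 1].min(), v[:, 0].max(), v[:, 1].max()
```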

Alireza1044 avatar Mar 19 '24 08:03 Alireza1044

Potentially there's an issue with camera downscaling? The train dataset downscales cameras automatically, and maybe the resolution of the image you're visualizing doesn't match these downscaled intrinsics?
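If that's the case, the intrinsics can be rescaled to match the resolution you're drawing on. A minimal sketch (`rescale_intrinsics` is not a nerfstudio function, just an illustration of how fx/fy/cx/cy scale):

```python
import torch

def rescale_intrinsics(K, orig_hw, new_hw):
    """Rescale a 3x3 pinhole intrinsics matrix from orig (H, W) to new (H, W)."""
    sy = new_hw[0] / orig_hw[0]  # height scale
    sx = new_hw[1] / orig_hw[1]  # width scale
    K = K.clone()
    K[0, 0] *= sx  # fx scales with width
    K[0, 2] *= sx  # cx scales with width
    K[1, 1] *= sy  # fy scales with height
    K[1, 2] *= sy  # cy scales with height
    return K
```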

kerrj avatar Mar 22 '24 16:03 kerrj

I didn't specify the downscaling option during training. To my understanding, the trainer uses the original image resolution by default.
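One rough way to double-check this is to compare the loaded intrinsics against the image actually being drawn on. This heuristic assumes the principal point sits near the image centre, which holds for typical COLMAP-style captures but is not guaranteed in general:

```python
def intrinsics_match_image(K, height, width, tol=0.25):
    """Rough check that a 3x3 intrinsics matrix plausibly matches an image size.

    Heuristic only: assumes the principal point (cx, cy) is within `tol` of the
    image centre, as in typical COLMAP captures.
    """
    cx, cy = float(K[0][2]), float(K[1][2])
    return (abs(cx - width / 2) < tol * width
            and abs(cy - height / 2) < tol * height)
```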

Alireza1044 avatar Mar 22 '24 18:03 Alireza1044