DSGN2
Some questions about 3DGV, PSV and Front-Surface Depth Head
Hi! Thanks for sharing your awesome work, but I am quite confused about the coordinate systems in your code. Firstly, depth-wise cost volumes are built in the PSV:
cost_raw = self.build_cost(left_stereo_feat, right_stereo_feat,
                           None, None, downsampled_disp, psv_disps_channels.to(torch.int32))
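For context, my mental model of build_cost is a standard concat-based plane-sweep cost volume, something like the sketch below (my own simplified version, not necessarily the repo's exact implementation):

import torch

def build_cost_sketch(left_feat, right_feat, disps):
    # left_feat, right_feat: [B, C, H, W] stereo feature maps
    # disps: iterable of integer disparity hypotheses
    B, C, H, W = left_feat.shape
    cost = left_feat.new_zeros(B, 2 * C, len(disps), H, W)
    for i, d in enumerate(disps):
        d = int(d)
        cost[:, :C, i, :, :] = left_feat
        if d > 0:
            # align the right view with the left view by shifting it d pixels
            cost[:, C:, i, :, d:] = right_feat[:, :, :, :-d]
        else:
            cost[:, C:, i, :, :] = right_feat
    return cost  # [B, 2C, D, H, W]: a disparity(depth)-wise plane-sweep volume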
Then, a 3D mesh grid in pseudo-LiDAR coordinates is generated:
def prepare_coordinates_3d(self, point_cloud_range, voxel_size, grid_size, sample_rate=(1, 1, 1)):
    self.X_MIN, self.Y_MIN, self.Z_MIN = point_cloud_range[:3]
    self.X_MAX, self.Y_MAX, self.Z_MAX = point_cloud_range[3:]
    self.VOXEL_X_SIZE, self.VOXEL_Y_SIZE, self.VOXEL_Z_SIZE = voxel_size
    self.GRID_X_SIZE, self.GRID_Y_SIZE, self.GRID_Z_SIZE = grid_size.tolist()

    # optional supersampling of the grid
    self.VOXEL_X_SIZE /= sample_rate[0]
    self.VOXEL_Y_SIZE /= sample_rate[1]
    self.VOXEL_Z_SIZE /= sample_rate[2]
    self.GRID_X_SIZE *= sample_rate[0]
    self.GRID_Y_SIZE *= sample_rate[1]
    self.GRID_Z_SIZE *= sample_rate[2]

    # voxel-center coordinates along each axis
    zs = torch.linspace(self.Z_MIN + self.VOXEL_Z_SIZE / 2., self.Z_MAX - self.VOXEL_Z_SIZE / 2.,
                        self.GRID_Z_SIZE, dtype=torch.float32)
    ys = torch.linspace(self.Y_MIN + self.VOXEL_Y_SIZE / 2., self.Y_MAX - self.VOXEL_Y_SIZE / 2.,
                        self.GRID_Y_SIZE, dtype=torch.float32)
    xs = torch.linspace(self.X_MIN + self.VOXEL_X_SIZE / 2., self.X_MAX - self.VOXEL_X_SIZE / 2.,
                        self.GRID_X_SIZE, dtype=torch.float32)
    zs, ys, xs = torch.meshgrid(zs, ys, xs)
    coordinates_3d = torch.stack([xs, ys, zs], dim=-1)  # [GRID_Z, GRID_Y, GRID_X, 3], last dim is (x, y, z)
    self.coordinates_3d = coordinates_3d.float()
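To make the numbers concrete, this produces a grid like the following, assuming a KITTI-like detection range (the values here are just my guess, not necessarily this repo's config):

# assumed KITTI-like range: [x_min, y_min, z_min, x_max, y_max, z_max]
point_cloud_range = [2.0, -30.4, -3.0, 59.6, 30.4, 1.0]
voxel_size = [0.2, 0.2, 0.2]
grid_size = [int(round((point_cloud_range[i + 3] - point_cloud_range[i]) / voxel_size[i]))
             for i in range(3)]
print(grid_size)  # [288, 304, 20] voxels along (x, y, z)
# coordinates_3d then has shape [GRID_Z, GRID_Y, GRID_X, 3]; the last dimension
# holds the (x, y, z) center of each voxel.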
and the 3D world mesh grid is then mapped to the camera frustum space (I believe this is via the torch.cat operation?):
def compute_mapping(c3d, image_shape, calib_proj, depth_range, pose_transform=None):
    # project voxel centers to pixel coordinates (u, v)
    coord_img = project_rect_to_image(
        c3d,
        calib_proj,
        pose_transform)
    # append the depth so each voxel center carries (u, v, depth)
    coord_img = torch.cat(
        [coord_img, c3d[..., 2:]], dim=-1)
    crop_x1, crop_x2 = 0, image_shape[1]
    crop_y1, crop_y2 = 0, image_shape[0]
    norm_coord_img = (coord_img - torch.as_tensor([crop_x1, crop_y1, depth_range[0]], device=coord_img.device)) / torch.as_tensor(
        [crop_x2 - 1 - crop_x1, crop_y2 - 1 - crop_y1, depth_range[1] - depth_range[0]], device=coord_img.device)
    # rescale to [-1, 1] for grid_sample
    norm_coord_img = norm_coord_img * 2. - 1.
    return coord_img, norm_coord_img
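For reference, I assume project_rect_to_image is a standard pinhole projection with the 3x4 calibration matrix, roughly like this sketch (ignoring pose_transform; the actual function may differ), and the torch.cat then, as far as I can tell, only appends the metric depth so each voxel center carries (u, v, depth):

import torch

def project_rect_to_image_sketch(pts_3d, P):
    # pts_3d: [..., 3] points in the rectified camera frame; P: [3, 4] projection matrix
    ones = torch.ones_like(pts_3d[..., :1])
    pts_hom = torch.cat([pts_3d, ones], dim=-1)   # homogeneous coordinates
    uvw = torch.matmul(pts_hom, P.t())            # [..., 3]
    return uvw[..., :2] / uvw[..., 2:3]           # pixel coordinates (u, v)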
But I'm really confused by the following grid_sample operations, such as:
Voxel = F.grid_sample(out, norm_coord_imgs, align_corners=True)
and:
Voxel_2D = self.build_3d_geometry_volume(left_sem_feat, norm_coord_imgs, voxel_disps)
So this means that out (cost0) and left_sem_feat are both in image coordinates, and they are mapped to the normalized camera frustum space by filling the grid (for the cost volume, its values are sampled within the volume, while for the semantic features, the grid is filled by replication along the depth axis)?
After that, the Voxel is 'grid_sampled' again in the last depth-estimation stage:
PSV_from_3dgv = F.grid_sample(Voxel, norm_coordinates_psv_to_3d)
It would be so helpful if you could share more details about the coordinate systems and coordinate transformations in your code :confounded: Thank you so much.
Hi, thanks for your interest in the work.
Voxel = F.grid_sample(out, norm_coord_imgs, align_corners=True)
This transformation maps the PSV with coordinates (u, v, d) to a 3D voxel grid with coordinates (x, y, z).
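In terms of tensor shapes, that step looks roughly like this (toy sizes for illustration, not the exact code):

import torch
import torch.nn.functional as F

B, C, D, H, W = 1, 8, 12, 18, 60     # plane-sweep volume, indexed by (d, v, u)
Z, Y, X = 5, 19, 18                  # 3D voxel grid, indexed by (z, y, x)
out = torch.randn(B, C, D, H, W)
# per-voxel (u, v, d), normalized to [-1, 1]; last-dim order matches (W, H, D)
norm_coord_imgs = torch.rand(B, Z, Y, X, 3) * 2 - 1
Voxel = F.grid_sample(out, norm_coord_imgs, align_corners=True)
print(Voxel.shape)                   # [1, 8, 5, 19, 18] -> [B, C, Z, Y, X]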
Voxel_2D = self.build_3d_geometry_volume(left_sem_feat, norm_coord_imgs, voxel_disps)
This transformation maps the camera features (HxWxC) to a 3D voxel grid with coordinates (x, y, z).
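Roughly (toy sizes; the real build_3d_geometry_volume also takes voxel_disps, which is omitted here), the 2D features are sampled at each voxel's projected (u, v), so they are replicated along each viewing ray:

import torch
import torch.nn.functional as F

B, C, H, W = 1, 8, 18, 60
Z, Y, X = 5, 19, 18
left_sem_feat = torch.randn(B, C, H, W)
norm_coord_imgs = torch.rand(B, Z, Y, X, 3) * 2 - 1
uv = norm_coord_imgs[..., :2].reshape(B, Z * Y, X, 2)   # keep only (u, v), drop depth
Voxel_2D = F.grid_sample(left_sem_feat, uv, align_corners=True)
Voxel_2D = Voxel_2D.reshape(B, C, Z, Y, X)              # same 2D feature along each depth ray
print(Voxel_2D.shape)                                   # [1, 8, 5, 19, 18]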
PSV_from_3dgv = F.grid_sample(Voxel, norm_coordinates_psv_to_3d)
This transformation maps the 3D voxel grid with coordinates (x, y, z) back to the frustum space (u, v, d) to compute the depth map loss (Front-Surface Depth Head).
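Again roughly, in terms of shapes (toy sizes, not the exact code):

import torch
import torch.nn.functional as F

B, C = 1, 8
Z, Y, X = 5, 19, 18                  # voxel grid (z, y, x)
D, H, W = 12, 18, 60                 # frustum / PSV grid (d, v, u)
Voxel = torch.randn(B, C, Z, Y, X)
# per frustum cell (u, v, d): its (x, y, z) in the voxel grid, normalized to [-1, 1]
norm_coordinates_psv_to_3d = torch.rand(B, D, H, W, 3) * 2 - 1
PSV_from_3dgv = F.grid_sample(Voxel, norm_coordinates_psv_to_3d)
print(PSV_from_3dgv.shape)           # [1, 8, 12, 18, 60] -> [B, C, D, H, W]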
Thanks for your reply! It really helps a lot. So the 3D voxel grid coordinates (x, y, z) actually mean (depth, width, height) in pseudo-LiDAR coordinates?
Yes, just note that the voxel grid has shape [Channel, Height (z), Width (y), Depth (x)], where the LiDAR coordinate convention is [x (forward), y (left), z (up)].
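To illustrate the index order, mapping a voxel index (iz, iy, ix) of that [Channel, Z, Y, X] tensor back to a metric position looks roughly like this (the range and voxel-size numbers are just examples, not necessarily the repo's config):

X_MIN, Y_MIN, Z_MIN = 2.0, -30.4, -3.0           # example range minimums
VOXEL_X, VOXEL_Y, VOXEL_Z = 0.2, 0.2, 0.2        # example voxel sizes

def voxel_index_to_xyz(iz, iy, ix):
    # Voxel tensor is [Channel, Z, Y, X]; index (iz, iy, ix) -> metric (x, y, z)
    x = X_MIN + VOXEL_X / 2. + ix * VOXEL_X      # forward
    y = Y_MIN + VOXEL_Y / 2. + iy * VOXEL_Y      # left
    z = Z_MIN + VOXEL_Z / 2. + iz * VOXEL_Z      # up
    return x, y, z

print(voxel_index_to_xyz(0, 0, 0))               # (2.1, -30.3, -2.9): first voxel center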
By the way, have you tried training & evaluating with max_disp = 16? Surprisingly, I got results very close to training & evaluating with max_disp = 288. Might this be caused by overfitting?
Hi, I don't think it should give similar results. Could you show the shape of the generated 3DGV feature with these modifications? Perhaps it is due to a bug such as duplicated config definitions.