DSGN2
Some questions about 3DGV, PSV and Front-Surface Depth Head
Hi! Thanks for sharing your awesome work, but I am quite confused about the coordinate systems in your code. Firstly, depth-wise cost volumes are built in the PSV:
cost_raw = self.build_cost(left_stereo_feat, right_stereo_feat,
                           None, None, downsampled_disp, psv_disps_channels.to(torch.int32))
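For context, my mental model of build_cost is a standard concat-based plane-sweep cost volume, something like the sketch below (my own simplified version, not necessarily the repo's exact implementation):

import torch

def build_cost_sketch(left_feat, right_feat, disps):
    # left_feat, right_feat: [B, C, H, W] stereo feature maps
    # disps: iterable of integer disparity hypotheses
    B, C, H, W = left_feat.shape
    cost = left_feat.new_zeros(B, 2 * C, len(disps), H, W)
    for i, d in enumerate(disps):
        d = int(d)
        cost[:, :C, i, :, :] = left_feat
        if d > 0:
            # align the right view with the left view by shifting it d pixels
            cost[:, C:, i, :, d:] = right_feat[:, :, :, :-d]
        else:
            cost[:, C:, i, :, :] = right_feat
    return cost  # [B, 2C, D, H, W]: a disparity(depth)-wise plane-sweep volume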
Then, a 3D mesh grid in pseudo-LiDAR coordinates is generated:
def prepare_coordinates_3d(self, point_cloud_range, voxel_size, grid_size, sample_rate=(1, 1, 1)):
    self.X_MIN, self.Y_MIN, self.Z_MIN = point_cloud_range[:3]
    self.X_MAX, self.Y_MAX, self.Z_MAX = point_cloud_range[3:]
    self.VOXEL_X_SIZE, self.VOXEL_Y_SIZE, self.VOXEL_Z_SIZE = voxel_size
    self.GRID_X_SIZE, self.GRID_Y_SIZE, self.GRID_Z_SIZE = grid_size.tolist()

    # optional supersampling of the grid
    self.VOXEL_X_SIZE /= sample_rate[0]
    self.VOXEL_Y_SIZE /= sample_rate[1]
    self.VOXEL_Z_SIZE /= sample_rate[2]
    self.GRID_X_SIZE *= sample_rate[0]
    self.GRID_Y_SIZE *= sample_rate[1]
    self.GRID_Z_SIZE *= sample_rate[2]

    # voxel-center coordinates along each axis
    zs = torch.linspace(self.Z_MIN + self.VOXEL_Z_SIZE / 2., self.Z_MAX - self.VOXEL_Z_SIZE / 2.,
                        self.GRID_Z_SIZE, dtype=torch.float32)
    ys = torch.linspace(self.Y_MIN + self.VOXEL_Y_SIZE / 2., self.Y_MAX - self.VOXEL_Y_SIZE / 2.,
                        self.GRID_Y_SIZE, dtype=torch.float32)
    xs = torch.linspace(self.X_MIN + self.VOXEL_X_SIZE / 2., self.X_MAX - self.VOXEL_X_SIZE / 2.,
                        self.GRID_X_SIZE, dtype=torch.float32)
    zs, ys, xs = torch.meshgrid(zs, ys, xs)
    coordinates_3d = torch.stack([xs, ys, zs], dim=-1)  # [GRID_Z, GRID_Y, GRID_X, 3], last dim is (x, y, z)
    self.coordinates_3d = coordinates_3d.float()
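To make the numbers concrete, this produces a grid like the following, assuming a KITTI-like detection range (the values here are just my guess, not necessarily this repo's config):

# assumed KITTI-like range: [x_min, y_min, z_min, x_max, y_max, z_max]
point_cloud_range = [2.0, -30.4, -3.0, 59.6, 30.4, 1.0]
voxel_size = [0.2, 0.2, 0.2]
grid_size = [int(round((point_cloud_range[i + 3] - point_cloud_range[i]) / voxel_size[i]))
             for i in range(3)]
print(grid_size)  # [288, 304, 20] voxels along (x, y, z)
# coordinates_3d then has shape [GRID_Z, GRID_Y, GRID_X, 3]; the last dimension
# holds the (x, y, z) center of each voxel.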
and the 3D world mesh grid is then mapped to the camera frustum space (I believe this is via the torch.cat operation?):
def compute_mapping(c3d, image_shape, calib_proj, depth_range, pose_transform=None):
    # project voxel centers to pixel coordinates (u, v)
    coord_img = project_rect_to_image(
        c3d,
        calib_proj,
        pose_transform)
    # append the depth so each voxel center carries (u, v, depth)
    coord_img = torch.cat(
        [coord_img, c3d[..., 2:]], dim=-1)
    crop_x1, crop_x2 = 0, image_shape[1]
    crop_y1, crop_y2 = 0, image_shape[0]
    norm_coord_img = (coord_img - torch.as_tensor([crop_x1, crop_y1, depth_range[0]], device=coord_img.device)) / torch.as_tensor(
        [crop_x2 - 1 - crop_x1, crop_y2 - 1 - crop_y1, depth_range[1] - depth_range[0]], device=coord_img.device)
    # rescale to [-1, 1] for grid_sample
    norm_coord_img = norm_coord_img * 2. - 1.
    return coord_img, norm_coord_img
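For reference, I assume project_rect_to_image is a standard pinhole projection with the 3x4 calibration matrix, roughly like this sketch (ignoring pose_transform; the actual function may differ), and the torch.cat then, as far as I can tell, only appends the metric depth so each voxel center carries (u, v, depth):

import torch

def project_rect_to_image_sketch(pts_3d, P):
    # pts_3d: [..., 3] points in the rectified camera frame; P: [3, 4] projection matrix
    ones = torch.ones_like(pts_3d[..., :1])
    pts_hom = torch.cat([pts_3d, ones], dim=-1)   # homogeneous coordinates
    uvw = torch.matmul(pts_hom, P.t())            # [..., 3]
    return uvw[..., :2] / uvw[..., 2:3]           # pixel coordinates (u, v)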
But I'm really confused by the following grid_sample operations, such as:
Voxel = F.grid_sample(out, norm_coord_imgs, align_corners=True)
and:
Voxel_2D = self.build_3d_geometry_volume(left_sem_feat, norm_coord_imgs, voxel_disps)
So this means that out (cost0) and left_sem_feat are both in image coordinates, and they are mapped to the normalized camera frustum space by filling the grid (for the cost volume, its values are sampled within the volume, while for the semantic features, the grid is filled by replication along the depth axis)?
After that, the Voxel is 'grid_sampled' again in the last depth-estimation stage:
PSV_from_3dgv = F.grid_sample(Voxel, norm_coordinates_psv_to_3d)
It would be so helpful if you could share more details about the coordinate systems and coordinate transformations in your code :confounded: Thank you so much.
Hi, thanks for your interest in the work.
Voxel = F.grid_sample(out, norm_coord_imgs, align_corners=True)
This transformation maps the PSV with coordinates (u, v, d) to a 3D voxel grid with coordinates (x, y, z).
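In terms of tensor shapes, that step looks roughly like this (toy sizes for illustration, not the exact code):

import torch
import torch.nn.functional as F

B, C, D, H, W = 1, 8, 12, 18, 60     # plane-sweep volume, indexed by (d, v, u)
Z, Y, X = 5, 19, 18                  # 3D voxel grid, indexed by (z, y, x)
out = torch.randn(B, C, D, H, W)
# per-voxel (u, v, d), normalized to [-1, 1]; last-dim order matches (W, H, D)
norm_coord_imgs = torch.rand(B, Z, Y, X, 3) * 2 - 1
Voxel = F.grid_sample(out, norm_coord_imgs, align_corners=True)
print(Voxel.shape)                   # [1, 8, 5, 19, 18] -> [B, C, Z, Y, X]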
Voxel_2D = self.build_3d_geometry_volume(left_sem_feat, norm_coord_imgs, voxel_disps)
This transformation maps the camera features (HxWxC) to a 3D voxel grid with coordinates (x, y, z).
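Roughly (toy sizes; the real build_3d_geometry_volume also takes voxel_disps, which is omitted here), the 2D features are sampled at each voxel's projected (u, v), so they are replicated along each viewing ray:

import torch
import torch.nn.functional as F

B, C, H, W = 1, 8, 18, 60
Z, Y, X = 5, 19, 18
left_sem_feat = torch.randn(B, C, H, W)
norm_coord_imgs = torch.rand(B, Z, Y, X, 3) * 2 - 1
uv = norm_coord_imgs[..., :2].reshape(B, Z * Y, X, 2)   # keep only (u, v), drop depth
Voxel_2D = F.grid_sample(left_sem_feat, uv, align_corners=True)
Voxel_2D = Voxel_2D.reshape(B, C, Z, Y, X)              # same 2D feature along each depth ray
print(Voxel_2D.shape)                                   # [1, 8, 5, 19, 18]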
PSV_from_3dgv = F.grid_sample(Voxel, norm_coordinates_psv_to_3d)
This transformation maps the 3D voxel grid with coordinates (x, y, z) back to the frustum space (u, v, d) to compute the depth map loss (Front-Surface Depth Head).
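Again roughly, in terms of shapes (toy sizes, not the exact code):

import torch
import torch.nn.functional as F

B, C = 1, 8
Z, Y, X = 5, 19, 18                  # voxel grid (z, y, x)
D, H, W = 12, 18, 60                 # frustum / PSV grid (d, v, u)
Voxel = torch.randn(B, C, Z, Y, X)
# per frustum cell (u, v, d): its (x, y, z) in the voxel grid, normalized to [-1, 1]
norm_coordinates_psv_to_3d = torch.rand(B, D, H, W, 3) * 2 - 1
PSV_from_3dgv = F.grid_sample(Voxel, norm_coordinates_psv_to_3d)
print(PSV_from_3dgv.shape)           # [1, 8, 12, 18, 60] -> [B, C, D, H, W]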
Thanks for your reply! It really helps a lot. So the 3D voxel grid coordinates (x, y, z) actually mean (depth, width, height) in pseudo-LiDAR coordinates?
Yes, just note that the voxel grid has shape [Channel, Height (z), Width (y), Depth (x)], where the LiDAR coordinate convention is [x (forward), y (left), z (up)].
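To illustrate the index order, mapping a voxel index (iz, iy, ix) of that [Channel, Z, Y, X] tensor back to a metric position looks roughly like this (the range and voxel-size numbers are just examples, not necessarily the repo's config):

X_MIN, Y_MIN, Z_MIN = 2.0, -30.4, -3.0           # example range minimums
VOXEL_X, VOXEL_Y, VOXEL_Z = 0.2, 0.2, 0.2        # example voxel sizes

def voxel_index_to_xyz(iz, iy, ix):
    # Voxel tensor is [Channel, Z, Y, X]; index (iz, iy, ix) -> metric (x, y, z)
    x = X_MIN + VOXEL_X / 2. + ix * VOXEL_X      # forward
    y = Y_MIN + VOXEL_Y / 2. + iy * VOXEL_Y      # left
    z = Z_MIN + VOXEL_Z / 2. + iz * VOXEL_Z      # up
    return x, y, z

print(voxel_index_to_xyz(0, 0, 0))               # (2.1, -30.3, -2.9): first voxel center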
By the way, have you tried training & evaluating with max_disp = 16? Surprisingly, I got results very close to training & evaluating with max_disp = 288. Might this be caused by overfitting?
Hi, I don't think it should give similar results. Could you show the shape of the generated 3DGV feature with these modifications? Perhaps it is due to a bug such as duplicated config definitions.