How to convert depth to preliminary voxels
In the code, the voxels are read directly from a file, and the paper doesn't explain how they are generated.
In the provided depth ground truth, each depth map has a shape of 800×600. However, the input shape of the model in CompConvModel is (batch_size, 100, 100, 8). Why is there such a big difference between the two?
Is there an error in the function documentation here? The get_pvox function is supposed to read the processed voxels, so how are these voxels derived from depth?
Hi, you’re correct. The 2D depth images will be projected into 3D space using camera calibration parameters and then processed to create voxel occupancy data. The get_pvox function will then read these voxels calculated from the depth projections.
Thank you for your response! Can you describe how to project a 2D depth image into 3D space? Is there code for this? It would help me a lot!
Here is one code example; I hope it is helpful:
```python
import math
from math import pi, tan

import numpy as np

# Note: point_cut and nb_process_label are helper functions assumed to be defined
# elsewhere in the project (point cropping / grid indexing and voxel label assignment).


# Convert points to voxels:
def point2vox(p_xyz,
              max_bound=np.array([20, 20, 0.9]),
              min_bound=np.array([-20, -20, -2.5]),
              grid_size=np.array([100, 100, 8]),
              fill_label=0):
    """
    Convert point cloud coordinates to a voxel grid representation.

    Parameters:
        p_xyz (np.ndarray): Input point cloud (N, 3) with x, y, z coordinates.
        max_bound (np.ndarray): Maximum boundary for x, y, z coordinates (default: [20, 20, 0.9]).
        min_bound (np.ndarray): Minimum boundary for x, y, z coordinates (default: [-20, -20, -2.5]).
        grid_size (np.ndarray): Size of the voxel grid for each axis (default: [100, 100, 8]).
        fill_label (int): Default fill label for empty voxels (default: 0).

    Returns:
        np.ndarray: Processed label grid representing the voxelized point cloud.
    """
    # Compute the intervals for grid indexing
    crop_range = max_bound - min_bound
    intervals = crop_range / (grid_size - 1)

    # Assign a default occupancy label (2) to every input point, shape (N, 1)
    empty_vox = np.full((p_xyz.shape[0], 1), 2, dtype=int)

    # Obtain occupancy labels and grid indices
    occ_label, grid_indices_float = point_cut(empty_vox, p_xyz, min_bound, max_bound, intervals)
    grid_indices = np.floor(grid_indices_float).astype(int)

    # Create the processed label array
    processed_label = np.full(grid_size, fill_label, dtype=np.uint8)
    label_voxel_pairs = np.hstack([grid_indices, occ_label.reshape(-1, 1)])

    # Sort the pairs to ensure correct processing order
    label_voxel_pairs = label_voxel_pairs[np.lexsort((grid_indices[:, 0],
                                                      grid_indices[:, 1],
                                                      grid_indices[:, 2]))]

    # Apply occupancy labeling to the grid
    processed_label = nb_process_label(processed_label.copy(), label_voxel_pairs)
    return processed_label


# Convert depth maps to voxels:
def depth2vox(depth_list, lidar_height=2.3):
    """
    Convert a list of depth maps to a 3D voxel representation.

    Parameters:
        depth_list (list of np.ndarray): List of depth maps.
        lidar_height (float): Height offset of the lidar sensor (reference of the coordinate system).

    Returns:
        np.ndarray: Voxel grid representing the combined depth maps.
    """
    H, W = depth_list[0].shape[2], depth_list[0].shape[3]

    # Camera intrinsic matrix, assuming a 100-degree horizontal field of view
    k = np.array([[W / (2.0 * tan(100 * pi / 360.0)), 0, W / 2.0],
                  [0, W / (2.0 * tan(100 * pi / 360.0)), H / 2.0],
                  [0, 0, 1]])

    # 2D pixel coordinates
    pixel_length = W * H
    u_coord = np.tile(np.arange(W - 1, -1, -1), (H, 1)).reshape(pixel_length)
    v_coord = np.tile(np.arange(H - 1, -1, -1)[:, None], (1, W)).reshape(pixel_length)

    all_rotated_points = []
    for i_img, depth in enumerate(depth_list):
        depth = depth.detach().cpu().numpy()
        depth = np.argmax(depth[0], axis=0).astype(np.float32) * 0.4

        if i_img == 0:    # front
            theta_z, translation = 0, np.array([2.5, 0, 1.0 - lidar_height])
        elif i_img == 1:  # right
            theta_z, translation = -100, np.array([0, -0.3, 1.8 - lidar_height])
        elif i_img == 2:  # left
            theta_z, translation = 100, np.array([0, 0.3, 1.8 - lidar_height])
        elif i_img == 3:  # rear
            theta_z, translation = 180, np.array([-2.2, 0, 1.5 - lidar_height])
        else:
            raise ValueError("Too many images: expected at most 4.")

        # Rotation matrix for each orientation
        theta_z_rad = math.radians(theta_z)
        R_z = np.array([
            [np.cos(theta_z_rad), -np.sin(theta_z_rad), 0],
            [np.sin(theta_z_rad), np.cos(theta_z_rad), 0],
            [0, 0, 1]
        ])

        # Project depth to 3D points
        p2d = np.array([u_coord, v_coord, np.ones_like(u_coord)])
        p3d = np.dot(np.linalg.inv(k), p2d) * depth.flatten()
        p3d = p3d.T
        mask = p3d[:, 2] < 20  # Only keep points within 20 m
        p3d = p3d[mask]

        # Transform points with rotation and translation
        rotated_points = (R_z @ p3d.T).T + translation
        all_rotated_points.extend(rotated_points)

    # Convert all accumulated rotated points to a voxel grid
    vox = point2vox(np.array(all_rotated_points))
    vox[vox > 0] = 1  # Ensure occupancy value consistency
    return vox
```
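For completeness, a minimal usage sketch of how the functions above could be called, under the assumption that each entry of depth_list is a per-camera depth-bin output of shape (1, D, H, W) wrapped in a torch tensor (the code calls .detach().cpu().numpy()); the file names below are hypothetical placeholders, not files shipped with the project:

```python
import numpy as np
import torch

# Hypothetical file names, ordered front, right, left, rear; adapt to your own storage layout.
depth_files = [f"depth_camera{i}.npy" for i in range(4)]

depth_list = []
for path in depth_files:
    d = np.load(path)                        # assumed shape (1, D, H, W): per-pixel depth-bin scores
    depth_list.append(torch.from_numpy(d))   # depth2vox detaches and converts each entry itself

pvox = depth2vox(depth_list)                 # (100, 100, 8) occupancy grid
np.save("frame_number_pvox.npy", pvox)       # placeholder name for the pvox file read by get_pvox
```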
@rruisong Thanks a lot for the explanation! However, I have noticed that there are many constants in the code, such as:
```python
if i_img == 0:    # front
    theta_z, translation = 0, np.array([2.5, 0, 1.0 - lidar_height])
elif i_img == 1:  # right
    theta_z, translation = -100, np.array([0, -0.3, 1.8 - lidar_height])
elif i_img == 2:  # left
    theta_z, translation = 100, np.array([0, 0.3, 1.8 - lidar_height])
elif i_img == 3:  # rear
    theta_z, translation = 180, np.array([-2.2, 0, 1.5 - lidar_height])
```
How are these constants acquired?
These parameters are determined by the configuration of your sensors. The values depend on how your cameras are positioned on the vehicles.
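As a rough illustration only (the pose layout below is an assumption for this sketch, not the project's or OPV2V's actual config schema), the yaw and translation for each camera could be derived from per-camera and lidar pose entries like this:

```python
import numpy as np

def camera_constants(cam_pose, lidar_pose):
    """Derive (theta_z, translation) from assumed pose entries.

    Both poses are assumed to be [x, y, z, roll, yaw, pitch] in the vehicle frame,
    with angles in degrees; this layout is an assumption for illustration.
    """
    theta_z = cam_pose[4]                                  # camera yaw about the vertical axis
    translation = np.array([cam_pose[0] - lidar_pose[0],   # forward offset from the lidar
                            cam_pose[1] - lidar_pose[1],   # lateral offset from the lidar
                            cam_pose[2] - lidar_pose[2]])  # height offset from the lidar
    return theta_z, translation

# Made-up numbers that reproduce the "front" constants above, with the lidar at z = 2.3 m:
theta_z, t = camera_constants(cam_pose=[2.5, 0.0, 1.0, 0.0, 0.0, 0.0],
                              lidar_pose=[0.0, 0.0, 2.3, 0.0, 0.0, 0.0])
print(theta_z, t)   # 0.0 [ 2.5  0.  -1.3]
```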
I see, but how exactly would the configuration file work? What fields are required to get these parameters, and can they be found in the official OPV2V dataset?
> Hi, you're correct. The 2D depth images will be projected into 3D space using camera calibration parameters and then processed to create voxel occupancy data. The get_pvox function will then read these voxels calculated from the depth projections.
Hi, I am a little confused. In your project, what is the output of the depth network (which you recommend to be CaDDN)? Is it the 2D depth map you mentioned here, or the 3D voxel features extracted by the depth network? And the depth ground truth you provided is used as ground truth for the depth network, not as an input to this model, is that right?
Hi, Thank you for providing this excellent resource! I’m having some difficulty understanding the process for generating the frame_number_pvox.npy files used by get_pvox. Specifically, I have a few questions regarding the provided code:
Could you clarify the shape and contents of the depth_list variable? Is depth_list derived from previously stored files, or is it generated directly during the inference process of CaDDN (the recommended depth network)? If it’s generated directly during inference, from which layer of CaDDN is this data extracted? Understanding this would really help me grasp the purpose of the following code snippets:
```python
H, W = depth_list[0].shape[2], depth_list[0].shape[3]  # Why use shape[2] and shape[3]?
```
and
```python
for i_img, depth in enumerate(depth_list):
    depth = depth.detach().cpu().numpy()
    depth = np.argmax(depth[0], axis=0).astype(np.float32) * 0.4  # Why take depth[0] instead of just depth?
```
Additionally, would it be possible to share the relevant code and configuration files for training CaDDN on the Semantic-OPV2V depth dataset as proposed in your paper? This would greatly help streamline the process and save time that might otherwise be spent reconfiguring training for CaDDN.
Thank you so much for your support and for making this work accessible! I really appreciate your time and guidance.
> Hi, Thank you for providing this excellent resource! I'm having some difficulty understanding the process for generating the frame_number_pvox.npy files used by get_pvox. [...]
Hello, I am currently attempting to reproduce this method and have encountered a similar issue. I was wondering if you have managed to resolve it and if we could discuss it further?
@muchen1021 Sure, I'd welcome any further discussion. I sort of managed to resolve the problem, but not with 100% confidence, because the results I got after training the completion network for a single vehicle don't look promising. As you can see from the example below, all the predictions produced by the completion network I trained are basically the same; compared with the ground truth, they omit all obstacles on the ground surface. I still haven't found out what is causing the problem, but it is quite possible that the code given above in this issue is not being used correctly to organize the depth data.
Groundtruth:
Prediction:
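To make this comparison less hand-wavy, here is a small sanity check I would suggest (not part of the project; pred.npy and gt.npy are placeholder file names for one predicted and one ground-truth voxel grid):

```python
import numpy as np

pred = np.load("pred.npy")   # placeholder: predicted (100, 100, 8) label grid
gt = np.load("gt.npy")       # placeholder: ground-truth (100, 100, 8) label grid

print("predicted class counts:   ", dict(zip(*np.unique(pred, return_counts=True))))
print("ground-truth class counts:", dict(zip(*np.unique(gt, return_counts=True))))

# Binary occupancy IoU: how much of the occupied space is recovered at all
p_occ, g_occ = pred > 0, gt > 0
iou = np.logical_and(p_occ, g_occ).sum() / max(np.logical_or(p_occ, g_occ).sum(), 1)
print(f"occupancy IoU: {iou:.3f}")
```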
@GO-Loc-GO Thank you for your reply. I have not been able to reproduce the CaDDN part of the work so far, so I attempted to use the ground-truth depthcamera{i}.npy files to generate the pvox.npy file. However, the code provided above does not seem to implement the depth-to-voxel conversion correctly; there appear to be issues with the coordinate system transformation.
From my understanding, CaDDN produces a 2D depth map, but I'm not entirely sure whether this interpretation is correct. I would greatly appreciate it if you could share how you approached this part of the task.
@muchen1021 I also used the ground truth for convenience to generate the pvox.npy files, but I didn't find any problem with the coordinate system transformation; it looks okay to me. Please point out anything I missed.
To generate the pvox files, I intuitively assumed that the required input depth_list is a list stacking the 4 depth maps corresponding to the 4 cameras installed on the ego. Below is the modified depth2vox function I used. I made a few changes by commenting out some lines of code whose purpose is unclear to me, and I also annotated the steps related to the coordinate transformation; you can check them for clarity and correctness.
```python
# Convert depth maps to voxels:
def depth2vox(depth_list, lidar_height=2.3):
    """
    Convert a list of depth maps to a 3D voxel representation.

    Parameters:
        depth_list (list of np.ndarray): List of depth maps.
        lidar_height (float): Height offset of the lidar sensor (reference of the coordinate system).

    Returns:
        np.ndarray: Voxel grid representing the combined depth maps.
    """
    # depth_list: (4, C, D, H, W)
    # H, W = depth_list[0].shape[2], depth_list[0].shape[3]
    H, W = depth_list[0].shape[0], depth_list[0].shape[1]

    # Camera calibration matrix (intrinsics of the camera)
    k = np.array([[W / (2.0 * tan(100 * pi / 360.0)), 0, W / 2.0],
                  [0, W / (2.0 * tan(100 * pi / 360.0)), H / 2.0],
                  [0, 0, 1]])

    # 2D pixel coordinates
    pixel_length = W * H
    u_coord = np.tile(np.arange(W - 1, -1, -1), (H, 1)).reshape(pixel_length)
    v_coord = np.tile(np.arange(H - 1, -1, -1)[:, None], (1, W)).reshape(pixel_length)

    all_rotated_points = []
    for i_img, depth in enumerate(depth_list):
        # depth = depth.detach().cpu().numpy()  # depth: (C, D, H, W)
        # Uniform Discretization (UD), 0.4 m per bin? Refer to the CaDDN and Center3D papers
        # depth = np.argmax(depth[0], axis=0).astype(np.float32) * 0.4  # depth[0]: (D, H, W) ---> depth: (H, W)

        # theta_z represents the yaw angle horizontally; translation is the offset of the camera
        # from the origin specified by the lidar's position
        if i_img == 0:    # front
            theta_z, translation = 0, np.array([2.5, 0, 1.0 - lidar_height])
        elif i_img == 1:  # right
            theta_z, translation = -100, np.array([0, -0.3, 1.8 - lidar_height])
        elif i_img == 2:  # left
            theta_z, translation = 100, np.array([0, 0.3, 1.8 - lidar_height])
        elif i_img == 3:  # rear
            theta_z, translation = 180, np.array([-2.2, 0, 1.5 - lidar_height])
        else:
            raise ValueError("Too many images: expected at most 4.")

        # Rotation matrix for each orientation
        theta_z_rad = math.radians(theta_z)
        R_z = np.array([
            [np.cos(theta_z_rad), -np.sin(theta_z_rad), 0],
            [np.sin(theta_z_rad), np.cos(theta_z_rad), 0],
            [0, 0, 1]
        ])

        # Project depth to 3D points
        p2d = np.array([u_coord, v_coord, np.ones_like(u_coord)])  # Generate image coordinate system indices
        p3d = np.dot(np.linalg.inv(k), p2d) * depth.flatten()      # Transform from image coordinates to camera coordinates
        p3d = p3d.T
        mask = p3d[:, 2] < 20  # Only keep points within 20 m
        p3d = p3d[mask]

        # Transform points with rotation and translation (extrinsics of the camera)
        rotated_points = (R_z @ p3d.T).T + translation
        all_rotated_points.extend(rotated_points)

    # Convert all accumulated rotated points to a voxel grid
    vox = point2vox(np.array(all_rotated_points))
    vox[vox > 0] = 1  # Ensure occupancy value consistency
    return vox
```
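Under that assumption, I would call the modified function with the four ground-truth depth maps directly (the file names below are placeholders for the provided depth ground truth, and the shapes are my assumption):

```python
import numpy as np

# Assumed order: front, right, left, rear; each map assumed to be (600, 800), in metres.
depth_list = [np.load(f"depth_camera{i}.npy") for i in range(4)]
pvox = depth2vox(depth_list)              # (100, 100, 8) occupancy grid
np.save("frame_number_pvox.npy", pvox)    # placeholder name for the pvox file read by get_pvox
```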
I think the problem is more likely related to the shape and contents of depth_list in the depth2vox function of the provided code. Specifically, as I mentioned above, I can't understand the purpose of the following lines in the function depth2vox:
```python
H, W = depth_list[0].shape[2], depth_list[0].shape[3]  # Why use shape[2] and shape[3]?
```
and
```python
for i_img, depth in enumerate(depth_list):
    depth = depth.detach().cpu().numpy()
    depth = np.argmax(depth[0], axis=0).astype(np.float32) * 0.4  # Why take depth[0] instead of just depth?
```
I speculate that depth_list may be a mid-layer output taken directly from CaDDN. The input depth_list might be more complex than the intuitive interpretation of a simple stack of 4 depth maps for the 4 cameras installed on the ego. However, I haven't looked into CaDDN to verify this further.
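To make that speculation concrete: if each depth_list entry were a CaDDN-style depth-bin volume of shape (1, D, H, W), the two original lines would reduce it to an ordinary metric depth map. A small sketch under that assumption (the bin count D is made up; the 0.4 m bin size comes from the code comment above):

```python
import numpy as np
import torch

D, H, W = 50, 600, 800                        # assumed number of depth bins and image size
depth_logits = torch.rand(1, D, H, W)         # stand-in for a per-pixel depth-bin distribution

depth = depth_logits.detach().cpu().numpy()   # (1, D, H, W); would explain shape[2] and shape[3] above
depth = np.argmax(depth[0], axis=0).astype(np.float32) * 0.4   # most likely bin per pixel -> metres
print(depth.shape)                            # (600, 800): an ordinary 2D depth map
```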
I might also add here that the training code of the project doesn't seem to be very well written, as it does not support training with a batch size of more than 1.
Specifically, if you try to modify the batch size parameter for train_data_loader and val_data_loader in the config file located at CoHFF/cohff_opv2v/config/train_config/opv2v_dataset.py, exceptions will be raised related to tensor dimension mismatches and loss computation errors.
This problem stems from the hard-coded logic in main_train.py, which selects only the first element of the batch for training no matter how many samples the batch contains:
```python
pbar = tqdm(enumerate(train_dataloader))
for i_iter, [input_data_dict, co_processed_label, ego_processed_label, co_grid_ind, ego_lidar_processed_label, ego_history_data_dict] in pbar:
    input_data_dict = input_data_dict[0]  # Hard-coded: only the first element of the batch is used, so changing batch_size has no effect on training
```
@GO-Loc-GO The other main reason for setting the batch size to 1 (I guess) is that the number of CAVs involved in the collaboration varies across samples, so they cannot be directly processed in a batch. To keep the code simple, the author mainly uses a batch size of 1 together with gradient accumulation for training.
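For reference, the generic version of that pattern looks roughly like this (toy stand-ins so the snippet runs on its own; this is not the project's actual training loop):

```python
import torch
import torch.nn as nn

# Toy model, optimizer and single-sample "dataloader"; the real ones come from the project.
model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_dataloader = [(torch.randn(1, 4), torch.randn(1, 1)) for _ in range(16)]  # batch_size = 1

accumulation_steps = 8
optimizer.zero_grad()
for i_iter, (x, y) in enumerate(train_dataloader):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()   # scale so the accumulated gradient matches a mean over the virtual batch
    if (i_iter + 1) % accumulation_steps == 0:
        optimizer.step()                     # one parameter update per accumulation_steps single-sample iterations
        optimizer.zero_grad()
```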
@GO-Loc-GO Hello, I still have doubts about this part of the code.
```python
# Transform points with rotation and translation (extrinsics of the camera)
rotated_points = (R_z @ p3d.T).T + translation
all_rotated_points.extend(rotated_points)
```
The rotation and translation here are meant to merge the four camera coordinate systems so that the points can be combined. But rotating about the z axis (which is the depth direction in camera coordinates) makes it impossible to unify the coordinate systems, so I changed R_z to R_y. Maybe I misunderstood this part of the code.
I also couldn't understand the lines in the depth2vox function you mentioned, so I deleted that part as well.
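For what it's worth, one possible explanation for the mismatch (an assumption on my side, not something confirmed by the authors): p3d is still in the pinhole camera convention (x right, y down, z forward), so neither R_z nor R_y alone can map it into the ego frame. A fixed axis permutation from camera axes to ego axes would normally be applied first, followed by the yaw rotation about the vertical axis. A sketch of that convention:

```python
import numpy as np

# Permute pinhole camera axes (x right, y down, z forward) into an ego/LiDAR-style frame
# (x forward, y left, z up). This convention is an assumption, not taken from the project's code.
cam_to_ego_axes = np.array([
    [0.0,  0.0, 1.0],   # ego x  <-  camera z (forward)
    [-1.0, 0.0, 0.0],   # ego y  <-  -camera x (right becomes left)
    [0.0, -1.0, 0.0],   # ego z  <-  -camera y (down becomes up)
])

def cam_points_to_ego(p3d_cam, yaw_deg, translation):
    """Axis permutation first, then a yaw rotation about the (now vertical) z axis."""
    yaw = np.radians(yaw_deg)
    R_yaw = np.array([
        [np.cos(yaw), -np.sin(yaw), 0.0],
        [np.sin(yaw),  np.cos(yaw), 0.0],
        [0.0,          0.0,         1.0],
    ])
    return (R_yaw @ cam_to_ego_axes @ p3d_cam.T).T + translation

# Example: a point 10 m straight ahead of the front camera ends up 10 m further forward in the ego frame.
p = np.array([[0.0, 0.0, 10.0]])
print(cam_points_to_ego(p, yaw_deg=0, translation=np.array([2.5, 0.0, 1.0 - 2.3])))  # ~[12.5, 0, -1.3]
```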
> @muchen1021 I also used the ground truth for convenience to generate the pvox.npy files, but I didn't find any problem with the coordinate system transformation [...] I speculate depth_list may be a mid-layer output taken directly from CaDDN. [...]
Hello, I am trying to reproduce this work, but I don't know how to get the pvox file and I can't find the relevant code in the project. If possible, could you share how to obtain the pvox.npy file? I would be very grateful.
> @GO-Loc-GO Hello, I still have doubts about this part of the code. [...] I changed R_z to R_y. Maybe I misunderstood this part of the code. [...]
Did you solve this problem? I am also facing it and cannot obtain the occupancy ground truth. If you could share the complete code for generating the pvox file, I would be very grateful.
> @muchen1021 I also used the ground truth for convenience to generate the pvox.npy files [...] I speculate depth_list may be a mid-layer output taken directly from CaDDN. [...]
Has anyone solved this problem? I want to reproduce this method and I am facing the same issue. There doesn't seem to be any code for training the depth network to obtain the depth estimation result and for projecting it into voxels.
@GO-Loc-GO If the depth ground truth is used to produce the pvox file, that does not seem reasonable, since the pvox file is used as an input to the model.