
Preprocessing of depth image for model-based inference

Open Strauss-Wen opened this issue 1 year ago • 9 comments

Hi, I have prepared a .obj model of a teacup and recorded a sequence of RGB and depth images. All the needed files are organized like the provided demo data. However, the inference for the teacup is totally incorrect.

Maybe there is something wrong with my depth image preprocessing. I have a RealSense D435i camera and have scaled the depth images like the LINEMOD dataset (a pixel value corresponds to millimeters in the real world).

Could you specify the expected preprocessing of the depth images?

Strauss-Wen avatar Apr 10 '24 13:04 Strauss-Wen

https://github.com/NVlabs/FoundationPose/issues/25#issuecomment-2037719050 can you try the suggestions there?

wenbowen123 avatar Apr 10 '24 17:04 wenbowen123

Hi @Trevor-wen,

maybe I can help, as I also had to solve some problems when first using FoundationPose on my own objects. :sweat_smile:

1. Make sure that your CAD model uses meters as mesh units

Unlike other methods that use mm as the mesh unit, FoundationPose uses meters.

Example if the mesh units are wrong (in mm):

https://github.com/NVlabs/FoundationPose/assets/42057206/981beb58-bb5c-4b6b-a5c7-ddf09c91c425
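
A quick way to check this is to look at the bounding-box extents of the mesh, for example with trimesh: a teacup should be on the order of 0.1 per axis in meters, while values around 100 indicate the mesh is still in millimeters. A minimal sketch (the file name is just a placeholder):

import trimesh

# Load the CAD model (replace the path with your own file).
mesh = trimesh.load("teacup.obj", force="mesh")

# Extents are the side lengths of the axis-aligned bounding box,
# expressed in whatever unit the mesh was exported with.
print("extents:", mesh.extents)

# Heuristic: a hand-sized object larger than ~10 units is very likely in mm.
if mesh.extents.max() > 10:
    print("Mesh is probably in millimeters; scale it by 0.001.")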

2. RGB and depth images must be aligned

The captured RGB and depth frames must be aligned. How this is done depends on the sensor used.

The following Python script is adapted from librealsense and can be used to record aligned and unaligned frames with a RealSense camera (I used it for the examples):

record_realsense_foundationpose.py
## License: Apache 2.0. See LICENSE file in root directory.
## Copyright(c) 2017 Intel Corporation. All Rights Reserved.

#####################################################
##              Align Depth to Color               ##
#####################################################

import pyrealsense2 as rs
import numpy as np
import cv2
import json
import time
import os

# Create a pipeline
pipeline = rs.pipeline()

# Create a config and configure the pipeline to stream
# different resolutions of color and depth streams
config = rs.config()

# Get device product line for setting a supporting resolution
pipeline_wrapper = rs.pipeline_wrapper(pipeline)
pipeline_profile = config.resolve(pipeline_wrapper)
device = pipeline_profile.get_device()
device_product_line = str(device.get_info(rs.camera_info.product_line))

found_rgb = False
for s in device.sensors:
    if s.get_info(rs.camera_info.name) == "RGB Camera":
        found_rgb = True
        break
if not found_rgb:
    print("The demo requires Depth camera with Color sensor")
    exit(0)

config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)

if device_product_line == "L500":
    config.enable_stream(rs.stream.color, 960, 540, rs.format.bgr8, 30)
else:
    config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)

# Start streaming
profile = pipeline.start(config)

# Getting the depth sensor's depth scale (see rs-align example for explanation)
depth_sensor = profile.get_device().first_depth_sensor()
depth_scale = depth_sensor.get_depth_scale()
print("Depth Scale is: ", depth_scale)

# We will be removing the background of objects more than
#  clipping_distance_in_meters meters away
clipping_distance_in_meters = 1  # 1 meter
clipping_distance = clipping_distance_in_meters / depth_scale

# Create an align object
# rs.align allows us to perform alignment of depth frames to others frames
# The "align_to" is the stream type to which we plan to align depth frames.
align_to = rs.stream.color
align = rs.align(align_to)

# Get the absolute path to the subfolder
script_dir = os.path.dirname(os.path.abspath(__file__))
subfolder_depth = os.path.join(script_dir, "out/depth")
subfolder_rgb = os.path.join(script_dir, "out/rgb")
subfolder_depth_unaligned = os.path.join(script_dir, "out/depth_unaligned")
subfolder_rgb_unaligned = os.path.join(script_dir, "out/rgb_unaligned")

# Check if the subfolder exists, and create it if it does not
if not os.path.exists(subfolder_depth):
    os.makedirs(subfolder_depth)
if not os.path.exists(subfolder_rgb):
    os.makedirs(subfolder_rgb)
if not os.path.exists(subfolder_depth_unaligned):
    os.makedirs(subfolder_depth_unaligned)
if not os.path.exists(subfolder_rgb_unaligned):
    os.makedirs(subfolder_rgb_unaligned)

RecordStream = False

# Streaming loop
try:
    while True:
        # Get frameset of color and depth
        frames = pipeline.wait_for_frames()
        # frames.get_depth_frame() is a 640x480 depth image (as configured above)

        # Align the depth frame to color frame
        aligned_frames = align.process(frames)

        # Get aligned frames
        aligned_depth_frame = (
            aligned_frames.get_depth_frame()
        )  # aligned_depth_frame is a 640x480 depth image
        color_frame = aligned_frames.get_color_frame()

        unaligned_depth_frame = frames.get_depth_frame()
        unaligned_color_frame = frames.get_color_frame()

        # Get intrinsics from the aligned depth frame
        intrinsics = aligned_depth_frame.profile.as_video_stream_profile().intrinsics

        # Validate that both frames are valid
        if not aligned_depth_frame or not color_frame:
            continue

        depth_image = np.asanyarray(aligned_depth_frame.get_data())
        color_image = np.asanyarray(color_frame.get_data())

        # Remove background - Set pixels further than clipping_distance to grey
        grey_color = 153
        depth_image_3d = np.dstack(
            (depth_image, depth_image, depth_image)
        )  # depth image is 1 channel, color is 3 channels
        bg_removed = np.where(
            (depth_image_3d > clipping_distance) | (depth_image_3d <= 0),
            grey_color,
            color_image,
        )

        unaligned_depth_image = np.asanyarray(unaligned_depth_frame.get_data())
        unaligned_rgb_image = np.asanyarray(unaligned_color_frame.get_data())

        # Render images:
        #   depth align to color on left
        #   depth on right
        depth_colormap = cv2.applyColorMap(
            cv2.convertScaleAbs(depth_image, alpha=0.03), cv2.COLORMAP_JET
        )
        images = np.hstack((color_image, depth_colormap))

        cv2.namedWindow("Align Example", cv2.WINDOW_NORMAL)
        cv2.imshow("Align Example", images)

        key = cv2.waitKey(1)

        # Start saving the frames if space is pressed once until it is pressed again
        if key & 0xFF == ord(" "):
            if not RecordStream:
                time.sleep(0.2)
                RecordStream = True

                with open(os.path.join(script_dir, "out/cam_K.txt"), "w") as f:
                    f.write(f"{intrinsics.fx} {0.0} {intrinsics.ppx}\n")
                    f.write(f"{0.0} {intrinsics.fy} {intrinsics.ppy}\n")
                    f.write(f"{0.0} {0.0} {1.0}\n")

                print("Recording started")
            else:
                RecordStream = False
                print("Recording stopped")

        if RecordStream:
            framename = int(round(time.time() * 1000))

            # Define the path to the image file within the subfolder
            image_path_depth = os.path.join(subfolder_depth, f"{framename}.png")
            image_path_rgb = os.path.join(subfolder_rgb, f"{framename}.png")
            image_path_depth_unaligned = os.path.join(subfolder_depth_unaligned, f"{framename}.png")
            image_path_rgb_unaligned = os.path.join(subfolder_rgb_unaligned, f"{framename}.png")

            cv2.imwrite(image_path_depth, depth_image)
            cv2.imwrite(image_path_rgb, color_image)
            cv2.imwrite(image_path_depth_unaligned, unaligned_depth_image)
            cv2.imwrite(image_path_rgb_unaligned, unaligned_rgb_image)

        # Press esc or 'q' to close the image window
        if key & 0xFF == ord("q") or key == 27:

            cv2.destroyAllWindows()

            break
finally:
    pipeline.stop()

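To double-check the recorded depth before running inference, you can open one of the saved PNGs and inspect the value range. This is only a rough sanity check and assumes (like the demo data) that depth is stored as a 16-bit PNG where one unit corresponds to one millimeter, i.e. the default RealSense depth scale of 0.001:

import cv2
import numpy as np

# The file name is a placeholder; point it at one of your recorded frames.
depth = cv2.imread("out/depth/your_frame.png", cv2.IMREAD_UNCHANGED)

print("dtype:", depth.dtype)                 # expected: uint16
print("min/max:", depth.min(), depth.max())  # 0 means no measurement

# With a depth scale of 0.001, a tabletop scene should yield values of
# roughly a few hundred to a few thousand (millimeters).
valid = depth[depth > 0]
print("median depth in meters:", np.median(valid) * 0.001)
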
Example if the RGB and depth frames are not aligned properly:

https://github.com/NVlabs/FoundationPose/assets/42057206/92f7f4c0-0733-4e8b-b96b-fd5d1e0f3cca

3. Wrong sensor intrinsics

Make sure that you use the correct intrinsics in the following format (RealSense with pyrealsense2, see code above):

intrinsics.fx 0.0 intrinsics.ppx
0.0 intrinsics.fy intrinsics.ppy
0.0 0.0 1.0
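
For reference, the cam_K.txt written by the script above can be read back into a 3x3 matrix like this (a minimal sketch using NumPy, not tied to any particular loader in the repo):

import numpy as np

# cam_K.txt contains three rows of three whitespace-separated values.
K = np.loadtxt("cam_K.txt").reshape(3, 3)

fx, fy = K[0, 0], K[1, 1]  # focal lengths in pixels
cx, cy = K[0, 2], K[1, 2]  # principal point in pixels
print(K)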

Example of very wrong intrinsics:

https://github.com/NVlabs/FoundationPose/assets/42057206/48f398bf-66bd-4bfa-bbd6-165cef7f525c

4. Impressive pose estimation when everything is done right

Example if everything works fine:

https://github.com/NVlabs/FoundationPose/assets/42057206/c4ec2ca1-29eb-40a1-8811-310a1be58ec7

Edit: @wenbowen123 was quicker, but maybe it still helps. :smiley:

savidini avatar Apr 10 '24 17:04 savidini

@savidini Could you tell me how to change the units of the YCB-Video object CAD models?

ethanshenze avatar Apr 11 '24 02:04 ethanshenze

@Ethan-shen-lab you can use software like MeshLab to manually scale down objects: Import > Filters > Normals, ... > Transform: Scale, ... > Check Uniform and scale one axis > Export

You can also use a package such as trimesh to do this in Python; see the simple example below:

import trimesh

# Load the mesh and convert from millimeters to meters (scale by 0.001).
mesh = trimesh.load('path_to_your_file.obj')
mesh.apply_scale(0.001)
mesh.export('scaled_down_file.obj')

I am not sure what you want to do (run_ycb_video.py or run_demo.py with custom data?). However, it seems that there are two versions of the YCB-V dataset: the BOP version is converted to millimeters, while the original version uses meters, so I suppose the latter should work in FoundationPose without any changes.

savidini avatar Apr 11 '24 09:04 savidini

Thank you very much! I have another question: I used the RealSense code you provided to collect image data, but after running run_demo.py I got this error: RuntimeError: Cuda error: 2[cudaMalloc(&m_gpuPtr, bytes);]. I suspect there is a problem with my own data, because everything runs successfully with the officially provided image data. Can you share the data you collected? I want to verify whether my guess is correct.

ethanshenze avatar Apr 11 '24 11:04 ethanshenze

@ethanshenze Example data with the Rubik's Cube used for the last video in my comment above. (Setup: RTX4090 and Docker with CUDA 12.1 as described in #27)

savidini avatar Apr 11 '24 12:04 savidini

I really appreciate your help!

ethanshenze avatar Apr 11 '24 12:04 ethanshenze

@savidini Thanks a lot for the detailed instruction! I will try them as soon as possible. Really appreciate for the help!

Strauss-Wen avatar Apr 11 '24 15:04 Strauss-Wen

You can use either MeshLab or Blender to scale down by 0.001 along each axis, @ethanshenze.

monajalal avatar Apr 11 '24 17:04 monajalal

Copy that! I will try it, thank you very much~

ethanshenze avatar Apr 12 '24 03:04 ethanshenze

@savidini @wenbowen123 Thanks for your instructions before!

However, after trying the steps, I am facing an issue where the bounding box is too small and does not follow the object (a banana).

Could you help me with it and tell me what I could do? Thanks in advance!

More Context:

The screenshot and files in the debug folder are attached below.

debug folder: https://drive.google.com/drive/folders/1bDfOyJq7fFKyRybSMrxkKaP9_HN6Xr6N?usp=sharing

image

I am using the banana CAD model from the official YCB-V website, via a link from Bowen's previous repo https://github.com/wenbowen123/iros20-6d-pose-tracking

image

I have checked the scene_complete.ply file in a visualizer and it seems fine to me (so I assume the depth images are OK?)

image

I have checked the model.obj file and it seems fine to me

image

aThinkingNeal avatar Apr 16 '24 06:04 aThinkingNeal

@aThinkingNeal can you maybe provide your intrinsics/cam_K.txt file? The model seems fine, but I was able to reproduce a somewhat similar behavior using wrong units for the intrinsics:

https://github.com/NVlabs/FoundationPose/assets/42057206/c29c59b3-3eaa-4fc2-9edf-8484c601d8a1
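
If you want a quick plausibility check on the values themselves (just a heuristic, assuming a 640x480 recording): the principal point should be near the image center and the focal lengths should be a few hundred pixels, not tiny values in meters or huge values in millimeters of sensor size:

import numpy as np

K = np.loadtxt("cam_K.txt").reshape(3, 3)
w, h = 640, 480  # adjust to your recording resolution

fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
assert 100 < fx < 2000 and 100 < fy < 2000, "focal lengths look implausible"
assert abs(cx - w / 2) < w / 4 and abs(cy - h / 2) < h / 4, "principal point far from image center"
print("intrinsics look plausible")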

savidini avatar Apr 16 '24 10:04 savidini

@savidini Thanks for the advice! I have calibrated the cam_K.txt file and the bounding box size is back to normal.

However, the pose estimation seems to be wandering off after the first frame. I am using an apple as the object and put the debug info in the following folder:

https://drive.google.com/drive/folders/1BxYq0pwn7ROqTe-IhSWgmc2gw_cGN8WF?usp=sharing

The behavior is shown in the images below: the first frame is fine, then the bounding box starts to drift away, even though the object is not moving at all:

image

image

image

aThinkingNeal avatar Apr 16 '24 12:04 aThinkingNeal

@aThinkingNeal from the images of your debug output, it looks like there are several "skips" in the images you recorded, i.e. after img_1.png and img_54.png. Is this correct? And if so, are you running the inference on all images with the default run_demo.py?

Below is a video showing the effect of "skipping" frames, resulting in sudden changes in the tracked object:

https://github.com/NVlabs/FoundationPose/assets/42057206/949bcd4c-37a1-4d39-9a9b-b7f3b91dfc76

Apparently this cannot be handled by FoundationPose's tracking (although the pose eventually becomes correct again if enough frames are provided after a skip). This behavior is somewhat different from other methods that do not use tracking but instead re-run the pose estimation on every frame.

If my assumption is correct, but you can't avoid these skips in your input, see #37 for running the pose estimation on every frame.
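
To illustrate the difference, the idea from #37 is to skip the tracking branch and redo the registration on every frame, which also requires an object mask for every frame. Below is only a rough sketch, assuming a reader and estimator set up as in run_demo.py; treat the method names and keyword arguments as approximations of the demo script, not as a verified API reference:

# Sketch only: `est`, `reader` and the per-frame masks are assumed to exist
# and to behave like their counterparts in run_demo.py.
for i in range(len(reader.color_files)):
    color = reader.get_color(i)
    depth = reader.get_depth(i)

    # Default demo behavior: register on frame 0, then track frame-to-frame.
    # Per-frame re-registration instead (slower, but robust to skipped frames):
    mask = reader.get_mask(i).astype(bool)
    pose = est.register(K=reader.K, rgb=color, depth=depth, ob_mask=mask, iteration=5)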

savidini avatar Apr 16 '24 16:04 savidini

Your banana model seems to have the wrong scale: it's 2 meters long.

wenbowen123 avatar Apr 16 '24 21:04 wenbowen123

@aThinkingNeal the file names need to be padded with leading zeros to a fixed number of digits (see our example data).
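
If you recorded with timestamp file names (e.g. with the script above), something like this can rename them to zero-padded sequential names; the folder paths and padding width are just placeholders:

import os

# Both folders must contain exactly the same set of frames so that the
# rgb/depth pairs keep matching indices after renaming.
for folder in ("out/rgb", "out/depth"):
    files = sorted(os.listdir(folder))  # millisecond timestamps sort chronologically
    for i, name in enumerate(files):
        ext = os.path.splitext(name)[1]
        # e.g. 1712841523123.png -> 0000000000.png, 0000000001.png, ...
        os.rename(os.path.join(folder, name),
                  os.path.join(folder, f"{i:010d}{ext}"))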

wenbowen123 avatar Apr 16 '24 21:04 wenbowen123

@wenbowen123 @savidini Thanks for the help!

I think my problem is solved by:

  1. adjusting the file names
  2. calibrating the cam_K.txt file
  3. using the .obj model provided by the official YCB-V dataset

image

Now I am facing another issue about how to get an accurate custom CAD model, but I will ask that in a separate issue.

Thanks again for your help!

aThinkingNeal avatar Apr 17 '24 08:04 aThinkingNeal

Hello! Thanks for the code, it has helped me to test it with my data. A couple of questions:

  1. When using my RealSense D415, the pose estimation performs well, but the bounding box vibrates a little. I get the RGB, depth, and intrinsics from the code, and the SAM mask and mesh from the object's CAD model. What could the cause be?

  2. When I switch to my RealSense L515 (LiDAR) and repeat exactly the same steps, the generated bounding box is very small and does not seem to follow the movements the way the other camera did. Is something missing?

Thanks in advance!

Zialo avatar Sep 03 '24 10:09 Zialo