Metric3D
How to improve the metric depth value?
Hi, I need to get the distance to an object. I gathered a small dataset of outdoor images taken at varied distances to test with, and the model's results vary a lot. My questions are:
What is the best practice to improve the results? I already calibrated the camera and have the intrinsics; what else can I do? Also, the model output is clipped to stay under a certain value to account for the sky, correct?
My images are w=3024, h=4032. I have included the code I use to generate the depth and the visualization below.
The red dot marks a point 4 m from the camera, but I got 11.45 m from the vit-small model and 20.764 m from the vit-large model, which is obviously way off. Another test I ran at 2 m produced 1.3 m for vit_small and 1.5 m for vit_large, which is still not ideal but workable.
import cv2
import torch

rgb_file = '/content/MG_5u_4m.jpg'
input_size = (616, 1064)  # (h, w) expected by the ViT models
intrinsic = [3000, 3000, 1529.95662, 1976.17563]  # camera intrinsics [fx, fy, cx, cy] in pixels
padding_values = [123.675, 116.28, 103.53]  # constant border fill (ImageNet mean)
# Load the image and convert BGR -> RGB
rgb_origin = cv2.imread(rgb_file)[:, :, ::-1]
# Adjust input size to fit the model
h, w = rgb_origin.shape[:2]
scale = min(input_size[0] / h, input_size[1] / w)
rgb = cv2.resize(rgb_origin, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_LINEAR)
# Scale intrinsic parameters
intrinsic = [intrinsic[0] * scale, intrinsic[1] * scale, intrinsic[2] * scale, intrinsic[3] * scale]
# Padding
h, w = rgb.shape[:2]
pad_h = input_size[0] - h
pad_w = input_size[1] - w
pad_h_half = pad_h // 2
pad_w_half = pad_w // 2
rgb = cv2.copyMakeBorder(rgb, pad_h_half, pad_h - pad_h_half, pad_w_half, pad_w - pad_w_half, cv2.BORDER_CONSTANT, value=padding_values)
pad_info = [pad_h_half, pad_h - pad_h_half, pad_w_half, pad_w - pad_w_half]
# Normalize with the ImageNet mean/std (0-255 scale)
mean = torch.tensor([123.675, 116.28, 103.53]).float()[:, None, None]
std = torch.tensor([58.395, 57.12, 57.375]).float()[:, None, None]
rgb = torch.from_numpy(rgb.transpose((2, 0, 1))).float()
rgb = torch.div((rgb - mean), std)
rgb = rgb[None, :, :, :].cuda()
# Load the model from torch hub and move it to the GPU
model = torch.hub.load('yvanyin/metric3d', 'metric3d_vit_small', pretrain=True)
model.cuda().eval()
# Perform inference
with torch.no_grad():
    pred_depth, confidence, output_dict = model.inference({'input': rgb})
# Remove the padding
pred_depth = pred_depth.squeeze()
pred_depth = pred_depth[pad_info[0] : pred_depth.shape[0] - pad_info[1], pad_info[2] : pred_depth.shape[1] - pad_info[3]]
# upsample to original size
pred_depth = torch.nn.functional.interpolate(pred_depth[None, None, :, :], rgb_origin.shape[:2], mode='bilinear').squeeze()
###################### canonical camera space ######################
#### de-canonical transform
canonical_to_real_scale = intrinsic[0] / 1000.0 # 1000.0 is the focal length of canonical camera
pred_depth = pred_depth * canonical_to_real_scale # now the depth is metric
pred_depth = torch.clamp(pred_depth, 0, 300)  # clip the far range (e.g. sky) to 300 m
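For reference, this is roughly how the value at the red dot is read back out after the transform above; the (u, v) coordinates are placeholders, not the actual red-dot location:
# Sample the metric depth at the marked pixel (sketch).
# (u, v) are placeholder coordinates for the red dot in the ORIGINAL image;
# pred_depth was already upsampled back to rgb_origin.shape[:2] above.
u, v = 1512, 2016  # placeholder (column, row) of the red dot
depth_at_marker = pred_depth[v, u].item()  # metres, after the de-canonical scaling
print(f"Predicted distance at ({u}, {v}): {depth_at_marker:.2f} m")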
Any info on how to get this closer to the real-world scale is appreciated.
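Since the de-canonical scale is intrinsic[0] / 1000.0, any error in fx scales the predicted depth by the same factor, so one check is to compare the calibrated fx against the value implied by the lens spec. A rough sketch, where the focal length and sensor width are placeholder numbers rather than my camera's actual specs:
# Cross-check fx (sketch): fx_px = focal_mm / sensor_width_mm * image_width_px
focal_mm = 4.25        # physical focal length of the lens (placeholder)
sensor_width_mm = 4.8  # sensor width along the 3024-pixel axis (placeholder)
image_width_px = 3024
fx_from_spec = focal_mm / sensor_width_mm * image_width_px
print(f"fx implied by spec: {fx_from_spec:.1f} px vs calibrated fx: 3000 px")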