Depth-Anything
How to get depth (inverse disparity)
Hi there, loved your work. I want to ask how to obtain depth (the relative distance from the camera to the object). As far as I can tell, Depth Anything produces disparity (not depth). I tried to convert disparity to depth with depth = 1/disparity, but the result does not look correct. Looking forward to your answers!
I've implemented a disparity -> depth remapping with a dynamic range of 100 (i.e. the farthest pixel is 100x farther than the closest pixel on screen):
import tempfile

import cv2
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image

# `model`, `transform`, `predict_depth` and `DEVICE` are the ones already defined in the Gradio app.

def on_submit(image):
    original_image = image.copy()
    h, w = image.shape[:2]

    # Preprocess and run the model; the raw output is relative inverse depth (disparity-like)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) / 255.0
    image = transform({'image': image})['image']
    image = torch.from_numpy(image).unsqueeze(0).to(DEVICE)
    depth = predict_depth(model, image)
    depth = F.interpolate(depth[None], (h, w), mode='bilinear', align_corners=False)[0, 0]

    disp1 = depth.cpu().numpy()

    # Clamp the farthest depth to 100x of the nearest
    range1 = np.minimum(disp1.max() / (disp1.min() + 0.001), 100.0)
    max1 = disp1.max()
    min1 = max1 / range1
    depth1 = 1 / np.maximum(disp1, min1)
    depth1 = (depth1 - depth1.min()) / (depth1.max() - depth1.min())
    # depth1 = np.power(depth1, 1.0 / 2.2)  # optional gamma correction

    # Scale to the uint16 range and write a 16-bit PNG
    depth1 = depth1 * 65535.0
    raw_depth = Image.fromarray(depth1.astype('uint16'))
    tmp = tempfile.NamedTemporaryFile(suffix=f'_{depth1.min():.6f}_{depth1.max():.6f}_.png', delete=False)
    raw_depth.save(tmp.name)

    # Colorized disparity preview for the UI
    depth2 = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
    depth2 = depth2.cpu().numpy().astype(np.uint8)
    colored_depth = cv2.applyColorMap(depth2, cv2.COLORMAP_INFERNO)[:, :, ::-1]

    return [(original_image, colored_depth), tmp.name]
You may also uncomment the gamma correction line, depending on how your 3D app imports these images. It's still far from perfect, but better than the default. You can also adjust the depth range to your liking or turn it into a Gradio input; my experience with Python is quite limited, so I didn't know how to do that!
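For reference, exposing the depth range as a Gradio input could look roughly like the sketch below. This is only an illustrative sketch: the component names (input_image, output_gallery, output_file, submit) are made up here and are not necessarily the ones the official app.py uses, and the body of on_submit is the code from the post above with the hard-coded 100.0 replaced by the slider value.

import gradio as gr

# Hypothetical wiring sketch; component names are placeholders, not the app's real ones.
def on_submit(image, depth_range):
    # ...same body as the post above, except the hard-coded 100.0 becomes the slider value:
    # range1 = np.minimum(disp1.max() / (disp1.min() + 0.001), depth_range)
    ...

with gr.Blocks() as demo:
    input_image = gr.Image(label="Input image")
    range_slider = gr.Slider(minimum=10, maximum=1000, value=100, step=10,
                             label="Depth dynamic range (far / near ratio)")
    submit = gr.Button("Compute depth")
    output_gallery = gr.Gallery(label="Original / colored depth")
    output_file = gr.File(label="16-bit depth PNG")
    submit.click(on_submit, inputs=[input_image, range_slider],
                 outputs=[output_gallery, output_file])

demo.launch()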
In this line: depth1 = depth1 * 65535.0, where do you get the value 65535.0 from? Also, do we not need to account for camera intrinsics anywhere?
It's the maximum value of uint16 (2^16 - 1); the normalized depth value (0...1) is scaled by it before being written to the 16-bit PNG file (raw_depth etc.).
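As an aside, the round trip is easy to check in isolation; a minimal sketch (standalone, not taken from the repo):

import numpy as np
from PIL import Image

# Normalized depth in [0, 1]; a synthetic ramp stands in for a real depth map here.
depth_norm = np.linspace(0.0, 1.0, 256 * 256, dtype=np.float32).reshape(256, 256)

# Encode: scale to the uint16 range [0, 65535] and write a 16-bit grayscale PNG.
Image.fromarray((depth_norm * 65535.0).astype('uint16')).save('depth16.png')

# Decode: read the PNG back and divide by 65535 to recover values in [0, 1].
decoded = np.asarray(Image.open('depth16.png'), dtype=np.float32) / 65535.0
print(np.abs(decoded - depth_norm).max())  # only quantization error, about 1/65535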
The paper mentions that they align the ground truth and predictions by computing a scale and shift; do we not need to do the same (in reverse, of course)?
I haven't read their paper, but based on their code the model outputs a disparity value as a float in (0...1), which I believe doesn't carry any specific metric meaning; it's just the network's approximation of perceived closeness. I don't know whether it's absolute or adjusted disparity, so I simply compute my absolute depth as 1 / disparity. The multiplier here is 1 on purpose, since any camera intrinsics would only contribute a linear factor that can be accounted for after the depth map is exported, during mesh extrusion etc. Then I clamp 1 / disparity to the dynamic range to handle the 1 / 0 case.
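Distilled down to just that remapping step, the logic is roughly the following sketch (same math as the code above, with the 100x limit exposed as a parameter):

import numpy as np

def disparity_to_relative_depth(disp, dynamic_range=100.0):
    # Clamp disparity from below so the farthest pixel ends up at most
    # `dynamic_range` times farther than the nearest one (and 1/0 can't happen).
    range_ = np.minimum(disp.max() / (disp.min() + 0.001), dynamic_range)
    min_disp = disp.max() / range_
    depth = 1.0 / np.maximum(disp, min_disp)
    # Normalize to [0, 1]; the absolute metric scale is unknown anyway.
    return (depth - depth.min()) / (depth.max() - depth.min())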
I'm working on a web tool to visualize these things conveniently; please stay tuned for a public release later this week :) https://twitter.com/antmatrosov/status/1749343136713752847
I see, do the outputs you get by following this process seem accurate?
I'm testing mostly on synthetic data, like Midjourney images, so there is no ground truth to compare my results against. But from my observation it's always possible to find a combination of FOV, near/far planes and power parameters that gives a geometrically sound extrusion out of the generated depth. That's where we still need a human brain to find the camera intrinsics hidden inside that NN approximation.
I think disparity-to-metric-depth conversion is only possible if you know parameters like the focal length; otherwise you only get scaled relative depth, so you can't compare it with ground truth.
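For what it's worth, the alignment the paper's evaluation refers to is usually done the other way around: fit a per-image scale and shift between the predicted relative output and the ground-truth inverse depth by least squares, then compare. A sketch of that fit, assuming you have ground truth and a validity mask (the names here are illustrative, not from the repo):

import numpy as np

def fit_scale_shift(pred, gt_inv_depth, mask):
    # Least-squares scale s and shift t such that s * pred + t ≈ gt_inv_depth
    # on the valid pixels; pred is the model's relative (inverse-depth-like) output.
    x = pred[mask].astype(np.float64)
    y = gt_inv_depth[mask].astype(np.float64)
    A = np.stack([x, np.ones_like(x)], axis=1)   # design matrix [x, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

# Metric depth on valid pixels then follows as 1 / (s * pred + t).

Note that knowing the focal length alone doesn't remove the ambiguity: for a calibrated stereo pair you could use depth = focal * baseline / disparity, but this model's relative output is not a pixel disparity from a known baseline, so without ground truth (or a metric-depth model) the scale and shift stay unknown.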
You may look into metric depth, which is tuned for absolute depth of KITTI and NYU datasets: https://github.com/LiheYoung/Depth-Anything/tree/main/metric_depth
@hgolestaniii I have the focal length and other camera intrinsics, with those values how do we get the metric depth?
@Nimisha-Pabbichetty Hello, have you solved this problem?
I'm working on the same problem as well. Were you able to figure it out?
I recommend using the metric_depth folder if you really want metric values. There, you can select the model fine-tuned on KITTI (outdoor) or NYU (indoor) images. Please note that if you use the outdoor model, you can't expect to get true depth values for an arbitrary dataset.
Those models were specifically fine-tuned on KITTI (or NYU) and come with no guarantee of producing correct metric depth for other datasets/content. If you want to estimate true metric depth values for your own dataset, you should retrain the model yourself on your own ground-truth data (image size, focal length, ...).
Hi, I used the pretrained model on my own dataset, and I found that the model outputs float32 values like 400, 500, 600. Is this the disparity value you mentioned?