What do the pixel values of the outputs mean?

Open zhongqiu1245 opened this issue 1 year ago • 4 comments

Hello! Thank you for your amazing work! Could you tell me what the pixel values of the outputs mean (before the (pred - pred.min) / (pred.max - pred.min) normalization)? I know this is relative depth, but I don't know how it is calculated or what it means. Suppose there are 2 pixels in the output, one with value 3 and the other with value 21. Does that mean the second one is 21/3 = 7 times farther away than the first? If not, how can I get this from the output of Depth-Anything? Thank you in advance!

zhongqiu1245 avatar Feb 02 '24 07:02 zhongqiu1245

Hi, in your case, the output values 3 and 21 do not mean the second pixel is 7x farther than the first pixel, for two reasons:

  • the output value denotes disparity, which can be regarded as 1/depth. Thus, the first pixel is actually farther away than the second pixel.
  • when converting the disparity to depth, the relative depth value is still related to the real depth by an unknown scaling factor and an unknown shift value. Because of the shift, the two values cannot simply be divided.
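
A minimal sketch of why the shift blocks a simple division. It assumes the mapping depth = 1 / (scale * disparity + shift); the function names and the scale and shift values are made up for illustration and do not come from the model.

```python
def disparity_to_depth(disparity, scale, shift):
    """Assumed mapping: depth = 1 / (scale * disparity + shift)."""
    return 1.0 / (scale * disparity + shift)


def depth_ratio(disp_a, disp_b, scale, shift):
    """Ratio depth(disp_a) / depth(disp_b) under the assumed mapping."""
    return disparity_to_depth(disp_a, scale, shift) / disparity_to_depth(disp_b, scale, shift)


# The same two outputs (3 and 21) under two hypothetical (scale, shift) pairs.
for scale, shift in [(0.01, 0.02), (0.01, 0.2)]:
    ratio = depth_ratio(3.0, 21.0, scale, shift)
    print(f"scale={scale}, shift={shift}: depth(3) / depth(21) = {ratio:.2f}")
```

In both cases the pixel with value 3 is the farther one, but how much farther depends on the shift. Only with a shift of exactly zero does the ratio reduce to 21/3 = 7.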

You may find our Section 3.1 helpful.

LiheYoung avatar Feb 02 '24 09:02 LiheYoung

thank you!

zhongqiu1245 avatar Feb 02 '24 13:02 zhongqiu1245

Sorry to bother you again. 1. I read your paper. Do the unknown scaling factor and the unknown shift value refer to the focal length (f) and the distance between the stereo cameras (b)? Does that mean disparity = f*b / depth?

2. If that is right, and I know the real depth of one pixel of the disparity map (for example, the pixel's disparity value is 3 and its real depth is 5 meters), can I calculate the real depth of all pixels like this?

   5 = f*b / 3 
   f*b = 15
   real_depth_of_all_pixels =  15 / disparity_of_all_pixels

3. What is the relationship between relative depth and disparity? Is relative depth = (disparity - disparity.min) / (disparity.max - disparity.min)? Thank you in advance, and sorry for my poor English and my limited knowledge of monocular depth estimation (I'm new to this field and have taken a great interest in your amazing work).

zhongqiu1245 avatar Feb 03 '24 16:02 zhongqiu1245

There's a similar discussion on the MiDaS issue board.

The formula would be something like:

true depth = 1 / (A + B*depth_anything_result)

where A and B are the unknown shift & scale factors, respectively.

Based on this formula, if you only have 1 pixel value, it won't be enough to figure out both unknowns. If you have 2 pixel values, I guess it could be enough, but it will likely be very error prone. That same issue references another post where someone was trying to do a least squares fit between a known depth map and the disparity map, which seems like a more robust approach.
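
A rough sketch of the least-squares idea, assuming the formula above, so that 1 / true depth is linear in the model output. The function names and the sample numbers are hypothetical; this is plain-Python linear regression, not the script from the referenced post.

```python
def fit_inverse_depth(pred, true_depth):
    """Least-squares fit of 1/true_depth ~ A + B * pred. Returns (A, B)."""
    inv = [1.0 / d for d in true_depth]
    n = len(pred)
    mean_x = sum(pred) / n
    mean_y = sum(inv) / n
    var_x = sum((x - mean_x) ** 2 for x in pred)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pred, inv))
    B = cov_xy / var_x          # slope (scale)
    A = mean_y - B * mean_x     # intercept (shift)
    return A, B


def predict_depth(pred, A, B):
    """Apply true depth = 1 / (A + B * pred) to each value."""
    return [1.0 / (A + B * x) for x in pred]


# Synthetic example: depths generated with A = 0.1, B = 0.4 (made-up values).
pred = [0.0, 0.25, 0.5, 1.0]
true_depth = [1.0 / (0.1 + 0.4 * x) for x in pred]
A, B = fit_inverse_depth(pred, true_depth)
print(A, B)
```

With noisy real data the fit will not be exact, which is why using many known depth values is more robust than solving from just two pixels.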

heyoeyo avatar Feb 03 '24 21:02 heyoeyo

Thanks @heyoeyo for answering! The formula may be more precise if modified to:

true depth = A * (1 / depth_anything_output) + B

A is the unknown scaling factor and B is the unknown shift.

Just as @heyoeyo said, you need at least two true depth values to estimate A and B. Of course, more true depth values will give a more precise estimate of A and B.
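
A small sketch of the two-value case for this form of the formula, true depth = A * (1 / output) + B. The function name is hypothetical, and the depths of 5 m and 1 m assigned to the outputs 3 and 21 are made-up numbers for illustration.

```python
def solve_scale_shift(out1, depth1, out2, depth2):
    """Solve depth = A * (1 / out) + B from two (output, true depth) pairs."""
    inv1, inv2 = 1.0 / out1, 1.0 / out2
    A = (depth1 - depth2) / (inv1 - inv2)  # scale
    B = depth1 - A * inv1                  # shift
    return A, B


# Hypothetical measurements: output 3 is at 5 m, output 21 is at 1 m.
A, B = solve_scale_shift(3.0, 5.0, 21.0, 1.0)
print(A, B)
```

Two pairs determine A and B exactly, so any error in either measurement goes straight into the result; with more pairs, a least-squares fit averages that error out.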

LiheYoung avatar Feb 04 '24 03:02 LiheYoung

Thank you @heyoeyo @LiheYoung! You guys helped me a lot! According to this, is the relative depth the disparity?

zhongqiu1245 avatar Feb 04 '24 04:02 zhongqiu1245

You can consider the relative depth as 1 / disparity.

LiheYoung avatar Feb 04 '24 05:02 LiheYoung

OK @LiheYoung, are the A and B you mention determined only by the camera? I mean, are they constant for a given camera, and not affected by the camera's angle or speed, or by the scene?

zhongqiu1245 avatar Feb 04 '24 08:02 zhongqiu1245

How are we calculating A and B here? Also, is the metric depth from DA different from the relative depth? @LiheYoung

Eyshika avatar Feb 05 '24 15:02 Eyshika

Hi @LiheYoung, I am interested in your work and would like to know more about its output. Are the model's output values in meters or something like that?

hoangtnm avatar Feb 05 '24 15:02 hoangtnm

@hoangtnm Only the metric model's (the model fine-tuned on the KITTI dataset) output is metric, meaning in meters. The other models output relative disparity (inverse depth) up to an unknown scale and an unknown shift, which need to be computed using ground truth (GT), for example.

kishore-greddy avatar Feb 06 '24 16:02 kishore-greddy

@Eyshika A & B can be calculated using the script here: https://gist.github.com/ranftlr/45f4c7ddeb1bbb88d606bc600cab6c8d. It is from the author of MiDaS, if I am not wrong, and they also predict relative disparity.

kishore-greddy avatar Feb 06 '24 16:02 kishore-greddy

Thank you guys! You were a great help! @heyoeyo @LiheYoung

zhongqiu1245 avatar Feb 11 '24 07:02 zhongqiu1245

@zhongqiu1245 In case you didn't get an answer to:

the A and B which you mention are only defined by camera? ...

For the formula: true depth = 1 / (A + B*depth_anything_result)

If I understand the mapping correctly, then if you normalize the depth-anything result (between 0 and 1), the values for A and B should be something like:

A = 1 / max true depth
B = (1 / min true depth) - (1 / max true depth)

So the answer would be that the values are specific to individual images (in general). They are not directly related to the camera, but instead depend on the depth of the closest and farthest parts of the image (at least in theory). You can find a more detailed explanation here: https://github.com/heyoeyo/muggled_dpt/blob/main/.readme_assets/results_explainer.md
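
A quick sketch to check those values, assuming the normalized output is 0 at the farthest point and 1 at the closest. The function names and the 2 m / 10 m range are hypothetical.

```python
def scale_shift_from_range(min_depth, max_depth):
    """A and B for true depth = 1 / (A + B * d_norm), with d_norm in [0, 1]."""
    A = 1.0 / max_depth
    B = 1.0 / min_depth - 1.0 / max_depth
    return A, B


def normalized_to_depth(d_norm, A, B):
    """Map a normalized model output to depth under the assumed formula."""
    return 1.0 / (A + B * d_norm)


# Hypothetical scene whose true depths span 2 m to 10 m.
A, B = scale_shift_from_range(2.0, 10.0)
print(normalized_to_depth(0.0, A, B))  # farthest point
print(normalized_to_depth(1.0, A, B))  # closest point
```

A normalized value of 0 maps back to the maximum depth and 1 maps back to the minimum depth, so A and B change whenever the depth range of the scene changes.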

Edit: Updated link

heyoeyo avatar Feb 11 '24 22:02 heyoeyo

Thank you! @heyoeyo

zhongqiu1245 avatar Feb 12 '24 08:02 zhongqiu1245

Thanks @heyoeyo for answering! The formula may be more precise if modified to:

true depth = A * (1 / depth_anything_output) + B

A is the unknown scaling factor and B is the unknown shift.

Just as @heyoeyo said, you need at least two true depth values to estimate A and B. Of course, more true depth values will give a more precise estimate of A and B.

It seems that the model's output has not been normalized. Should the “depth_anything_output” you mention here be the original output, or its value after normalization, (d - dmin) / (dmax - dmin)?

Zhanfury avatar Jun 14 '24 07:06 Zhanfury

“depth_anything_output” you mentioned here should be the original output or its value after normed

It would have to be normalized for those equations to work. The 'raw' output has an arbitrary scale/shift and it varies by model (i.e. the small/base/large variants), so it's hard to work with without normalization.
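
A minimal sketch of that normalization on a flat list of values, using plain min-max scaling. The function name is hypothetical, and a real prediction would be a 2D array rather than a list.

```python
def normalize(values):
    """Min-max normalize to [0, 1]: (d - d_min) / (d_max - d_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: no range to scale by, so return all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up raw outputs; the smallest maps to 0 and the largest to 1.
print(normalize([3.0, 12.0, 21.0]))
```

After this step the result no longer depends on the arbitrary scale and shift of the raw output, which is what makes the A and B expressions above usable across model variants.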

heyoeyo avatar Jun 15 '24 13:06 heyoeyo

Thanks. I've read your insights about the mapping here: https://github.com/heyoeyo/muggled_dpt/blob/main/.readme_assets/results_explainer.md and noticed that your understanding of the mapping differs from the author's.

Yours: true depth = 1 / (A + B*depth_anything_result)
The author's: true depth = A * (1 / depth_anything_output) + B

What do you think about the difference? Which one is more reasonable?

Zhanfury avatar Jun 16 '24 02:06 Zhanfury

What do you think about the difference? Which one is more reasonable?

I originally figured the two equations were the same, but was corrected (on the MiDaS issue page).

The True Depth = (A/depth_anything) + B form looks a lot more elegant, but assuming the depth anything output is normalized, it doesn't make sense if the true depth is finite. More specifically, the maximum true depth value would be mapped to an inverse value (after normalization) of 0, so you'd get:

Max true depth = (A / 0) + B

And there is no A, B that can make this work. So I tend to think the 1/(A + B*d) version is better. It also seems to properly reverse the equation used in the original MiDaS paper (eq. 6), which I assume is applicable to the depth_anything result.
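
A small sketch of the difference at the farthest point, where the normalized output is 0. The function names and the values A = 0.1, B = 0.4 are hypothetical.

```python
def depth_inverse_affine(d_norm, A, B):
    """true depth = 1 / (A + B * d)."""
    return 1.0 / (A + B * d_norm)


def depth_affine_of_inverse(d_norm, A, B):
    """true depth = A / d + B."""
    return A / d_norm + B


A, B = 0.1, 0.4  # made-up values for illustration
print(depth_inverse_affine(0.0, A, B))  # finite depth at the farthest point
try:
    depth_affine_of_inverse(0.0, A, B)
except ZeroDivisionError:
    print("A / d + B is undefined at d = 0")
```

The first form stays finite for any A > 0, while the second divides by zero regardless of which A and B are chosen.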

heyoeyo avatar Jun 16 '24 20:06 heyoeyo