
DEPTH VALUE OF EACH PIXEL

AdnanErdogn opened this issue 7 months ago • 12 comments

How can I access this information with MiDaS?

AdnanErdogn avatar Dec 31 '23 13:12 AdnanErdogn

The midas models output inverse depth maps (or images). So each pixel of the output corresponds to a value like: 1/depth

However, the mapping is also only relative, it doesn't tell you the exact (absolute) depth. Aside from noise/errors, the true depth value is shifted/scaled compared to the result you get from the midas output after inverting, so more like:

~~true depth = A + B * (1 / midas output)~~ (see post below)

Where A is some offset and B is some scaling factor, which generally aren't knowable using the midas models alone. You can try something like ZoeDepth to get actual depth values, or otherwise try fitting the midas output to some other reference depth map, like in issue #171

heyoeyo avatar Jan 04 '24 22:01 heyoeyo

According to #171, I believe the equation is: (1.0 / true_depth) = A + (B * midas_output), so then true_depth = 1.0 / (A + (B * midas_output))

JoshMSmith44 avatar Jan 07 '24 07:01 JoshMSmith44

so then true_depth = 1.0 / (A + (B * midas_output))

Good point! I was thinking these are the same mathematically, but there is a difference, and having the shifting done before inverting makes more sense.

heyoeyo avatar Jan 07 '24 17:01 heyoeyo
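
To make the difference concrete, writing m for the raw MiDaS output and keeping the same A and B notation, the two candidate mappings are:

$$\text{true depth} = A + \frac{B}{m} \qquad \text{vs.} \qquad \text{true depth} = \frac{1}{A + B\,m}$$

Inverting the second form gives 1/depth = A + B·m, which is affine in the raw output, while inverting the first gives 1/depth = m / (A·m + B), which is not; the two families only coincide when the offset A is zero.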

How are A and B calculated for a video? @JoshMSmith44

Eyshika avatar Feb 05 '24 15:02 Eyshika

How are A and B calculated for a video? @JoshMSmith44

I believe MiDas is a single-image method and therefore there is a different A and B for each frame in the video sequence.

JoshMSmith44 avatar Feb 05 '24 17:02 JoshMSmith44

But in MiDaS this is calculated using a comparison between the true depth and the calculated depth. What if we have completely new images and want to find metric depth?

Eyshika avatar Feb 06 '24 15:02 Eyshika

In order to get the true depth using the above method you need to know at least two true depth pixel values for each relative depth image you correct (realistically you want many more). These could come from a sensor, a sparse structure-from-motion point cloud, etc. If you don't have access to true depth and you need metric depth, then you should look into metric depth estimation methods like ZoeDepth, Depth-Anything, and ZeroDepth.

JoshMSmith44 avatar Feb 06 '24 18:02 JoshMSmith44
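
For anyone wanting to try this, here is a minimal numpy sketch of the fitting step described in the comment above. The file name, pixel coordinates, and depth values are purely illustrative, and it assumes the known depths come from some external source (sensor, SfM points, etc.):

```python
import numpy as np

# Illustrative inputs: a MiDaS output map plus a handful of pixels whose
# metric depth is known from another source (depth sensor, SfM points, ...)
midas_output = np.load("midas_inverse_depth.npy")   # (H, W) relative inverse depth
known_rows  = np.array([120, 340, 400,  55])        # pixel coordinates with known depth
known_cols  = np.array([200, 310, 640, 480])
known_depth = np.array([2.5, 4.1, 7.8, 1.9])        # metric depth (e.g. metres)

# Fit (1 / true_depth) = A + B * midas_output by least squares at those pixels
x = midas_output[known_rows, known_cols]
y = 1.0 / known_depth
B, A = np.polyfit(x, y, deg=1)                      # slope B, intercept A

# Apply the fit to the whole map: true_depth = 1 / (A + B * midas_output)
metric_depth = 1.0 / (A + B * midas_output)
```

With more than two known pixels the least-squares fit also averages out some of the noise mentioned earlier; with exactly two points it reduces to solving the two equations directly.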

Hi, if I just use the MiDaS output, which you said is inverse depth, to train my model, and I want to get the relative depth for an image, am I doing something wrong?

puyiwen avatar Apr 17 '24 07:04 puyiwen

Hi @heyoeyo, I want to know how a metric depth dataset (like DIML) and a relative depth dataset (like RedWeb) can be trained together. Does the metric depth dataset need to be converted to a relative depth dataset first? Can you help me? Thank you very much!!

puyiwen avatar May 07 '24 02:05 puyiwen

One of the MiDaS papers describes how the data is processed for training. The explanation starts on page 5, under the section: Training on Diverse Data

There they describe several approaches they considered, which are later compared on plots (see page 7) showing that the combination of the 'ssitrim + reg' loss functions worked the best. These loss functions are both described on page 6 (equations 7 & 11).

The explanation just above the 'ssitrim' loss is where they describe how different data sets are handled. The basic idea is that they first run their model on an input image to get a raw prediction, which is then normalized (using equation 6 in the paper). They repeat the same normalization procedure for the ground truth, and then calculate the error as abs(normalized_prediction - normalized_ground_truth_disparity), which is computed for each 'pixel' in the prediction and summed together. For the 'ssitrim' loss specifically, they ignore the top 20% largest errors when calculating the sum.

So due to the normalization step, both relative & metric depth data sources should be able to be processed/trained using the same procedure.

heyoeyo avatar May 07 '24 13:05 heyoeyo
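
As a rough illustration of that normalize-then-trim procedure (not code from the MiDaS repo, just a numpy sketch assuming the paper's per-image median / mean-absolute-deviation normalization for equation 6):

```python
import numpy as np

def normalize_disparity(d, eps=1e-8):
    # Shift by the median and scale by the mean absolute deviation (per image)
    t = np.median(d)
    s = np.mean(np.abs(d - t))
    return (d - t) / (s + eps)

def ssitrim_loss(pred_disparity, gt_disparity, trim_fraction=0.2):
    # Normalize prediction and ground truth the same way, take per-pixel
    # absolute errors, drop the largest 20%, and average what remains
    errors = np.abs(normalize_disparity(pred_disparity)
                    - normalize_disparity(gt_disparity)).ravel()
    errors = np.sort(errors)
    keep = int(round(len(errors) * (1.0 - trim_fraction)))
    return errors[:keep].mean()
```

Because both maps are normalized per image, a metric ground-truth map and a relative (disparity-only) ground-truth map end up on the same footing, which is what lets the mixed datasets be trained with one loss.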

@heyoeyo, thank you for your reply. I have another question about relative depth evaluation. Why should the output of the model (relative depth) be converted to metric depth and evaluated on metric depth datasets, like NYU and KITTI, using metrics such as RMSE and abs_rel? Why not just use a relative depth dataset for evaluation?

puyiwen avatar May 15 '24 06:05 puyiwen

I think it depends on what the evaluation is trying to show. Converting to metric depth would have the effect of more heavily weighting errors on scenes that have wider depth ranges. For example a 10% error on an indoor scene with elements that are only 10m away would be a 1m error, whereas a 10% error on an outdoor scene with objects 100m away would have a 10m error, and that might be something the authors want to prioritize (i.e. model accuracy across very large depth ranges).

It does seem strange to me that the MiDaS paper converted some results to metric depth for their experiments section though. Since it seems they just used a least squares fit to align the relative depth results with the metric ground truth (described on pg 7), it really feels like this just over-weights the performance of the model on outdoor scenes.

It makes a lot more sense to do the evaluation directly in absolute depth for something like ZoeDepth, where the model is directly predicting the metric values and therefore those 1m vs 10m errors are actually relevant to the model's capability. (but I might be missing something, I haven't really worked with metric depth data myself)

heyoeyo avatar May 15 '24 20:05 heyoeyo
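
For completeness, here is a small numpy sketch of that kind of least-squares alignment followed by the usual metric-depth metrics; the function and variable names are illustrative, not taken from the MiDaS or ZoeDepth code:

```python
import numpy as np

def align_and_evaluate(pred_relative, gt_depth, eps=1e-8):
    # Least-squares scale/shift so that (s * pred + t) matches ground-truth
    # disparity (1/depth), then report abs_rel and RMSE in depth units
    valid = gt_depth > 0
    x = pred_relative[valid].ravel()
    y = 1.0 / gt_depth[valid].ravel()

    A = np.stack([x, np.ones_like(x)], axis=1)   # solve [x 1] @ [s t]^T ≈ y
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)

    aligned_depth = 1.0 / np.clip(s * x + t, eps, None)
    gt = gt_depth[valid].ravel()

    abs_rel = np.mean(np.abs(aligned_depth - gt) / gt)
    rmse = np.sqrt(np.mean((aligned_depth - gt) ** 2))
    return abs_rel, rmse
```

Since the errors here are measured in depth units, far-away pixels dominate both metrics, which is exactly the over-weighting of wide-depth-range scenes discussed above.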