[request] Depth estimation documentation, training code and / or model weights

Open patricklabatut opened this issue 1 year ago • 37 comments

Related issues:

  • #6
  • #14
  • #46
  • #97

patricklabatut avatar Apr 24 '23 22:04 patricklabatut

Can you give an expected timeframe on when depth estimation will be available?

kfzyqin avatar Apr 27 '23 07:04 kfzyqin

I'd also be interested to hear.

tnarek avatar Apr 27 '23 14:04 tnarek

I'd also be happy if you could share the semantic segmentation heads. The one that produces the results on the web demo. Thx!

yuvfried avatar Apr 30 '23 10:04 yuvfried

Would be excellent to obtain depth estimation output per image. Supportive of this enhancement!

hblanken avatar May 04 '23 02:05 hblanken

segmentation head similar to the demo please

mirlansmind avatar May 04 '23 17:05 mirlansmind

Also interested in acquiring depth info per image, really cool!

stofe95 avatar May 07 '23 13:05 stofe95

Also very interested to have the depth estimation head model documentation (and model/weights if possible).

jonathan-besuchet avatar May 11 '23 13:05 jonathan-besuchet

@patricklabatut Thank you so much for the main code. Would you please update us about the timeline of delivering the depth-estimation code as well. Please let us know if any help is needed.

shahabe avatar May 16 '23 01:05 shahabe

Could you please release the segmentation part?

wuzihaoo avatar May 19 '23 18:05 wuzihaoo

Could you please release the segmentation part?

  • #55

woctezuma avatar May 19 '23 20:05 woctezuma

Very interested and waiting for your release!

ttppss avatar May 24 '23 14:05 ttppss

Cool!

imbinwang avatar May 26 '23 09:05 imbinwang

very interested in releasing the depth estimation head

bloodhunt3r avatar May 29 '23 11:05 bloodhunt3r

Interested in depth estimation head as well (or any documentation on how to reproduce the results using provided models)

kootsZhin avatar May 29 '23 15:05 kootsZhin

Interested in the depth part also!

ray8828 avatar May 29 '23 20:05 ray8828

@patricklabatut could you maybe shed some light on the decision not to release the depth estimation parts immediately? I'm not much into deep learning research, but if you trained and tested it, is it a lot of effort to just publish it? Or am I too naive?

JuliusJacobsohn avatar May 31 '23 10:05 JuliusJacobsohn

@patricklabatut amazing work! any approximate timeline on if/when a trained depth estimation head could be released?

Ale-Burzio avatar Jun 06 '23 09:06 Ale-Burzio

I would love to learn the news about the depth

leesunfreshing avatar Jun 09 '23 16:06 leesunfreshing

I would also appreciate an example code for depth estimation. Can't do much with the model's output embeddings yet. Thanks!

kanishkanarch avatar Jun 10 '23 15:06 kanishkanarch

Very interested in the depth estimation code! I tried to add a linear head, but I don't know how to convert the (batch_size, num_of_tokens, feature_dim) tensor to (batch_size, 256, image_width, image_height) to get the paper's result on SUNRGBD.

Cindy0725 avatar Jun 13 '23 09:06 Cindy0725

Would appreciate greatly if your pre-trained depth estimator/optical flow model is released! Can't wait to try it on my videos!

fumin avatar Jun 13 '23 13:06 fumin

Would appreciate greatly if your pre-trained depth estimator/optical flow model is released!

Thanks for your interest. Please note that we don't have an optical flow model (although one could leverage the provided backbones to train a matching head for this task).

patricklabatut avatar Jun 13 '23 21:06 patricklabatut

Would be awesome if someone trained a depth estimation head on top of the provided backbone (dinov2_vitl14_pretrain.pth). Any thoughts on who/how, and an ETA?

hblanken avatar Jun 18 '23 03:06 hblanken

I would also like to request an estimated release date for the depth estimation pre-train head. Thank you.

chenshihfang avatar Jun 21 '23 15:06 chenshihfang

Two questions about the "DPT decoder" mentioned in the depth estimation part of Sec. 7.3, Dense Recognition Tasks. I searched the DPT source code: does the "DPT decoder" refer to its refinenet? If yes, I'm curious why you chose this decoder. Thank you!

Jimlee079 avatar Jun 26 '23 12:06 Jimlee079

@patricklabatut - any updates on the depth estimation code? I am having a hard time reproducing with the same quality you show in the paper

dariocazzani avatar Jun 28 '23 00:06 dariocazzani

@patricklabatut - any updates on the depth estimation code? I am having a hard time reproducing with the same quality you show in the paper

Hah, I adapted the DPT head from its official repo to DINOv2. The accuracy is obviously lower than that in the paper.

emojilearning avatar Jun 28 '23 09:06 emojilearning

@patricklabatut - any updates on the depth estimation code? I am having a hard time reproducing with the same quality you show in the paper

Hah, I adapted the DPT head from its official repo to DINOv2. The accuracy is obviously lower than that in the paper.

Hi, what RMSE did you get for depth estimation with the DPT decoder? On NYUv2 or SUNRGBD? I am really interested in the results. Thank you very much! @emojilearning

Cindy0725 avatar Jun 28 '23 09:06 Cindy0725

Hi @patricklabatut, thanks for releasing the code and starting this issue to track progress on depth estimation.

I have tried to re-implement this but have not been successful (I was unable to achieve an RMSE below 0.52 for ViT-B/14). My re-implementation is based on the following quoted part from Sec. 7.4. There are many missing details that I filled in myself, but I cannot seem to get the reported performance. I hope this helps others who also seem to be struggling to reproduce this number, and perhaps makes it easy for the authors to highlight the key difference that would help us reproduce the depth probe.

I am basing my experiments on the part describing the simplest setup, lin. 1, for ViT-B/14, which requires training a single linear layer on top of the frozen final layer's output:

lin. 1: we extract the last layer of the frozen transformer and concatenate the [CLS] token to each patch token. Then we bi-linearly upsample the tokens by a factor of 4 to increase the resolution. Finally we train a simple linear layer using a classification loss by dividing the depth prediction range in 256 uniformly distributed bins and use a linear normalization following Bhat et al. (2021).

Below I detail my attempt, based on the details provided in the paper:

Image extraction: I assumed that you trained at a resolution similar to NYU (480x640), so I went down to 462x616, since both sides are multiples of 14 and the aspect ratio is preserved. Depending on the setup, there may or may not be augmentations. If dense features are extracted once and a layer is trained on top of them, there might be none. Alternatively, one can keep the backbone frozen and train with image augmentations. I tried both. For augmentations, I used ColorJitter, RandomResizedCrop, RandomRotation (<= 10 degrees), and RandomHorizontalFlip. With the exception of the jitter, these augmentations were applied to both images and depth.
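The 462x616 choice can be reproduced with a small helper. This is only a sketch of the arithmetic: the function name `largest_patch_aligned_size` is hypothetical, and it assumes the goal is the largest size that keeps the exact aspect ratio while making both sides multiples of the patch size.

```python
from math import gcd


def largest_patch_aligned_size(height: int, width: int, patch: int = 14) -> tuple[int, int]:
    """Largest (H, W) <= (height, width) with the same aspect ratio and H, W multiples of `patch`."""
    g = gcd(height, width)
    ratio_h, ratio_w = height // g, width // g  # e.g. 480x640 -> 3:4
    # Largest integer k such that (ratio_h * k * patch, ratio_w * k * patch) still fits.
    k = min(height // (ratio_h * patch), width // (ratio_w * patch))
    if k == 0:
        raise ValueError("image is too small for this patch size and aspect ratio")
    return ratio_h * k * patch, ratio_w * k * patch


new_h, new_w = largest_patch_aligned_size(480, 640)
grid_h, grid_w = new_h // 14, new_w // 14  # size of the patch-token grid
```

For NYU's 480x640 this gives 462x616, i.e. a 33x44 token grid.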

Feature extraction: The output tokens form a grid that is 14x smaller than the full image along each side. You can get the patch tokens and the CLS token from the output of DINOv2 and then reshape them into the correct shape as seen below. This results in an output of batch x 1536 x 33 x 44.

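import torch
import einops as E

# `image` is assumed to be a normalized float tensor of shape (B, 3, H, W) on the GPU,
# with H and W multiples of 14 (e.g. 462 x 616).
vit = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").cuda()
ret = vit.forward_features(image)

patch_tok = ret["x_norm_patchtokens"]  # (B, h*w, C)
cls_tok = ret["x_norm_clstoken"]       # (B, C)

_, _, img_h, img_w = image.shape
# Integer division: the grid sizes are used as tensor dimensions.
patch_h, patch_w = img_h // 14, img_w // 14

patch_tok = E.rearrange(patch_tok, "b (h w) c -> b c h w", h=patch_h, w=patch_w)
# Broadcast the CLS token over the whole patch grid.
cls_tok = cls_tok[:, :, None, None].repeat(1, 1, patch_h, patch_w)
output = torch.cat((patch_tok, cls_tok), dim=1)  # (B, 2C, h, w)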

Depth estimation: The paper states that they bilinearly upsample the features by a factor of 4 and then apply a linear layer. This leaves a resolution discrepancy of 3.5x with respect to the input. I tackled this by simply upsampling again to match the depth resolution. The linear layer is a simple 1x1 convolution applied to the grid that maps the features to a 256-dimensional vector representing the probabilities of the depth bins. I then apply the AdaBins uniform-bin baseline, which assigns a depth value to each of the 256 bins. The inner product of those two vectors is the output value. It is worth noting that both AdaBins and BinsFormer use adaptive bins for some minor performance gain; however, the difference in performance caused by the bin choice is much smaller than the gap I observe with respect to the paper.
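A minimal sketch of the head as described in this comment, not the authors' implementation. Assumptions: the class name `LinearDepthHead` is hypothetical, probabilities come from a softmax (the paper's "linear normalization" may differ), the bins are uniform over 0-10 m with the bin centers as depth values, and the second upsampling is applied to the predicted depth map rather than to the features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearDepthHead(nn.Module):
    """1x1 conv over upsampled frozen features -> bin probabilities -> expected depth."""

    def __init__(self, in_channels: int = 1536, n_bins: int = 256,
                 min_depth: float = 0.0, max_depth: float = 10.0):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, n_bins, kernel_size=1)
        # Uniform bins: use the center of each bin as its depth value (assumption).
        edges = torch.linspace(min_depth, max_depth, n_bins + 1)
        self.register_buffer("centers", 0.5 * (edges[:-1] + edges[1:]))

    def forward(self, feats: torch.Tensor, out_size: tuple[int, int]) -> torch.Tensor:
        # feats: (B, C, h, w) patch grid with the CLS token concatenated channel-wise.
        feats = F.interpolate(feats, scale_factor=4, mode="bilinear", align_corners=False)
        probs = self.classifier(feats).softmax(dim=1)  # (B, n_bins, 4h, 4w)
        # Inner product of bin probabilities and bin centers at every pixel.
        depth = torch.einsum("bkhw,k->bhw", probs, self.centers).unsqueeze(1)
        # Second upsampling (the remaining 3.5x) to the ground-truth resolution.
        return F.interpolate(depth, size=out_size, mode="bilinear", align_corners=False)
```

With the features from the snippet above, `LinearDepthHead()(output, out_size=(462, 616))` would return a (B, 1, 462, 616) depth map.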

Loss: This is where things get a bit confusing. The paper seems to suggest that they use BinsFormer with uniform bin sizes and 256 bins, as noted above. This is typically trained with the scale-invariant depth loss: the model estimates depth and the loss is then applied to that estimate. Using a classification loss, while possible, seemed like an odd choice. In that case, one would discretize the depth into 256 bins (I used a range of 0-10 m) and then apply a cross-entropy loss. I tried both losses, and the scale-invariant loss does better.
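The two options can be sketched as follows. This is illustrative only: the function names are hypothetical, `lam=0.85` follows the common AdaBins-style variance weighting but is an assumption here (as is omitting any extra scale factor), and `valid` is an assumed boolean mask of pixels with ground-truth depth.

```python
import torch
import torch.nn.functional as F


def silog_loss(pred, target, valid, lam: float = 0.85, eps: float = 1e-6):
    """Scale-invariant log loss on (B, H, W) depth maps; `valid` is a bool mask."""
    d = torch.log(pred[valid] + eps) - torch.log(target[valid] + eps)
    # lam = 1 makes the loss fully invariant to a global scale of the prediction.
    return torch.sqrt(torch.clamp((d ** 2).mean() - lam * d.mean() ** 2, min=eps))


def binned_ce_loss(logits, target, valid, min_depth: float = 0.0, max_depth: float = 10.0):
    """Cross-entropy over uniformly discretized depth; logits are (B, n_bins, H, W)."""
    n_bins = logits.shape[1]
    idx = ((target - min_depth) / (max_depth - min_depth) * n_bins).long()
    idx = idx.clamp(0, n_bins - 1)
    idx = idx.masked_fill(~valid, -100)  # -100 marks pixels to ignore
    return F.cross_entropy(logits, idx, ignore_index=-100)
```

Note that the scale-invariant loss is applied to the predicted depth map, while the classification loss is applied to the per-pixel bin logits before the expectation is taken.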

Optimization: I used AdamW (default parameters) with a cosine schedule for learning rate decay. I split the training data randomly at the level of room types with a train-val split of 0.7:0.3. I trained for 20 epochs; training for 100 epochs didn't seem to help much.
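A sketch of this setup under stated assumptions: the helper names `split_by_room_type` and `build_optimizer` are hypothetical, the split assigns whole room types to either train or val (one reading of "at the level of room types"), and the cosine schedule is stepped once per iteration.

```python
import random

import torch


def split_by_room_type(room_types, val_fraction: float = 0.3, seed: int = 0):
    """room_types[i] is the room-type label of sample i; returns (train_idx, val_idx)."""
    unique = sorted(set(room_types))
    random.Random(seed).shuffle(unique)
    n_val = max(1, round(len(unique) * val_fraction))
    val_types = set(unique[:n_val])
    train_idx = [i for i, r in enumerate(room_types) if r not in val_types]
    val_idx = [i for i, r in enumerate(room_types) if r in val_types]
    return train_idx, val_idx


def build_optimizer(params, total_steps: int):
    """AdamW with default hyperparameters and cosine decay over `total_steps` steps."""
    optimizer = torch.optim.AdamW(params)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler
```

Here `total_steps` would be 20 times the number of iterations per epoch, with `scheduler.step()` called after every `optimizer.step()`.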


As I noted, I have tried several different variants and none of them could achieve the performance reported in the paper. I would greatly appreciate any feedback from the authors with either their implementation or suggesting what might be different between the setup I described above and the setup used in the paper. Thank you!

mbanani avatar Jul 02 '23 23:07 mbanani

Hi @mbanani, thanks for sharing the research details. I am also working on depth estimation with the DINOv2 backbone and obtained an unexpected result. For the simplest setup, lin. 1, as stated in the paper, I first used the KITTI dataset. For data preprocessing, I just slightly resized the original RGB image so that height % 14 == 0 and width % 14 == 0, while the dense depth ground truth was resized using 'nearest' mode. I fully agree with the feature extraction step you described.
For depth estimation, I think the vision transformer backbone used in DINOv2 naturally provides features with low spatial resolution but more embedding dimensions. I was also unsure whether there are any operations that rescale the features to the original image size, instead of directly upsampling by 4 and then by 3.5. I tried a U-Net-style decoder (without skip concatenation in my case) that upsamples successively by 2, 2, 2 and 1.75. Between consecutive upsampling blocks, a Conv2d extracts features and changes the embedding dimension. Finally, the linear head was trained as a regression task using the scale-invariant loss. However, at inference, the estimated depth (on an image also taken from KITTI) was not what I expected, especially for scenes where many cars are parked along the side of the road.
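A rough sketch of such a decoder, under assumptions: the class name `ProgressiveUpsampleDecoder`, the channel widths, the 3x3 convolutions with ReLU, and the softplus on the output are all illustrative choices that the comment does not specify. The factors 2, 2, 2 and 1.75 multiply to 14, so a patch grid of h x w comes back out at 14h x 14w.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressiveUpsampleDecoder(nn.Module):
    """Upsample by 2, 2, 2, 1.75 with a conv after each step, then a 1x1 regression head."""

    def __init__(self, in_channels: int = 768, channels=(256, 128, 64, 32),
                 scales=(2, 2, 2, 1.75)):
        super().__init__()
        blocks, prev = [], in_channels
        for c in channels:
            blocks.append(nn.Sequential(nn.Conv2d(prev, c, kernel_size=3, padding=1),
                                        nn.ReLU(inplace=True)))
            prev = c
        self.blocks = nn.ModuleList(blocks)
        self.scales = scales
        self.head = nn.Conv2d(prev, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = feats  # (B, C, h, w) patch-token grid
        for block, scale in zip(self.blocks, self.scales):
            x = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
            x = block(x)
        return F.softplus(self.head(x))  # keep predicted depth non-negative (assumption)


def resize_depth_nearest(depth: torch.Tensor, size: tuple[int, int]) -> torch.Tensor:
    """Resize (B, 1, H, W) ground-truth depth without blending neighbouring values."""
    return F.interpolate(depth, size=size, mode="nearest")
```

Nearest-neighbour resizing is the usual choice for depth ground truth because bilinear interpolation would invent intermediate depths across object boundaries and around missing pixels.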

Above is my experience and opinion, thank you

YirayWang avatar Jul 04 '23 08:07 YirayWang