Depth-Anything
RGBD Model
Hi, this is a very useful model. Since it uses dinov2 as an encoder, could we make it a multi-task model by adding a semantic segmentation head as well? If anyone has tried this, or has independently tried adding a decoder to the dinov2 encoder and training it for semantic segmentation, I'm all ears!
The Depth-Anything structure is based on the 'DPT' structure from an earlier paper (MiDaS), and in that paper they mention that a DPT model can be converted from a depth estimator to a semantic segmentation model just by swapping out the 'head' component (see section A of the appendix on page 11 of the paper).
In the Depth-Anything implementation, this would correspond to replacing the last 3 lines with something like:

```python
out = self.scratch.output_conv(path_1)
out = F.interpolate(out, (int(patch_h * 14), int(patch_w * 14)), mode="bilinear", align_corners=True)
```
Where the `self.scratch.output_conv` piece would be given by the 'semantic segmentation head' structure defined in the DPT paper. Interestingly, this is actually already sort of in the Depth-Anything implementation: it gets initialized if the `nclass` setting is >1 when the model is created. However, the forward function doesn't currently switch to using this alternate head, and (as far as I'm aware) there aren't any weights for it included in the pre-trained model either. So it would require some (minor) code changes plus additional training to support the segmentation capability.
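
For reference, here's a minimal sketch of what that could look like, assuming the head follows the 3x3 conv → batch norm → ReLU → dropout → 1x1 conv structure described in the DPT paper. The `make_segmentation_head` / `segmentation_forward_tail` names and the `features`/`num_classes` arguments are just illustrative, not the actual Depth-Anything code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_segmentation_head(features: int, num_classes: int) -> nn.Sequential:
    # Segmentation head roughly following the DPT paper's appendix:
    # 3x3 conv -> BatchNorm -> ReLU -> Dropout -> 1x1 conv mapping the
    # fused decoder features to per-class logits.
    return nn.Sequential(
        nn.Conv2d(features, features, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(features),
        nn.ReLU(True),
        nn.Dropout(0.1, False),
        nn.Conv2d(features, num_classes, kernel_size=1),
    )

def segmentation_forward_tail(output_conv, path_1, patch_h, patch_w):
    # Illustrative replacement for the end of the head's forward pass:
    # run the fused feature map (path_1) through the segmentation head,
    # then upsample the logits to the input resolution (14 = dinov2 patch size).
    out = output_conv(path_1)
    out = F.interpolate(
        out, (int(patch_h * 14), int(patch_w * 14)),
        mode="bilinear", align_corners=True,
    )
    return out

# Quick shape check with dummy inputs:
head = make_segmentation_head(features=256, num_classes=150)  # e.g. ADE20K classes
logits = segmentation_forward_tail(head, torch.randn(1, 256, 74, 74), 37, 37)
print(logits.shape)  # torch.Size([1, 150, 518, 518])
```

Even with that wiring in place, the head would still need to be trained on a labeled segmentation dataset (e.g. with the dinov2 encoder frozen) before it produced anything useful, per the point above about the missing weights.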
I haven't tried any of this myself though, so I'm not sure whether it would work any better than the segmentation implementation in the dinov2 repo, but it's a neat (and in some ways simpler) alternative approach.