DinoV2 & Depth Anything V2: Bigger Models

Open jeroenvlek opened this issue 1 year ago • 17 comments

~~Sorry to open a draft again~~, but while adding more models I found out why the DinoV2 example loads its safetensors from a different location than Facebook's own HF space: Facebook's safetensors keep the query, key and value layers separate, whereas the original Python code, and thus the Candle implementation, use a single QKV layer. Furthermore, the original model file relies on separate head weights (I'm guessing the head isn't even necessary for DepthAnything, so I might remove it from the example).

So what I've done here:

  • Loaded the weights from Facebook's HF space and packed them together into a single QKV layer (see the sketch after this list)
  • Made the head optional with an optional VarBuilder
  • Downloaded all head files from Facebook's GitHub and converted them to safetensors in my own HF space
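For illustration, a minimal sketch of the repacking step, assuming candle's `Tensor::cat` (the actual safetensors key names and loading are omitted):

```rust
use candle_core::{Device, Result, Tensor};

/// Concatenate separate q/k/v projection weights, each of shape (dim, dim),
/// into the (3 * dim, dim) layout that a single fused QKV Linear expects.
fn pack_qkv(q_w: &Tensor, k_w: &Tensor, v_w: &Tensor) -> Result<Tensor> {
    Tensor::cat(&[q_w, k_w, v_w], 0)
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let dim = 8usize;
    let q = Tensor::randn(0f32, 1f32, (dim, dim), &dev)?;
    let k = Tensor::randn(0f32, 1f32, (dim, dim), &dev)?;
    let v = Tensor::randn(0f32, 1f32, (dim, dim), &dev)?;
    let qkv = pack_qkv(&q, &k, &v)?;
    assert_eq!(qkv.dims(), &[3 * dim, dim]);
    Ok(())
}
```

The biases, if present, would be packed the same way along dimension 0.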

Question: Do you agree with this approach?

Happy to adapt to other ideas, but if you agree, I will continue adding the base, large and giant models.

PS: I saw you also did some formatting on my previous PR and fixed some Clippy warnings. Maybe it would be worth having a Contributing page that details the formatting requirements and other conventions?

jeroenvlek avatar Jun 25 '24 09:06 jeroenvlek

Disclaimer: I had some extra time, so I just proceeded in this direction. Happy to receive any feedback and adapt, of course!

I've added base, large and giant. The latter only for Dino, as the DepthAnything giant model is still pending review.

Also removed the awkward config object handling and added the model to the README.

jeroenvlek avatar Jun 26 '24 08:06 jeroenvlek

There still seems to be a bug in there: comparing this large-model implementation with the web-based model's output, this branch's output has jitter and less pronounced depth. Will investigate.

// edit

I think I found it. The way they compute the patch dimensions and then use the forward method to pass those along still gives me a headache.
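For context, a rough sketch of the bookkeeping in question; the numbers are the usual DINOv2/DepthAnythingV2 values (14x14 patches, inputs resized to a multiple of 14), and the variable names are mine:

```rust
fn main() {
    // DINOv2 tokenizes the image into 14x14 patches; DepthAnythingV2
    // resizes inputs so both sides are a multiple of the patch size.
    let patch_size = 14usize;
    let (img_h, img_w) = (518usize, 518usize); // 518 = 37 * 14
    // These patch dimensions have to be threaded through the forward pass
    // so the decoder can reshape the token sequence back into a 2D map.
    let (patch_h, patch_w) = (img_h / patch_size, img_w / patch_size);
    assert_eq!((patch_h, patch_w), (37, 37));
}
```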

jeroenvlek avatar Jun 26 '24 09:06 jeroenvlek

Just noticed that load_image returns (channels, height, width) but load_image_and_resize returns (channels, width, height)

I'll open an issue for that one.

//edit

#2291
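Until that's resolved, the mismatch can be papered over at the call site; a sketch assuming candle's `permute`:

```rust
use candle_core::{DType, Device, Result, Tensor};

/// Swap the last two axes so a (c, w, h) tensor from load_image_and_resize
/// matches the (c, h, w) layout that load_image produces.
fn to_chw(img_cwh: &Tensor) -> Result<Tensor> {
    img_cwh.permute((0, 2, 1))
}

fn main() -> Result<()> {
    let img = Tensor::zeros((3usize, 640, 480), DType::F32, &Device::Cpu)?;
    assert_eq!(to_chw(&img)?.dims(), &[3, 480, 640]);
    Ok(())
}
```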

jeroenvlek avatar Jun 27 '24 08:06 jeroenvlek

Small update: Originally I thought the difference in output quality was due to model size, but there were still some bugs in there. Most seem to be fixed now. It's still quite hard to get the input conditions the same. For example, they convert from BGR to RGB, but then they resize with OpenCV and they're back to BGR (I verified this in the Python code by setting channels to zero).
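For reference, a minimal sketch of that channel flip on the candle side, assuming `index_select` (reversing the channel axis turns BGR into RGB and vice versa):

```rust
use candle_core::{Device, Result, Tensor};

/// Reverse the channel axis of a (3, h, w) image tensor (BGR <-> RGB).
fn flip_channels(img: &Tensor) -> Result<Tensor> {
    let idx = Tensor::new(&[2u32, 1, 0], img.device())?;
    img.index_select(&idx, 0)
}

fn main() -> Result<()> {
    let bgr = Tensor::randn(0f32, 1f32, (3usize, 4, 4), &Device::Cpu)?;
    let rgb = flip_channels(&bgr)?;
    assert_eq!(rgb.dims(), bgr.dims());
    Ok(())
}
```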

jeroenvlek avatar Jun 29 '24 08:06 jeroenvlek

@LaurentMazare I uploaded a new depth image of the bike picture. In your experience, do you think this could be caused by the difference between UpsampleNearest2D and F.interpolate(img, (height, width), mode="bilinear", align_corners=True)?

//edit

GPT-4o suggests it may be a source of discrepancies:

> To fully align the Rust implementation with the PyTorch interpolate function, especially for mode='bilinear' with align_corners=True, you would need a more detailed and comprehensive implementation of bilinear interpolation as shown above. The nearest-neighbor interpolation provided in the Rust code serves as a good starting point but does not handle the complexity required for bilinear interpolation with the align_corners parameter.

I'll experiment a bit with an implementation that does bilinear filtering and aligns the corners, like PyTorch does.
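Here's a minimal CPU-only sketch of what `align_corners=True` bilinear sampling does, on a flat `f32` buffer for a single channel (this is my reading of the PyTorch semantics, not Candle's internals):

```rust
/// Bilinear upsampling with align_corners semantics: the corner pixels of
/// src and dst coincide, so the scale is (src - 1) / (dst - 1).
fn bilinear_align_corners(src: &[f32], sh: usize, sw: usize, dh: usize, dw: usize) -> Vec<f32> {
    let sy = if dh > 1 { (sh - 1) as f32 / (dh - 1) as f32 } else { 0. };
    let sx = if dw > 1 { (sw - 1) as f32 / (dw - 1) as f32 } else { 0. };
    let mut dst = vec![0f32; dh * dw];
    for y in 0..dh {
        let fy = y as f32 * sy;
        let y0 = fy.floor() as usize;
        let y1 = (y0 + 1).min(sh - 1);
        let wy = fy - y0 as f32;
        for x in 0..dw {
            let fx = x as f32 * sx;
            let x0 = fx.floor() as usize;
            let x1 = (x0 + 1).min(sw - 1);
            let wx = fx - x0 as f32;
            // Interpolate horizontally on the two rows, then vertically.
            let top = src[y0 * sw + x0] * (1. - wx) + src[y0 * sw + x1] * wx;
            let bot = src[y1 * sw + x0] * (1. - wx) + src[y1 * sw + x1] * wx;
            dst[y * dw + x] = top * (1. - wy) + bot * wy;
        }
    }
    dst
}

fn main() {
    // Upsample a 2x2 gradient to 4x4; the four corners must be preserved.
    let src = [0., 1., 2., 3.];
    let dst = bilinear_align_corners(&src, 2, 2, 4, 4);
    for (i, expected) in [(0, 0.), (3, 1.), (12, 2.), (15, 3.)] {
        assert!((dst[i] - expected).abs() < 1e-5);
    }
}
```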

jeroenvlek avatar Jun 29 '24 09:06 jeroenvlek

It's hard to say, but I wouldn't expect it to make that much of a difference. I'm assuming that the image you uploaded is the python output (which looks great) vs the rust output (which is not that good). Usually, for the bits that are not supported in candle, I just modify the python code to have the candle behavior, e.g. I would try without the bilinear on the python side here to check that the python output is still meaningful.

LaurentMazare avatar Jun 29 '24 10:06 LaurentMazare

What I uploaded is the Rust output of the large model. The Python output is smoother. That's why I'm suspicious of the interpolation (it's also used deep inside the feature fusion blocks).

Let me investigate and apologies for all the noise :)

jeroenvlek avatar Jun 29 '24 10:06 jeroenvlek

@LaurentMazare I need a bit of guidance here: the pushed code follows the flow of the Python implementation fairly closely, and the logic is the same. However, the results are still not as good as they should be.

Python small:

[image: small_20240224_072209]

Rust small (this branch):

[image: 3_depth_20240224_072209]

This is still not great. So I went further and opened another branch (not pushed) where I added:

  • Image loading and resizing using OpenCV (I understand you might not want that as a dependency; this is just for testing)
  • Bilinear interpolation (CPU only; there doesn't seem to be a default CUDA kernel for it, and it seems Candle doesn't have custom kernels)
  • Lots of debug statements and in-between statistics (min, max and mean)
  • Loading the Dino weights from the DepthAnythingV2 safetensors file instead of from Facebook's

The OpenCV loading and prep brings the inputs fairly close together (still a -5.86% difference in means, though), but then, already in the DinoV2 model, things start to diverge rather badly:

model_layer_analysis.ods

Now, I'm not expecting you to look at or follow the analysis, but there are differences of over 1400% between the means there.
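For reference, a helper along these lines can produce that kind of per-layer statistic (a sketch assuming candle's reduction ops; not the exact code behind the spreadsheet):

```rust
use candle_core::{DType, Device, Result, Tensor};

/// Print min/max/mean of a tensor, to compare layer outputs between the
/// Candle and PyTorch runs.
fn stats(name: &str, t: &Tensor) -> Result<()> {
    let t = t.flatten_all()?.to_dtype(DType::F32)?;
    let min = t.min(0)?.to_scalar::<f32>()?;
    let max = t.max(0)?.to_scalar::<f32>()?;
    let mean = t.mean(0)?.to_scalar::<f32>()?;
    println!("{name}: min={min:.4} max={max:.4} mean={mean:.4}");
    Ok(())
}

fn main() -> Result<()> {
    let t = Tensor::randn(0f32, 1f32, (4usize, 4), &Device::Cpu)?;
    stats("example", &t)
}
```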

My main question would be: how closely do other ports of PyTorch models to Candle match the original numerically? How much tolerance is there for numerical deviation? I would argue that in principle there shouldn't be any deviations at all, and these statistics should match rather closely. And how chaotic can a model be, i.e. could a 5.86% difference in the input snowball into these kinds of deviations?

jeroenvlek avatar Jul 03 '24 08:07 jeroenvlek

Submitting this PR for now, as it is definitely an improvement over #2279

There are still unexplained numerical deviations in there that already start in the application of the DinoV2 blocks. Whether this is something fundamental or something obvious, I was not able to discover, but I have stared at this problem long enough :)

jeroenvlek avatar Jul 06 '24 08:07 jeroenvlek

Thanks for looking into this, the blocky results with your latest version certainly feel like some form of interpolation is off (or some upsampling of the depth results?). Anyway, the PR looks good, but would it be possible not to make changes on the dinov2 side? This model code might already be in use, so better not to break compatibility, and to reduce the diff as much as possible.

LaurentMazare avatar Jul 06 '24 09:07 LaurentMazare

I wouldn't mind doing that, but that was a point of the first post: the weights Facebook put on HF have the query, key and value tensors separate.

However, I believe the Depth Anything weights (which contain a copy of the Dino weights) are packed in the old/different format, so let me try with that.

Regarding interpolation: I tried bilinear on the candle side (as mentioned above, CPU only), and I tried nearest-neighbor interpolation inside the PyTorch implementation's interpolate_pos_encoding(), following your comment here.

According to my rudimentary analysis, however, the mean divergence already starts when fetching the intermediate blocks (so the forward method of the Dino blocks produces different results).

Anyway, happy to reduce the diff. Will work on that tomorrow :)

jeroenvlek avatar Jul 06 '24 11:07 jeroenvlek

> Thanks for looking into this, the blocky results with your latest version certainly feel like some form of interpolation is off (or some upsampling of the depth results?). Anyway, the PR looks good, but would it be possible not to make changes on the dinov2 side? This model code might already be in use, so better not to break compatibility, and to reduce the diff as much as possible.

@LaurentMazare Just to be clear: These new prefixes follow the Facebook naming and they also have the bigger models (base, large, giant). If we keep the old prefixes, then we need to compact the Q, K, V layers into a single layer, add the head weights, and export the 3 models to safetensors.

Is that really what you want? (I mean Candle uses versioning, although I do understand these are breaking changes)

jeroenvlek avatar Jul 08 '24 07:07 jeroenvlek

Back to the old prefixes. Oddly, using their weights instead of the Facebook weights yields a worse result:

[image: depth_20240224_072209]

Bilinear upsampling does improve it a bit, but I removed it because:

  • I didn't completely trust the GPT-4o-generated code, especially after I challenged GPT-4o and it changed the way strides were handled
  • To keep the PR small
  • To my knowledge there's no default CUDA kernel that does bilinear upsampling
  • It only improves the result somewhat; it doesn't fix the fundamentally blobby output

For posterity, this is with Facebook weights and CPU bilinear upscaling:

[image: depth_20240224_072209]

jeroenvlek avatar Jul 10 '24 08:07 jeroenvlek

Where are we with this? It would be great to at least get the support for loading facebook models directly. I think that should be opened as a separate PR again and merged. The rest can probably live in this PR.

metalmatze avatar Sep 06 '25 15:09 metalmatze

> Where are we with this? It would be great to at least get the support for loading facebook models directly. I think that should be opened as a separate PR again and merged. The rest can probably live in this PR.

Where are we with this? I haven't touched this in over a year, since I felt I was commenting into the void. Unfortunately, I don't have time at the moment to put the Facebook model loading into a separate PR, but feel free to cherry-pick/copy what you need.

Alternatively, we can merge this but accept that the output of DepthAnythingV2 is visibly worse than the Python implementation's, due to numerical deviations that I was unable to pin down.

jeroenvlek avatar Sep 09 '25 11:09 jeroenvlek