
Depth Anything: update conversion script for V2

Open pcuenca opened this pull request 1 year ago • 4 comments

What does this PR do?

Update the Depth Anything conversion script to support V2 models.

The only architectural change is the use of intermediate features instead of the outputs of the last 4 layers.

This is already supported by the backbone configuration, so the change simply involves updating the configuration.
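For illustration, the difference essentially comes down to which backbone stages are exposed to the neck; a minimal sketch of the configuration side is shown below (the out_indices values are hypothetical for the small variant, not taken from the final script):

from transformers import DepthAnythingConfig, Dinov2Config

# DINOv2 backbone exposing intermediate layers rather than the last four.
# The indices are hypothetical for the small variant; the conversion script
# derives the actual values from the checkpoint being converted.
backbone_config = Dinov2Config(
    image_size=518,
    patch_size=14,
    out_indices=[3, 6, 9, 12],
)

config = DepthAnythingConfig(backbone_config=backbone_config)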

Converted models (no model card or license information):

Pending to do, if this approach is accepted:

  • Complete the model cards and transfer the models to the https://huggingface.co/depth-anything organization, assuming the authors agree to it.
  • Update docs.
  • Update tests, if necessary.

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

@NielsRogge, @amyeroberts cc @LiheYoung, @bingykang

pcuenca avatar Jun 20 '24 16:06 pcuenca

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pcuenca, thank you for your conversion. Have you compared the predictions of the converted transformers model with those of our original V2 codebase? I previously made a similar modification to your current PR in a cloned transformers, but I found that the results could not be exactly aligned at this verification line: there is a gap of around 1e-2 between the two models' predictions.
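(For context, the verification step in a conversion script is usually an element-wise tolerance check of this shape; the function below is a minimal sketch with hypothetical names and values, not the actual line in question:)

import torch

def verify_against_reference(predicted_depth: torch.Tensor, expected_slice: torch.Tensor, atol: float = 1e-4):
    # Compare a small slice of the converted model's predicted depth against
    # values produced by the original codebase; a ~1e-2 gap fails a tight tolerance.
    actual = predicted_depth[0, :expected_slice.shape[0], :expected_slice.shape[1]]
    if not torch.allclose(actual, expected_slice, atol=atol):
        raise ValueError(f"Predictions diverge, max abs diff: {(actual - expected_slice).abs().max():.4f}")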

LiheYoung avatar Jun 20 '24 21:06 LiheYoung

Hi @LiheYoung, thanks for checking!

Yes, I could exactly replicate the results of the small version of the model by applying the same inputs to both the original and the transformers implementations. The reference implementation I used was the one from your demo Space: I saved the depth output for the second example image (the sunflowers) as a numpy array and verified transformers inference with the following code:

from transformers import AutoModelForDepthEstimation, AutoProcessor
from PIL import Image
import torch
import torch.nn.functional as F
import numpy as np
import cv2
from torchvision.transforms import Compose

# Resize, NormalizeImage, PrepareForNet copied from the original Depth Anything V2 source code
from depth_anything_transform import *

model_id = "pcuenq/Depth-Anything-V2-Small-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForDepthEstimation.from_pretrained(model_id).eval()

image = Image.open("space/Depth-Anything-V2/examples/demo02.jpg")
w, h = image.size

# Manually pre-process to match the original source code
# The transformers pre-processor produces slightly different values for some reason

transform = Compose([
    Resize(
        width=518,
        height=518,
        resize_target=False,
        keep_aspect_ratio=True,
        ensure_multiple_of=14,
        resize_method='lower_bound',
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])
pixel_values = np.array(image) / 255.0
pixel_values = transform({'image': pixel_values})['image']
pixel_values = torch.from_numpy(pixel_values).unsqueeze(0)

with torch.inference_mode():
    # Outputs using the manual DA2 pre-processing above
    outputs = model(pixel_values=pixel_values, output_hidden_states=False)

    # Outputs using the transformers processor
    inputs = processor(images=image, return_tensors="pt")
    outputs_transformers = model(**inputs, output_hidden_states=False)

# Compare with results from the same image obtained with https://huggingface.co/spaces/depth-anything/Depth-Anything-V2
def compare_with_reference(outputs, reference_depth, filename):
    depth = outputs["predicted_depth"]
    depth = F.interpolate(depth[:, None], (h, w), mode="bilinear", align_corners=True)[0, 0]
    max_diff = np.abs(depth - reference_depth).max()
    mean_diff = np.abs(depth - reference_depth).mean()
    print(f"Sum of absolute differences vs baseline: {np.sum(np.abs(depth.numpy() - reference_depth))}")
    print(f"Difference using transformers processor, max: {max_diff}, mean: {mean_diff}")

    # raw_depth = Image.fromarray(depth.numpy().astype('uint16'))
    depth = (depth - depth.min()) / (depth.max() - depth.min()) * 255.0
    depth = depth.numpy().astype(np.uint8)
    # colored_depth = (cmap(depth)[:, :, :3] * 255).astype(np.uint8)

    gray_depth = Image.fromarray(depth)
    gray_depth.save(filename)

reference_depth = np.load("space/Depth-Anything-V2/depth_gradio.npy")
compare_with_reference(outputs, reference_depth, "gray_depth.png")
compare_with_reference(outputs_transformers, reference_depth, "gray_depth_transformers.png")

Results are identical when the same pre-processing steps are used, but not when using the transformers pre-processor. I assume most of the difference comes from the resampling algorithms (the original code uses OpenCV, while transformers uses PIL). I also assume (but didn't check) that the same processor differences affect the v1 models as well.
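As a standalone illustration of that resampling difference (not part of the PR), resizing the same image with OpenCV's INTER_CUBIC and PIL's BICUBIC produces slightly different pixel values; the magnitude depends on the image:

import numpy as np
import cv2
from PIL import Image

# Random RGB image resized to the model input resolution with both libraries.
rng = np.random.default_rng(0)
image = (rng.random((480, 640, 3)) * 255).astype(np.uint8)

resized_cv = cv2.resize(image, (518, 518), interpolation=cv2.INTER_CUBIC).astype(np.float32)
resized_pil = np.asarray(
    Image.fromarray(image).resize((518, 518), resample=Image.BICUBIC), dtype=np.float32
)

# The two bicubic implementations agree closely but not exactly, and the small
# differences propagate into the predicted depth.
print("max abs pixel difference:", np.abs(resized_cv - resized_pil).max())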

cc @NielsRogge in case he has additional insight

pcuenca avatar Jun 22 '24 17:06 pcuenca

Hi @pcuenca, thank you for your clarification and efforts! I checked the sample code and also found slight differences between transformers' bicubic interpolation and the OpenCV cubic interpolation used by our original code. This seems unavoidable in the current transformers implementation, so I am okay with this pull request. Thank you.

LiheYoung avatar Jun 29 '24 01:06 LiheYoung

Thank you @LiheYoung! Can we move the transformers checkpoints to your https://huggingface.co/depth-anything organization? (I can update the model cards before we do).

pcuenca avatar Jul 01 '24 11:07 pcuenca

Sure @pcuenca, thank you all!

LiheYoung avatar Jul 02 '24 01:07 LiheYoung

Thanks @amyeroberts @NielsRogge for the guidance! The test failure seems unrelated, but I'm happy to revisit if necessary.

@LiheYoung I transferred the models to your organization and updated the model cards; feel free to make changes or create a collection :)

pcuenca avatar Jul 05 '24 13:07 pcuenca

Merging as the failing tests are unrelated to this PR

amyeroberts avatar Jul 05 '24 18:07 amyeroberts

Thank you for all your efforts! I will link our repository to these models.

LiheYoung avatar Jul 06 '24 01:07 LiheYoung