
How to use depth embeddings

softmurata opened this issue on May 10, 2023 · 6 comments

Thanks for the great work! I want to use the depth embeddings in ImageBind, but I cannot get good results. Please explain how to use the depth embeddings.

・ Run a depth estimator and create a depth image:

from transformers import DPTFeatureExtractor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image

feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

text = "bird"
image = Image.open(f"/content/ImageBind/.assets/{text}_image.jpg")

encoding = feature_extractor(image, return_tensors="pt")

# forward pass
with torch.no_grad():
    outputs = model(**encoding)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],  # PIL size is (W, H); interpolate expects (H, W)
    mode="bicubic",
    align_corners=False,
).squeeze()

# scale to 8-bit grayscale and save as an image
output = prediction.cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
img = Image.fromarray(formatted)
img.save(f"/content/ImageBind/.assets/{text}_depth.jpg")

・ After that, run inference with the following code:

import torch
from torchvision import transforms
from PIL import Image

def load_and_transform_depth_data(depth_paths, device):
    if depth_paths is None:
        return None

    data_transform = transforms.Compose(
        [
            transforms.Resize(
                224, interpolation=transforms.InterpolationMode.BICUBIC
            ),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            # transforms.Normalize((0.5,), (0.5,))  # with this normalization I cannot get good results...
        ]
    )

    depth_outputs = []
    for depth_path in depth_paths:
        with open(depth_path, "rb") as fopen:
            image = Image.open(fopen).convert("L")  # single-channel depth

        image = data_transform(image).to(device)
        depth_outputs.append(image)
    return torch.stack(depth_outputs, dim=0)


import data
import torch
from models import imagebind_model
from models.imagebind_model import ModalityType

text_list=["A dog.", "A car", "A bird"]
image_paths=[".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths=[".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]
depth_paths = [".assets/dog_depth.jpg", ".assets/car_depth.jpg", ".assets/bird_depth.jpg"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
    ModalityType.DEPTH: load_and_transform_depth_data(depth_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Depth: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Text x Depth: ",
    torch.softmax(embeddings[ModalityType.TEXT] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Depth x Audio: ",
    torch.softmax(embeddings[ModalityType.DEPTH] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)

・ Output:

Vision x Depth:  tensor([[0.3444, 0.3040, 0.3516],
        [0.3451, 0.2363, 0.4186],
        [0.3517, 0.3634, 0.2849]], device='cuda:0')
Text x Depth:  tensor([[9.5571e-01, 4.4270e-02, 1.5210e-05],
        [5.6266e-01, 4.3734e-01, 9.7014e-10],
        [4.6230e-06, 1.0000e+00, 7.2704e-15]], device='cuda:0')
Depth x Audio:  tensor([[1.9618e-01, 1.4769e-02, 7.8905e-01],
        [1.5248e-02, 4.6171e-03, 9.8014e-01],
        [1.5896e-04, 1.8075e-02, 9.8177e-01]], device='cuda:0')

Please reply!

softmurata · May 10, 2023

Same question. The paper says that the depth maps are transformed into disparity maps. Will this matter? @softmurata
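
For reference, a minimal sketch of such a depth-to-disparity conversion, assuming a simple pinhole-camera model; the baseline and focal length below are typical Kinect-style values, not confirmed ImageBind settings:

import numpy as np

def depth_to_disparity(depth_m, baseline=0.075, focal_length=518.85):
    # depth_m: depth map in meters. baseline (meters) and focal_length
    # (pixels) are sensor-dependent assumptions, not values from the paper.
    depth_m = np.clip(depth_m, 1e-3, None)       # avoid division by zero
    disparity = baseline * focal_length / depth_m
    return disparity / disparity.max()           # normalize to [0, 1]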

ZrrSkywalker · May 24, 2023

Same question. I also tried the depth-to-disparity conversion code by @imisra from here, but still did not get reasonable results.

antonioo-c · May 31, 2023

I am also interested in how to use the depth embeddings properly; I am not getting good results either.

omaralvarez · Jun 10, 2023

Not sure if it is because the dog/car/bird cases do not appear in ImageBind's training set.

StanLei52 · Aug 27, 2023

We can use absolute depth in meters for inference with this repo.
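
If the depth maps are metric, a sketch of loading one and converting it to a normalized disparity tensor; it assumes a 16-bit PNG storing depth in millimeters (common for Kinect-style datasets), and the baseline/focal constants are the same assumptions as in the earlier sketch:

import numpy as np
import torch
from PIL import Image

def load_metric_depth(path, device, baseline=0.075, focal_length=518.85):
    # Assumption: a 16-bit PNG with absolute depth in millimeters.
    depth_m = np.asarray(Image.open(path), dtype=np.float32) / 1000.0
    disparity = baseline * focal_length / np.clip(depth_m, 1e-3, None)
    disparity = disparity / disparity.max()                  # scale to [0, 1]
    return torch.from_numpy(disparity).unsqueeze(0).to(device)  # (1, H, W)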

llziss4ai · Oct 14, 2023

@imisra Hello, I want to know how to preprocess the disparity map obtained by this code. Thanks!

I filtered samples of the 19 classes and got 34.51 top-1 accuracy on SUN RGB-D only.
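
For reference, a sketch of how such a zero-shot top-1 number could be computed from ImageBind embeddings; the prompt templates and label tensor are assumptions, not from this thread:

import torch

@torch.no_grad()
def zero_shot_top1(depth_emb, text_emb, labels):
    # depth_emb: (N, D) depth embeddings; text_emb: (C, D), one embedding per
    # class prompt (e.g. "a photo of a {class}"); labels: (N,) ground truth.
    logits = depth_emb @ text_emb.T   # similarity of each sample to each class
    preds = logits.argmax(dim=-1)     # most similar class per sample
    return (preds == labels).float().mean().item()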

xiaos16 · Mar 28, 2024