How to use depth embeddings
Thanks for the great work! I want to use the depth embeddings in ImageBind, but I cannot get good results. Could you explain how to use them?
・Run a depth estimator and create a depth image:
```python
from transformers import DPTFeatureExtractor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image

feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

text = "bird"
image = Image.open(f"/content/ImageBind/.assets/{text}_image.jpg")
encoding = feature_extractor(image, return_tensors="pt")

# forward pass
with torch.no_grad():
    outputs = model(**encoding)
    predicted_depth = outputs.predicted_depth

# interpolate to the original image size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
).squeeze()

# rescale to 8 bits and save as an image
output = prediction.cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
img = Image.fromarray(formatted)
img.save(f"/content/ImageBind/.assets/{text}_depth.jpg")
```
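One note (my own suggestion, not part of the original recipe): saving the prediction as an 8-bit JPEG quantizes it and adds compression artifacts. Keeping the raw floating-point map, e.g. as a `.npy` file, preserves the full range for later preprocessing:

```python
# Optional: also keep the raw float map for lossless downstream preprocessing.
np.save(f"/content/ImageBind/.assets/{text}_depth.npy", prediction.cpu().numpy())
```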
・After that, run inference with the following code:
```python
from torchvision import transforms
from PIL import Image
import torch

def load_and_transform_depth_data(depth_paths, device):
    if depth_paths is None:
        return None

    depth_outputs = []
    for depth_path in depth_paths:
        data_transform = transforms.Compose(
            [
                transforms.Resize(
                    224, interpolation=transforms.InterpolationMode.BICUBIC
                ),
                transforms.CenterCrop(224),
                transforms.ToTensor(),
                # transforms.Normalize((0.5,), (0.5,))  # if I use this normalization, I cannot get good results...
            ]
        )
        # load the depth map as a single-channel image
        with open(depth_path, "rb") as fopen:
            image = Image.open(fopen).convert("L")
        image = data_transform(image).to(device)
        depth_outputs.append(image)
    return torch.stack(depth_outputs, dim=0)
```
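Regarding the commented-out normalization: one alternative worth trying is per-sample standardization instead of a fixed mean/std. This scheme is an assumption on my part (the released repo does not appear to ship an official depth loader), not the preprocessing ImageBind was trained with:

```python
import torch

def standardize_depth(depth_tensor, eps=1e-6):
    # depth_tensor: (1, H, W) float tensor from ToTensor().
    # Zero mean, unit variance per sample, instead of Normalize((0.5,), (0.5,)).
    return (depth_tensor - depth_tensor.mean()) / (depth_tensor.std() + eps)
```

It would be applied right after `data_transform(image)` in the loop above.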
```python
import data
import torch
from models import imagebind_model
from models.imagebind_model import ModalityType

text_list = ["A dog.", "A car", "A bird"]
image_paths = [".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths = [".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]
depth_paths = [".assets/dog_depth.jpg", ".assets/car_depth.jpg", ".assets/bird_depth.jpg"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
    ModalityType.DEPTH: load_and_transform_depth_data(depth_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Depth: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Text x Depth: ",
    torch.softmax(embeddings[ModalityType.TEXT] @ embeddings[ModalityType.DEPTH].T, dim=-1),
)
print(
    "Depth x Audio: ",
    torch.softmax(embeddings[ModalityType.DEPTH] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)
```
・Output:

```
Vision x Depth:  tensor([[0.3444, 0.3040, 0.3516],
        [0.3451, 0.2363, 0.4186],
        [0.3517, 0.3634, 0.2849]], device='cuda:0')
Text x Depth:  tensor([[9.5571e-01, 4.4270e-02, 1.5210e-05],
        [5.6266e-01, 4.3734e-01, 9.7014e-10],
        [4.6230e-06, 1.0000e+00, 7.2704e-15]], device='cuda:0')
Depth x Audio:  tensor([[1.9618e-01, 1.4769e-02, 7.8905e-01],
        [1.5248e-02, 4.6171e-03, 9.8014e-01],
        [1.5896e-04, 1.8075e-02, 9.8177e-01]], device='cuda:0')
```
Please reply!
Same question. The paper said that the depth maps are transformed into disparity maps. Will this matter? @softmurata
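For context, the conversion described in the paper amounts to `disparity = baseline * focal_length / depth`. Here is a minimal sketch of that step; the 0.075 m baseline, millimeter depth units, and clipping range are assumptions based on common SUN RGB-D conventions, not values confirmed by the paper:

```python
import numpy as np
import torch
from PIL import Image

def convert_depth_to_disparity(depth_file, focal_length, baseline=0.075,
                               min_depth=0.01, max_depth=50.0):
    # Assumes a 16-bit depth image stored in millimeters (SUN RGB-D style).
    depth_mm = np.asarray(Image.open(depth_file)).astype(np.float32)
    depth_m = (depth_mm / 1000.0).clip(min_depth, max_depth)
    # Stereo relation: disparity is inversely proportional to depth.
    disparity = baseline * focal_length / depth_m
    return torch.from_numpy(disparity).float()
```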
Same question. I also tried the depth-map-to-disparity-map code by @imisra from here, but still did not get reasonable results.
I am also interested in how to use the depth embeddings properly; I am not getting good results either.
Not sure if it is because the dog/car/bird cases do not appear in the training set of ImageBind.
We can use absolute depth in meters for inference with this repo.
@imisra Hello, I would like to know how to preprocess the disparity map obtained from this code. Thanks!
I filtered samples of the 19 classes and got a top-1 accuracy of 34.51 on SUN RGB-D only.
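For anyone trying to reproduce that kind of number, here is a hypothetical sketch of the zero-shot protocol (the class names, prompt template, and ground-truth labels are illustrative placeholders, not taken from the comment above); it reuses `model`, `device`, `depth_paths`, and `load_and_transform_depth_data` from the code earlier in this thread:

```python
import torch
import data
from models.imagebind_model import ModalityType

# Illustrative subset; a full run would use all 19 SUN RGB-D scene classes.
classes = ["bathroom", "bedroom", "classroom", "kitchen", "office"]
prompts = [f"A photo of a {c}." for c in classes]
labels = torch.tensor([0, 1, 2])  # placeholder ground-truth class indices

with torch.no_grad():
    text_emb = model({ModalityType.TEXT: data.load_and_transform_text(prompts, device)})[ModalityType.TEXT]
    depth_emb = model({ModalityType.DEPTH: load_and_transform_depth_data(depth_paths, device)})[ModalityType.DEPTH]

# Zero-shot prediction: nearest text embedding for each depth map.
pred = (depth_emb @ text_emb.T).argmax(dim=-1).cpu()
top1 = (pred == labels).float().mean().item()
print(f"top-1 accuracy: {top1:.4f}")
```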