
extracting image features

Open · simoneriggi opened this issue on Sep 12, 2024 · 0 comments

Dear all, I would like to extract the features (a representation vector) of an input image from the pretrained SigLIP vision model. I tried the following, but I don't think this is the correct approach, and I don't understand the tensor shapes returned by these methods:

  • process_images(): returns a tensor of shape [1, 10, 3, 384, 384]. Is the second dimension (10) related to patches? Can I disable that?
  • encode_images(): returns a tensor of shape [1, 729, 896]. I eventually need a 1-D vector of image features, so this is likely not the right method to get them (see the pooling sketch below).

Can you help me?
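
For context, this is the kind of reduction I had in mind once I have the [1, 729, 896] output: a naive mean pool over the 729 token positions. This is my own sketch, not something from the repo, and I don't know whether averaging SigLIP patch tokens gives a meaningful global image embedding:

# - POOLING SKETCH (my assumption, not from the repo)
# image_features has shape [1, 729, 896] after encode_images() below;
# averaging over the 729 token positions gives one 896-d vector.
pooled = image_features.mean(dim=1)  # shape [1, 896]
vec = pooled.squeeze(0)              # shape [896], a 1-D feature vector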

# - IMPORTS (assuming the usual LLaVA-NeXT package layout)
import requests
import torch
from PIL import Image

from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images

# - LOAD MODEL
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)

model.eval()

## - LOAD IMAGE
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)  ## This returns a tensor of shape [1, 10, 3, 384, 384]

## - SELECT FIRST PATCH (??)
t = image_tensor[:, 0, :, :, :].to(dtype=torch.float16, device=device)

## - EXTRACT FEATURES
image_features = model.encode_images(t)  ## This returns a tensor of shape [1, 729, 896]
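
I also tried going through the vision tower directly, which (if I read the code correctly) should give the raw SigLIP features before the multimodal projector. Again just a sketch based on my reading: I am assuming get_vision_tower() exposes the SigLIP encoder here, and I have not checked its output hidden size:

## - RAW SigLIP FEATURES (sketch; assumes model.get_vision_tower() returns the SigLIP encoder)
vision_tower = model.get_vision_tower()
raw_features = vision_tower(t)  ## expected shape [1, 729, hidden_size], pre-projector
global_vec = raw_features.mean(dim=1).squeeze(0)  ## naive mean pool down to a 1-D vector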
