LLaVA-NeXT
extracting image features
Dear all, I would like to extract the features (a representation vector) of an input image from the pretrained SigLIP vision model. I tried the following, but I don't think this is the correct approach, and I don't understand the tensor dimensions returned by these methods:

- process_images(): returns a tensor of shape [1, 10, 3, 384, 384]. Is the second dimension (10) related to patches? Can I disable that?
- encode_images(): returns a tensor of shape [1, 729, 896]. Eventually I need a 1-D vector of image features, so this is likely not the correct method to get them.

Can you help me?
import requests
import torch
from PIL import Image
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images

# - LOAD MODEL
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.eval()
## - LOAD IMAGE
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config) ## This returns a tensor of shape [1,10,3,384,384]
## - SELECT FIRST PATCH (??)
t = image_tensor[:, 0, :, :, :].to(dtype=torch.float16, device=device)
## - EXTRACT FEATURES
image_features = model.encode_images(t)  ## This returns a tensor of shape [1, 729, 896]
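For context, here is what I tried to turn the [1, 729, 896] output into a single vector. This is only a sketch of my assumption, not something I found in the repo: I am guessing that the 729 dimension is the per-patch token axis (a 27x27 grid of SigLIP patches at 384px) and that averaging over it is a valid way to get one 1-D embedding. The random tensor below just stands in for the real encode_images() output, so the snippet runs without the model loaded:

```python
import torch

# Stand-in for the real output of model.encode_images(t): [batch, num_tokens, hidden]
image_features = torch.randn(1, 729, 896)

# Assumption: mean-pool over the 729 patch tokens to get one embedding per image
pooled = image_features.mean(dim=1)  # shape [1, 896]
vec = pooled.squeeze(0)              # shape [896], the 1-D feature vector

print(vec.shape)  # torch.Size([896])
```

Is mean pooling over the token axis the right way to do this here, or does the model expose a dedicated pooled/CLS-style feature I should use instead?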