Vishaal Udandarao
I believe this is the first variant of CLIP-Adapter that they describe in Section 4.1.1 of the paper (Training Settings). The exact quote from the paper is: "The first variant...
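In case it helps, here is a minimal sketch of what I understand that variant to be, assuming "the first variant" refers to the bottleneck adapter applied to the frozen visual features with a residual blend; the feature dimension, `reduction` factor, and `alpha` below are illustrative placeholders rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: two linear layers with ReLU activations,
    # applied on top of frozen CLIP features.
    def __init__(self, dim=1024, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.fc(x)

# Residual blend of adapted and original features (alpha = residual ratio).
alpha = 0.2
image_features = torch.randn(8, 1024)  # placeholder for frozen CLIP image features
adapted = Adapter(dim=1024)(image_features)
blended = alpha * adapted + (1 - alpha) * image_features
```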
Hey, could you please share the evaluation script you're using?
Did anybody manage to find some useful resources for this?
Thanks for the comments -- I've updated the head PR comment and the eval samples to incorporate them. Please take a look and let me know if this works! Note:...
Yes, I agree with @MaxFBurg. Are there any plans to implement this?
Thanks for your response @haotian-liu. I tried replacing these lines in your eval script `llava.eval.run_llava.py`:

```python
qs = args.query
if mm_use_im_start_end:
    qs = qs + '\n' + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN...
```
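The replacement itself is cut off above; as a rough sketch (not the exact code from the comment), the prompt construction could repeat the image-token block once per image in a comma-separated `args.image_file` list, reusing `image_token_len` and the token constants from the original `run_llava.py`:

```python
# Hypothetical sketch: build one image-token block per image in a
# comma-separated list. args, mm_use_im_start_end, image_token_len and the
# DEFAULT_* constants are the names used in the original eval script.
image_files = args.image_file.split(",")
if mm_use_im_start_end:
    image_block = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
else:
    image_block = DEFAULT_IMAGE_PATCH_TOKEN * image_token_len
qs = args.query + '\n' + '\n'.join([image_block] * len(image_files))
```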
Yes, this is the code I updated:

```python
image = load_image(args.image_file)
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
```

with

```python
image_tensor = torch.stack(
    [
        image_processor.preprocess(load_image(image_file), return_tensors="pt")["pixel_values"][0]
        for image_file in args.image_file.split(",")
    ]
)
```
...
Thanks @haotian-liu, so I assume the above implementation for single-turn multi-image inference is correct, but it's an OOD problem due to the current training set-up of the model...
@cyril-mino Sorry, I don't get your question -- what do you mean by the structure of the data input? I just pass in two images to the model as a list of...
Hi Adriel, as far as I recall (I must admit I haven't looked at this script in over a month), the `model.generate` function was able to take in multiple input...
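For context, a minimal sketch of how the stacked tensor from the earlier snippet could be passed to `model.generate` in a `run_llava.py`-style script; the sampling settings and the decode step are illustrative and may differ across LLaVA versions.

```python
# Sketch only: pass the stacked multi-image tensor to generate,
# mirroring the single-image call in the original eval script.
# model, tokenizer, input_ids and image_tensor come from that script.
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,                          # tokenized prompt containing the image tokens
        images=image_tensor.half().cuda(),  # shape: (num_images, 3, H, W)
        do_sample=True,
        temperature=0.2,
        max_new_tokens=512,
    )
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```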