LLaVA
LLaVA-1.5 Inference Without Images Not Working Properly
Question
Hello! I would like to run an experiment using LLaVA-1.5 without images (to clarify: it is important for my work to prompt LLaVA-1.5 itself without images, not Vicuna-13B). I know this is certainly possible and was about to create my own mask for model.generate, but given that LLaVA handles inputs in its own way, I thought I would check whether there is a more integrated solution. For example, in the following call from model_vqa_loader.py, I would like to generate a response without the image:
output_ids = model.generate(
    input_ids,
    images=image_tensor.to(dtype=torch.float16, device='cuda', non_blocking=True),
    do_sample=True if args.temperature > 0 else False,
    temperature=args.temperature,
    top_p=args.top_p,
    num_beams=args.num_beams,
    max_new_tokens=args.max_new_tokens,
    use_cache=True)
Based on the architecture in llava_arch.py, it seems that simply passing None for the image should be enough, given the following code:
def prepare_inputs_labels_for_multimodal(
    self, input_ids, position_ids, attention_mask, past_key_values, labels, images
):
    vision_tower = self.get_vision_tower()
    if vision_tower is None or images is None or input_ids.shape[1] == 1:
        if past_key_values is not None and vision_tower is not None and images is not None and input_ids.shape[1] == 1:
            target_shape = past_key_values[-1][-1].shape[-2] + 1
            attention_mask = torch.cat((attention_mask, torch.ones(
                (attention_mask.shape[0], target_shape - attention_mask.shape[1]),
                dtype=attention_mask.dtype,
                device=attention_mask.device
            )), dim=1)
            position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
        return input_ids, position_ids, attention_mask, past_key_values, None, labels
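For concreteness, a minimal sketch of the text-only call that this early-return branch implies (this is an assumption, not code from the repo; it presumes a loaded LLaVA-1.5 model and tokenizer, and a prompt that is plain text with no image placeholder in it):

# Sketch of text-only generation: no image tensor, plain tokenization.
# Assumes `model` and `tokenizer` were loaded via the usual LLaVA loaders
# and `prompt` is an already-formatted conversation string without <image>.
import torch

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=None,            # should hit the early-return branch above
        do_sample=False,
        max_new_tokens=256,
        use_cache=True,
    )

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())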
But setting images=None yields many CUDA errors that look like:
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [286,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
I also encountered the same problem. Does anyone know the solution for the inference of LLaVA-1.5 without images?
Strange. Passing None should be sufficient -- ScienceQA has such pure-text questions and it works fine there.
https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/model_vqa_science.py#L58
@haotian-liu @chancharikmitra
Thanks for your reply. By referring to model_vqa_science.py, I found that I had added DEFAULT_IMAGE_TOKEN to qs, which causes the errors. Removing DEFAULT_IMAGE_TOKEN solves them.
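In other words, the text-only case just requires leaving the question string alone; a rough before/after sketch (qs is the question variable from the eval scripts):

from llava.constants import DEFAULT_IMAGE_TOKEN

# What triggered the CUDA asserts: an image placeholder with no image tensor behind it.
qs_broken = DEFAULT_IMAGE_TOKEN + "\n" + qs

# What works for image-free inference: the raw question, with images=None at generate time.
qs_ok = qs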
Thanks for your feedback!
@AtsuMiyai @haotian-liu It still doesn't work for me. I have image files whose use I want to make optional.
I updated run_llava.py, line 114, as follows:
if args.image_file is not None:
    image_files = image_parser(args)
    images = load_images(image_files)
    image_sizes = [x.size for x in images]
    images_tensor = process_images(
        images, image_processor, model.config
    ).to(model.device, dtype=torch.float16)
else:
    images_tensor = None
    image_sizes = None

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .cuda()
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=None if images_tensor is None else images_tensor,
        image_sizes=image_sizes,
        do_sample=True if args.temperature > 0 else False,
        temperature=args.temperature,
        top_p=args.top_p,
        num_beams=args.num_beams,
        max_new_tokens=args.max_new_tokens,
        use_cache=True,
    )
The error looks like:
ndex: block: [120,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
@copperwiring I am no longer working on this issue, but I think the solution is likely similar to before, except that the library has been updated for 1.6. I would check the inputs to see if there are any image-specific tokens that don't need to be there anymore. Perhaps others working more directly on this have found the exact edit.
Thanks! @chancharikmitra
@copperwiring
Thanks for your question!
Could you check whether image_token_se is added in L61-70 of run_llava.py? image_token_se is unnecessary for this case, so you can skip those lines.
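For reference, skipping those lines effectively means only inserting image tokens when an image is actually supplied; a rough sketch of such a guard (the placement and the use of args.image_file here are assumptions, not the repo's code):

# Only decorate the prompt with image tokens when an image file is given.
from llava.constants import (
    DEFAULT_IMAGE_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_IM_END_TOKEN,
)

if args.image_file is not None:
    if model.config.mm_use_im_start_end:
        qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + qs
    else:
        qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
# else: leave qs as plain text so no image placeholder reaches the tokenizer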
@AtsuMiyai Not really, because in the following:
qs = args.query
image_token_se = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
if IMAGE_PLACEHOLDER in qs:
    if model.config.mm_use_im_start_end:
        qs = re.sub(IMAGE_PLACEHOLDER, image_token_se, qs)
    else:
        qs = re.sub(IMAGE_PLACEHOLDER, DEFAULT_IMAGE_TOKEN, qs)
else:
    if model.config.mm_use_im_start_end:
        qs = image_token_se + "\n" + qs
    else:
        qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
qs = args.query doesn't contain any image; the image is passed via args.image_file. So the only line it hits in that if/else block is qs = DEFAULT_IMAGE_TOKEN + "\n" + qs, which doesn't involve image_token_se, but it does add DEFAULT_IMAGE_TOKEN (https://github.com/haotian-liu/LLaVA/blob/3e337ad269da3245643a2724a1d694b5839c37f9/llava/eval/run_llava.py#L70). Should we remove this too?
Tangentially, what do image_token_se and DEFAULT_IMAGE_TOKEN actually do, by the way?
From what I see, the code breaks here:
https://github.com/haotian-liu/LLaVA/blob/3e337ad269da3245643a2724a1d694b5839c37f9/llava/model/language_model/llava_llama.py#L135
but I can't see where the input goes wrong.
Never mind. Removing DEFAULT_IMAGE_TOKEN indeed fixed it, but if someone can explain why, that would be very helpful.
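A likely explanation, based on the prepare_inputs_labels_for_multimodal snippet above: tokenizer_image_token replaces DEFAULT_IMAGE_TOKEN ("<image>") in the prompt with the placeholder id IMAGE_TOKEN_INDEX (-200 in llava/constants.py). When an image is supplied, that placeholder is swapped for the projected vision features before the language model runs; when images is None, the early-return branch passes input_ids through unchanged, so the embedding lookup sees an id outside the vocabulary, which is consistent with the indexSelectLargeIndex assertions above. A small sketch of the difference (tokenizer is assumed to be the loaded LLaVA tokenizer; the question string is made up):

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import tokenizer_image_token

question = "Describe the rules of chess."

# With the image placeholder: tokenizer_image_token splices in -200, which is
# only valid if an image tensor later replaces it.
ids_with_token = tokenizer_image_token(
    DEFAULT_IMAGE_TOKEN + "\n" + question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
)
assert (ids_with_token == IMAGE_TOKEN_INDEX).any()   # out-of-vocab id if images=None

# Without it: ordinary token ids only, safe to embed with images=None.
ids_text_only = tokenizer(question, return_tensors="pt").input_ids
assert not (ids_text_only == IMAGE_TOKEN_INDEX).any()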