LLaVA
LLaVA-1.5 Inference Without Images Not Working Properly
Question
Hello! I would like to run an experiment using LLaVA-1.5 without images (to clarify: it is important for my work to prompt LLaVA-1.5 itself without images, not Vicuna-13B). I know this is certainly possible and was about to create my own mask for model.generate, but given that LLaVA handles inputs in its own way, I thought I would check whether there is a more integrated solution. For example, in the following call from model_vqa_loader.py, I would like to generate a response without the image:
output_ids = model.generate(
    input_ids,
    images=image_tensor.to(dtype=torch.float16, device='cuda', non_blocking=True),
    do_sample=True if args.temperature > 0 else False,
    temperature=args.temperature,
    top_p=args.top_p,
    num_beams=args.num_beams,
    max_new_tokens=args.max_new_tokens,
    use_cache=True)
Based on the architecture in llava_arch.py, it seems that simply passing None for the image should be enough, given the following code:
def prepare_inputs_labels_for_multimodal(
    self, input_ids, position_ids, attention_mask, past_key_values, labels, images
):
    vision_tower = self.get_vision_tower()
    if vision_tower is None or images is None or input_ids.shape[1] == 1:
        if past_key_values is not None and vision_tower is not None and images is not None and input_ids.shape[1] == 1:
            target_shape = past_key_values[-1][-1].shape[-2] + 1
            attention_mask = torch.cat((attention_mask, torch.ones(
                (attention_mask.shape[0], target_shape - attention_mask.shape[1]),
                dtype=attention_mask.dtype,
                device=attention_mask.device
            )), dim=1)
            position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
        return input_ids, position_ids, attention_mask, past_key_values, None, labels
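For concreteness, a minimal sketch of the text-only call that this early-return branch implies (this is an assumption, not code from the repo; it presumes a loaded LLaVA-1.5 model and tokenizer, and a prompt that is plain text with no image placeholder in it):

# Sketch of text-only generation: no image tensor, plain tokenization.
# Assumes `model` and `tokenizer` were loaded via the usual LLaVA loaders
# and `prompt` is an already-formatted conversation string without <image>.
import torch

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=None,            # should hit the early-return branch above
        do_sample=False,
        max_new_tokens=256,
        use_cache=True,
    )

print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())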
But setting images=None yields many CUDA errors that look like:
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [286,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
I also encountered the same problem. Does anyone know the solution for the inference of LLaVA-1.5 without images?
Strange. Passing None should be sufficient -- ScienceQA has such pure-text questions and it works fine there.
https://github.com/haotian-liu/LLaVA/blob/main/llava/eval/model_vqa_science.py#L58
@haotian-liu @chancharikmitra
Thanks for your reply. By referring to model_vqa_science.py, I found that I had added DEFAULT_IMAGE_TOKEN to qs, which causes the errors. Removing DEFAULT_IMAGE_TOKEN solves them.
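In other words, the text-only case just requires leaving the question string alone; a rough before/after sketch (qs is the question variable from the eval scripts):

from llava.constants import DEFAULT_IMAGE_TOKEN

# What triggered the CUDA asserts: an image placeholder with no image tensor behind it.
qs_broken = DEFAULT_IMAGE_TOKEN + "\n" + qs

# What works for image-free inference: the raw question, with images=None at generate time.
qs_ok = qs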
Thanks for your feedback!
@AtsuMiyai @haotian-liu It still doesn't work for me. I have image files whose use I want to make optional.
I updated run_llava.py, line 114, as follows:
if args.image_file is not None:
    image_files = image_parser(args)
    images = load_images(image_files)
    image_sizes = [x.size for x in images]
    images_tensor = process_images(
        images, image_processor, model.config
    ).to(model.device, dtype=torch.float16)
else:
    images_tensor = None
    image_sizes = None

input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .cuda()
)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=None if images_tensor is None else images_tensor,
        image_sizes=image_sizes,
        do_sample=True if args.temperature > 0 else False,
        temperature=args.temperature,
        top_p=args.top_p,
        num_beams=args.num_beams,
        max_new_tokens=args.max_new_tokens,
        use_cache=True,
    )
The error looks like:
ndex: block: [120,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1711403388920/work/aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [120,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
@copperwiring I am no longer working on this issue, but I think the solution is likely similar to before, except that the library has been updated for 1.6. I would check the inputs to see if there are any image-specific tokens that don't need to be there anymore. Perhaps others working more directly on this have found the exact edit.
Thanks! @chancharikmitra
@copperwiring
Thanks for your question!
Could you check whether image_token_se is added in L61-70 of run_llava.py? image_token_se is unnecessary for this case, so you can skip those lines.
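For reference, skipping those lines effectively means only inserting image tokens when an image is actually supplied; a rough sketch of such a guard (the placement and the use of args.image_file here are assumptions, not the repo's code):

# Only decorate the prompt with image tokens when an image file is given.
from llava.constants import (
    DEFAULT_IMAGE_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_IM_END_TOKEN,
)

if args.image_file is not None:
    if model.config.mm_use_im_start_end:
        qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + qs
    else:
        qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
# else: leave qs as plain text so no image placeholder reaches the tokenizer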
@AtsuMiyai Not really, because in the following:
qs = args.query
image_token_se = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
if IMAGE_PLACEHOLDER in qs:
    if model.config.mm_use_im_start_end:
        qs = re.sub(IMAGE_PLACEHOLDER, image_token_se, qs)
    else:
        qs = re.sub(IMAGE_PLACEHOLDER, DEFAULT_IMAGE_TOKEN, qs)
else:
    if model.config.mm_use_im_start_end:
        qs = image_token_se + "\n" + qs
    else:
        qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
qs = args.query doesn't contain any image; the image is passed via args.image_file. So the only line it hits in that if/else block is qs = DEFAULT_IMAGE_TOKEN + "\n" + qs, which doesn't involve image_token_se, but it does add DEFAULT_IMAGE_TOKEN (https://github.com/haotian-liu/LLaVA/blob/3e337ad269da3245643a2724a1d694b5839c37f9/llava/eval/run_llava.py#L70). Should we remove this too?
Tangentially, what do image_token_se and DEFAULT_IMAGE_TOKEN actually do, by the way?
From what I see, the code breaks here:
https://github.com/haotian-liu/LLaVA/blob/3e337ad269da3245643a2724a1d694b5839c37f9/llava/model/language_model/llava_llama.py#L135
but I can't see where the input goes wrong.
Never mind. Removing DEFAULT_IMAGE_TOKEN indeed fixed it, but if someone can explain why, that would be very helpful.
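A likely explanation, based on the prepare_inputs_labels_for_multimodal snippet above: tokenizer_image_token replaces DEFAULT_IMAGE_TOKEN ("<image>") in the prompt with the placeholder id IMAGE_TOKEN_INDEX (-200 in llava/constants.py). When an image is supplied, that placeholder is swapped for the projected vision features before the language model runs; when images is None, the early-return branch passes input_ids through unchanged, so the embedding lookup sees an id outside the vocabulary, which is consistent with the indexSelectLargeIndex assertions above. A small sketch of the difference (tokenizer is assumed to be the loaded LLaVA tokenizer; the question string is made up):

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import tokenizer_image_token

question = "Describe the rules of chess."

# With the image placeholder: tokenizer_image_token splices in -200, which is
# only valid if an image tensor later replaces it.
ids_with_token = tokenizer_image_token(
    DEFAULT_IMAGE_TOKEN + "\n" + question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
)
assert (ids_with_token == IMAGE_TOKEN_INDEX).any()   # out-of-vocab id if images=None

# Without it: ordinary token ids only, safe to embed with images=None.
ids_text_only = tokenizer(question, return_tensors="pt").input_ids
assert not (ids_text_only == IMAGE_TOKEN_INDEX).any()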