
Always gives the same response

kehanlu opened this issue 1 year ago · 5 comments

Hi, I have loaded your pre-trained weights and tried some instructions. However, I found the model responded with the same answer no matter what image I gave.

model = MM_LLMs.from_pretrained(
        "trained_model/mm_llms_trainer",
        config = model_config,
    )
model.eval()
# ...

instruction = "How many boats are in the picture?"
template = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

input_ids = tokenizer.encode(template.format(instruction))
eos_token_id = tokenizer.eos_token_id
if eos_token_id in input_ids:
    input_ids.remove(eos_token_id)
input_ids = torch.tensor([input_ids], dtype=torch.int).to(device)

# image
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000492606.jpg"))
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000344896.jpg"))
image = preprocess(Image.open("data/image_sample/COCO_train2014_000000407061.jpg"))
image = image.unsqueeze(0)

with torch.no_grad():
    bs = 1
    
    inputs = {
        "videos": None,
        "images": image.half(),
        "audios": None,
        "input_ids": input_ids,
        'image_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<image>')] * bs, dtype=torch.int),
        'image_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</image>')] * bs, dtype=torch.int),
        'audio_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<audio>')] * bs, dtype=torch.int),
        'audio_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</audio>')] * bs, dtype=torch.int),
        'video_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<video>')] * bs, dtype=torch.int),
        'video_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</video>')] * bs, dtype=torch.int),
    }

    for k,v in inputs.items():
        if v is not None:
            inputs[k] = v.to(device)
    inputs['inference'] = True
    
    
    text_embeddings, attention_mask, labels, debug = model.prepare_inputs_for_generation(inputs)
    
    print()
    print(text_embeddings.size())
        

    model_output = model.llm(inputs_embeds=text_embeddings, attention_mask=attention_mask, labels=labels)
    generate_ids = model.llm.generate(inputs_embeds=text_embeddings, max_new_tokens=128, eos_token_id=2, bos_token_id=1, pad_token_id=32006)

    # print the prompt and the decoded response (producing the output below)
    print(template)
    print("=" * 40)
    print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0])
    print("=" * 40)
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
How many boats are in the picture?

### Response:
========================================
There are 5000 in the picture.
========================================

No matter which image I give the model, it always replies "There are 5000 in the picture." to the same prompt. It seems the model simply ignores the multi-modal inputs and responds based on the text alone.

Did I do anything wrong? Thank you.

kehanlu · Jul 20 '23

How did you get the tokenizer?
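
One thing worth checking first (a quick sketch, assuming a standard Hugging Face tokenizer that was supposed to be extended with the multimodal marker tokens): if the tokenizer is missing those markers, convert_tokens_to_ids silently falls back to the unknown-token id and the image span is never located.

# Sanity check: verify the multimodal marker tokens actually exist in the tokenizer.
# If any of them maps to the unk id, the image/audio/video boundaries are lost.
for tok in ["<image>", "</image>", "<audio>", "</audio>", "<video>", "</video>"]:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    print(tok, tok_id, "MISSING" if tok_id == tokenizer.unk_token_id else "ok")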

Regarding your problem, I think it may be because you are calling model.llm, which is just the LLaMA part; in that case the Whisper and CLIP parts are not used at all.

From what I understand, we can run the model by:

model.eval()
with torch.no_grad():
    generate_ids = model(data_item)
input_texts = TOKENIZER.batch_decode(data_item["input_ids"], skip_special_tokens=True, clean_up_tokenization_spaces=False)
generated_texts = TOKENIZER.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_texts)
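
For reference, here is a rough sketch of what data_item could look like (I'm assuming it mirrors the inputs dict from your first post; the exact keys the forward pass expects may differ):

# Hypothetical data_item, assembled the same way as the inputs dict above
# (the key names here are an assumption, not taken from the repo).
data_item = {
    "videos": None,
    "audios": None,
    "images": image.half().to(device),
    "input_ids": input_ids.to(device),
    "image_starts": torch.tensor([tokenizer.convert_tokens_to_ids("<image>")], dtype=torch.int).to(device),
    "image_ends": torch.tensor([tokenizer.convert_tokens_to_ids("</image>")], dtype=torch.int).to(device),
    "audio_starts": torch.tensor([tokenizer.convert_tokens_to_ids("<audio>")], dtype=torch.int).to(device),
    "audio_ends": torch.tensor([tokenizer.convert_tokens_to_ids("</audio>")], dtype=torch.int).to(device),
    "video_starts": torch.tensor([tokenizer.convert_tokens_to_ids("<video>")], dtype=torch.int).to(device),
    "video_ends": torch.tensor([tokenizer.convert_tokens_to_ids("</video>")], dtype=torch.int).to(device),
    "inference": True,
}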

chatsci · Jul 22 '23

Hi, thanks for sharing the information. We are currently looking into it.

lyuchenyang · Jul 22 '23

Hi @chatsci, my code is adapted from llm_trainer.py and modeling.py:

https://github.com/lyuchenyang/Macaw-LLM/blob/d03e59d24e12b97390ab687652977fc4407e537b/llm_trainer.py#L466-L489

https://github.com/lyuchenyang/Macaw-LLM/blob/d03e59d24e12b97390ab687652977fc4407e537b/modeling.py#L952-L963

I call these functions from inside the model's forward() to make testing easier. prepare_inputs_for_generation prepares the multi-modal tokens for the LLM: it encodes the multi-modal features and concatenates them with the text instruction.
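
Conceptually, that step does something like the following (a simplified illustration only, not the actual Macaw-LLM code; clip_encoder and projection stand in for the real modules):

# Simplified sketch: encode the image, project it into the LLM embedding space,
# and splice it between the <image>/</image> marker embeddings ahead of the text.
embed = model.llm.get_input_embeddings()
image_features = clip_encoder(images)            # (bs, n_patches, d_clip)
image_embeds = projection(image_features)        # (bs, n_patches, d_llm)
start_embeds = embed(image_starts).unsqueeze(1)  # (bs, 1, d_llm)
end_embeds = embed(image_ends).unsqueeze(1)      # (bs, 1, d_llm)
text_embeds = embed(input_ids)                   # (bs, seq_len, d_llm)
text_embeddings = torch.cat([start_embeds, image_embeds, end_embeds, text_embeds], dim=1)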

I'm fairly sure the input tokens for the LLM contain the image tokens. While testing, though, I noticed the model appears to disregard the image input and generate its responses from the text portion alone.
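
One way to narrow this down (a debugging sketch reusing the inputs dict from my first message) is to check whether two different images actually change the embeddings handed to the LLM:

# Prepare inputs for two different images and compare the resulting embeddings.
# If the difference is ~0, the image pathway is being ignored before the text
# ever reaches the LLM.
img_a = preprocess(Image.open("data/image_sample/COCO_train2014_000000407061.jpg")).unsqueeze(0).half().to(device)
img_b = preprocess(Image.open("data/image_sample/COCO_train2014_000000344896.jpg")).unsqueeze(0).half().to(device)
with torch.no_grad():
    emb_a, _, _, _ = model.prepare_inputs_for_generation(dict(inputs, images=img_a))
    emb_b, _, _, _ = model.prepare_inputs_for_generation(dict(inputs, images=img_b))
print((emb_a - emb_b).abs().max())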

kehanlu · Jul 24 '23

Hi, thanks for sharing this information with us. I think the cause may be an incompatibility issue within the code. As I'm currently traveling, I will look into it as soon as I'm back. Would you mind sending the code you used to my email ([email protected]) so I can take a look?

lyuchenyang · Jul 24 '23

Hey @lyuchenyang, I have been experiencing the same issue during inference. Are there any updates on this? Thank you.

dbountouridis · Oct 30 '23