Macaw-LLM
Always getting the same response
Hi, I loaded your pre-trained weights and tried some instructions. However, the model responds with the same answer no matter which image I give it.
model = MM_LLMs.from_pretrained(
    "trained_model/mm_llms_trainer",
    config=model_config,
)
model.eval()
# ...
instruction = "How many boats are in the picture?"
template = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"
input_ids = tokenizer.encode(template.format(instruction))
eos_token_id = tokenizer.eos_token_id
if eos_token_id in input_ids:
input_ids.remove(eos_token_id)
input_ids = torch.tensor([input_ids], dtype=torch.int).to(device)
# image
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000492606.jpg"))
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000344896.jpg"))
image = preprocess(Image.open("data/image_sample/COCO_train2014_000000407061.jpg"))
image = image.unsqueeze(0)
with torch.no_grad():
    bs = 1
    inputs = {
        "videos": None,
        "images": image.half(),
        "audios": None,
        "input_ids": input_ids,
        "image_starts": torch.tensor([tokenizer.convert_tokens_to_ids('<image>')] * bs, dtype=torch.int),
        "image_ends": torch.tensor([tokenizer.convert_tokens_to_ids('</image>')] * bs, dtype=torch.int),
        "audio_starts": torch.tensor([tokenizer.convert_tokens_to_ids('<audio>')] * bs, dtype=torch.int),
        "audio_ends": torch.tensor([tokenizer.convert_tokens_to_ids('</audio>')] * bs, dtype=torch.int),
        "video_starts": torch.tensor([tokenizer.convert_tokens_to_ids('<video>')] * bs, dtype=torch.int),
        "video_ends": torch.tensor([tokenizer.convert_tokens_to_ids('</video>')] * bs, dtype=torch.int),
    }
    for k, v in inputs.items():
        if v is not None:
            inputs[k] = v.to(device)
    inputs['inference'] = True
    text_embeddings, attention_mask, labels, debug = model.prepare_inputs_for_generation(inputs)
    print(text_embeddings.size())
    model_output = model.llm(inputs_embeds=text_embeddings, attention_mask=attention_mask, labels=labels)
    generate_ids = model.llm.generate(inputs_embeds=text_embeddings, max_new_tokens=128, eos_token_id=2, bos_token_id=1, pad_token_id=32006)
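For reference, the output below was obtained by decoding generate_ids; a minimal sketch of the decoding step I omitted above, using the standard Hugging Face batch_decode:

    # Decode the generated ids and print them next to the prompt
    response = tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0]
    print(template.format(instruction=instruction))
    print("=" * 40)
    print(response)
    print("=" * 40)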
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
How many boats are in the picture?
### Response:
========================================
There are 5000 in the picture.
========================================
No matter which image I give the model, it always replies "There are 5000 in the picture." to the same prompt. It seems the model just ignores the multi-modal inputs and replies based on the text alone.
Did I do anything wrong? Thank you.
How did you get the tokenizer?
Regarding your problem, I think it may be because you are using model.llm, which is just the LLaMA part. In that case, it seems the Whisper and CLIP parts are not used.
From what I understand, we can run the model by:
model.eval()
with torch.no_grad():
    generate_ids = model(data_item)
input_texts = TOKENIZER.batch_decode(data_item["input_ids"], skip_special_tokens=True, clean_up_tokenization_spaces=False)
generated_texts = TOKENIZER.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(generated_texts)
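Here data_item is assumed to be a dict shaped like the inputs constructed earlier in this thread (hypothetical; check the repo's dataloader for the exact keys):

    # Assumed structure of data_item, mirroring the inputs dict built above
    data_item = {
        "videos": None,
        "images": image.half().to(device),
        "audios": None,
        "input_ids": input_ids.to(device),
        # plus the <image>/</image>, <audio>/</audio>, <video>/</video>
        # start/end token-id tensors, exactly as in the earlier snippet
    }

Calling the full model(...) this way should route the image through the CLIP encoder before the LLaMA decoder, rather than going straight to model.llm.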
Hi, thanks for sharing the information. We are currently checking it.
Hi @chatsci,
My code is modified from llm_trainer.py and modeling.py:
https://github.com/lyuchenyang/Macaw-LLM/blob/d03e59d24e12b97390ab687652977fc4407e537b/llm_trainer.py#L466-L489
https://github.com/lyuchenyang/Macaw-LLM/blob/d03e59d24e12b97390ab687652977fc4407e537b/modeling.py#L952-L963
I call the functions inside the model's forward() to make testing easier. The function prepare_inputs_for_generation prepares the multi-modal tokens for the LLM that follows (it encodes the multi-modal features and concatenates them with the text instruction).
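Conceptually, the prepared input looks roughly like this (an illustrative, self-contained sketch with made-up shapes and token ids, not the actual code from modeling.py):

    import torch
    import torch.nn as nn

    hidden = 4096                                    # LLaMA-7B hidden size
    embed = nn.Embedding(32000, hidden)              # stand-in for the LLM token embeddings
    text_embeds = embed(torch.randint(0, 32000, (1, 10)))  # instruction tokens, (1, 10, hidden)
    image_embeds = torch.randn(1, 32, hidden)        # stand-in for projected CLIP image features
    img_start = embed(torch.tensor([[101]]))         # stand-in for the <image> token embedding
    img_end = embed(torch.tensor([[102]]))           # stand-in for the </image> token embedding
    # Concatenate [<image>] + image features + [</image>] + instruction, fed as inputs_embeds
    text_embeddings = torch.cat([img_start, image_embeds, img_end, text_embeds], dim=1)
    print(text_embeddings.size())                    # torch.Size([1, 44, 4096])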
I'm fairly sure the input tokens for the LLM contain image tokens. While testing, though, I noticed that the model appears to disregard the image input and generates responses based on the text portion alone.
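As a quick sanity check (my own hypothetical diagnostic, not from the repo), one can prepare the inputs twice with two different images and compare the resulting embeddings; if they are identical, the image features never reach the LLM:

    # Override only the "images" entry of the inputs dict built earlier
    inputs_a = dict(inputs, images=preprocess(Image.open("data/image_sample/COCO_train2014_000000492606.jpg")).unsqueeze(0).half().to(device))
    inputs_b = dict(inputs, images=preprocess(Image.open("data/image_sample/COCO_train2014_000000344896.jpg")).unsqueeze(0).half().to(device))
    emb_a, _, _, _ = model.prepare_inputs_for_generation(inputs_a)
    emb_b, _, _, _ = model.prepare_inputs_for_generation(inputs_b)
    print(torch.allclose(emb_a, emb_b))  # True would mean the image input is effectively ignored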
Hi, thanks for sharing this information with us. I think the possible reason could be an incompatibility issue within the code. As I'm currently traveling, I will look into it as soon as my travel is finished. Would you mind sending the code you used to my email, [email protected], so I can take a look?
Hey @lyuchenyang, I have been experiencing the same issue during inference. Are there any updates on this? Thank you.