🐛 [BUG] llava-v1.6-mistral-7b fail to generate right response via 'mistral_instruct' template
Description
I wrote an inference script like this:
import torch
from PIL import Image
import sys

sys.path.append('./')

from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria


def main():
    disable_torch_init()

    image = 'llava/serve/examples/extreme_ironing.jpg'
    inp = 'What is unusual about this image?'
    model_path = 'liuhaotian/llava-v1.6-mistral-7b'
    conv_mode = sys.argv[1]

    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name)

    # Build the prompt from the conversation template passed on the command line.
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    image = Image.open(image)
    image_tensor = processor.preprocess(image, return_tensors='pt')['pixel_values']
    if type(image_tensor) is list:
        tensor = [image.to(model.device, dtype=torch.float16) for image in image_tensor]
    else:
        tensor = image_tensor.to(model.device, dtype=torch.float16)

    print(f"{roles[0]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()

    # The stop string depends on the template's separator style.
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)


if __name__ == '__main__':
    main()
If I run the command python inference.py mistral_instruct, the script generates empty output.
If I run the command python inference.py llava_v1, the script generates normal output:
city setting with traffic. It is also not typical to see someone standing on the back of a vehicle, as it can be dangerous and is generally not allowed. The man's actions are likely intended to be humorous or to draw attention to a specific cause or event. </s>
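To see where the two templates diverge, one quick check is to print the prompt and stop string each one produces before calling generate. The sketch below only reuses the conversation helpers already imported in the script above; it does not require loading the model:

from llava.constants import DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates, SeparatorStyle

# Compare what each conversation template actually feeds to the model.
question = DEFAULT_IMAGE_TOKEN + '\n' + 'What is unusual about this image?'

for mode in ['mistral_instruct', 'llava_v1']:
    conv = conv_templates[mode].copy()
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    # Same stop-string logic as in the inference script above.
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    print(f"--- {mode} ---")
    print(repr(conv.get_prompt()))
    print('stop_str:', repr(stop_str))

If the stop string derived for mistral_instruct turns out to be empty or appears at the start of generation, the KeywordsStoppingCriteria would halt decoding immediately, which would be consistent with the empty output reported above.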
I tried to finetune llava-v1.6-mistral-7b with the mistral_instruct template, but the output was not in the expected format. Have you figured out which template llava-v1.6-mistral-7b uses?
Did you solve it? What version did you use for pretraining and finetuning, by the way?
The codebase does not support LLaVA 1.6 training and I haven't solved it, but I'm going to work on this in the coming days. I used the latest code to finetune llava-v1.6-mistral-7b.
I think the llava-v1.6-mistral-7b model uses the llava_llama_2 conversation template. You can try it out!
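If that is the case, the only change needed in the script above is the conversation mode. A minimal sketch, assuming the llava_llama_2 key is registered in conv_templates in your checkout of llava/conversation.py:

from llava.constants import DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

# Build the prompt with the llava_llama_2 template instead of mistral_instruct.
conv = conv_templates['llava_llama_2'].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + '\n' + 'What is unusual about this image?')
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())

# Equivalent to running the original script as:
#   python inference.py llava_llama_2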