mlx-vlm Fine-tuning labeling queries

Thank you for your outstanding work! My example data looks like this:

"messages": [
            {
                "role": "system",
                "content": "你是一个文博工作人员，请严谨认真地回答用户提出的问题"
            },
            {
                "role": "user",
                "content": "给我讲讲这个东西"
            },
            {
                "role": "assistant",
                "content": "# 文物名称\n神面纹玉戚\n# 时代\n新石器时代\n# 类别\n玉器\n# 收藏机构\n山西博物院\n# 尺寸\n高20.6厘米，宽13.1厘米，厚0.4厘米\n# 出土时间\n1964年\n# 出土地点\n黎城县后庄村广志山\n# 特征描述\n线刻，一侧为神人半侧面头像，头戴冠饰，披拂长发，眼眉清晰可辨；一侧为一方形台座上置神人冠饰的形象，神秘威严\n# 象征意义\n象征军事统帅权力"
            }
        ],
        "images": [
            "Qwen2-VL-History/images/instance_1579398113581395972.jpg"
        ]

after apply_chat_template:

examples["messages"]: [{'content': [{'type': 'text', 'text': '你是一个文博工作人员，请严谨认真地回答用户提出的问题'}], 'role': 'system'},
{'content': [{'type': 'text', 'text': '给我讲讲这个东西'}, {'type': 'image'}], 'role': 'user'},
{'content': [{'type': 'text', 'text': '# 文物名称\n神面纹玉戚\n# 时代\n新石器时代\n# 类别\n玉器\n# 收藏机构\n山西博物院\n# 尺寸\n高20.6厘米，宽13.1厘米，厚0.4厘米\n# 出土时间\n1964年...晰可辨；一侧为一方形台座上置神人冠饰的形象，神秘威严\n# 象征意义\n象征军事统帅权力'}], 'role': 'assistant'}]

The final prompt:

'<|im_start|>system\n你是一个文博工作人员，请严谨认真地回答用户提出的问题<|im_end|>\n<|im_start|>user\n给我讲讲这个东西<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n<|im_start|>assistant\n# 文物名称\n神面纹玉戚\n# 时代\n新石器时代\n# 类别\n玉器\n# 收藏机构\n山西博物院\n# 尺寸\n高20.6厘米，宽13.1厘米，厚0.4厘米\n# 出土时间\n1964年\n# 出土地点\n黎城县后庄村广志山\n# 特征描述\n线刻，一侧为神人半侧面头像，头戴冠饰，披拂长发，眼眉清晰可辨；一侧为一方形台座上置神人冠饰的形象，神秘威严\n# 象征意义\n象征军事统帅权力<|im_end|>\n'

input_ids.shape: (1,509)

I found that in the loss_fn function,

labels = input_ids[:, 1:] 
input_ids = input_ids[:, :-1]

Why is the label like this? Is there something wrong with my data format? Is it possible to get correct results with this kind of training, without distinguishing between question and answer My loss has always been around 8.

After fine-tuning, my output looks like this:

# Load the model
model_path = "./model/Qwen/Qwen__Qwen2.5-VL-7B-Instruct"
adapter_path = "./outputs/Qwen2.5vl_lora_7b_bs1"
model, processor = load(model_path, adapter_path=adapter_path)
config = load_config(model_path)

# Prepare input
image = ["./Qwen2-VL-History/images/instance_1552220458182610945.jpg"]
messages = [{"role": "system", "content": "你是一个文博工作人员，请严谨认真地回答用户提出的问题"},
    {"role": "user", "content": "给我讲讲这个东西"}
]
formatted_prompt = apply_chat_template(
    processor, config, messages, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)

output:
('\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，，', {'input_tokens': 956, 'output_tokens': 256, 'total_tokens': 1212, 'prompt_tps': 309.8756828763058, 'generation_tps': 12.745029043710783, 'peak_memory': 17.891366498})

May 21 '25 12:05 Greywan

I also encountered the same problem, have you solved it?

Jul 01 '25 05:07 2657666247

it would be much easier if the docs provided the dataset format. how is the image saved in the hf dataset? PIL format? base64?

Jul 12 '25 16:07 msciancalepore98