What exactly does minigptv2_checkpoint.pth do in Emotion-LLaMA?
Hi! First of all, thanks for the amazing work on Emotion-LLaMA and MiniGPT-v2 — really impressive and inspiring!
I’m currently fine-tuning Emotion-LLaMA on a custom emotion dataset using a fine-tuned LLaMA-2 (or LLaMA-3) model, and I’ve been testing a few different configurations.
Here’s what I found:
If I remove the minigptv2_checkpoint.pth (i.e., set ckpt: null) and fine-tune using my own LLaMA-2 model, the model trains, but the inference results are broken — it just repeats the same sentence or outputs incoherent text.
Even when I use the original LLaMA-2 model but exclude the minigptv2 checkpoint, I get the same issue: the model doesn't follow instructions and generates repetitive output.
So I’m wondering:
What exactly does minigptv2_checkpoint.pth contain?
Does it hold something essential like alignment between <VideoHere>, <FeatureHere> tokens and the language model?
Are LoRA layers involved in a way that makes this checkpoint necessary?
Is it possible to start from scratch without this ckpt and still get good results?
Any insight or clarification would be super helpful. Thanks again for making this project publicly available!
P.S. We will probably not use MiniGPT-v2 features such as object detection or the task identifier; we only plan to use the LLM's ability to answer emotion classification and reasoning questions.
Hi! Thank you for your thoughtful question and for your interest in Emotion-LLaMA!
MiniGPT-v2 is a powerful multimodal foundation model capable of interpreting visual inputs to answer questions. Emotion-LLaMA builds on MiniGPT-v2 by introducing additional encoders, such as an audio encoder, to support multimodal emotion reasoning.
During training, we load minigptv2_checkpoint.pth to leverage MiniGPT-v2’s prior knowledge of visual understanding. Even though Emotion-LLaMA doesn’t explicitly use features like object detection or task identifiers, these capabilities implicitly help the model localize faces, interpret scenes, and understand character relationships—all of which are crucial for emotion recognition and reasoning.
Specifically, minigptv2_checkpoint.pth includes:
- The mapping layers that project visual features into visual tokens.
- The LoRA-adapted weights for LLaMA-2, which enable the language model to interpret these visual tokens meaningfully.
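If it helps to verify this, here is a minimal sketch for inspecting the checkpoint yourself. The top-level key "model" and the substrings "llama_proj" / "lora" are assumptions based on MiniGPT-v2-style naming, so adjust them to whatever keys torch.load() actually reports for your file:

```python
# Hedged sketch: see which parameter groups minigptv2_checkpoint.pth carries.
# The key "model" and the substrings "llama_proj" / "lora" are assumed
# MiniGPT-v2-style names; check the printed keys if they don't match.
import torch
from collections import Counter

ckpt = torch.load("minigptv2_checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"

groups = Counter()
for name in state_dict:
    if "llama_proj" in name:        # mapping layer: visual features -> visual tokens
        groups["projection"] += 1
    elif "lora" in name.lower():    # LoRA adapters injected into LLaMA-2
        groups["lora"] += 1
    else:
        groups["other"] += 1

print(groups)
```

In MiniGPT-style training code, such a checkpoint is typically overlaid on the freshly built model with `model.load_state_dict(state_dict, strict=False)`, so only the parameters present in the file (the projection and LoRA weights) are replaced while the base weights stay untouched.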
If you remove minigptv2_checkpoint.pth, your LLaMA-2 model effectively becomes a pure text-only model—it has no ability to understand or process visual inputs. This is why your outputs become repetitive or incoherent: the model simply has no grounding in the visual modality.
If you want to train from scratch without this checkpoint, you'll need to pretrain the vision-language alignment module yourself, which can be very resource-intensive.
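For a rough idea of what that stage involves, here is a sketch of BLIP-2/MiniGPT-style alignment pretraining: freeze the vision encoder and the LLM, and train only the projection layer on paired vision-text data. The attribute name (llama_proj) and the batch format are placeholders, not the actual Emotion-LLaMA API:

```python
# Hedged sketch of stage-1 vision-language alignment pretraining.
# Assumes "model" exposes a trainable projection layer at model.llama_proj
# and returns a dict with a captioning/instruction loss from forward();
# these names are illustrative, not the real Emotion-LLaMA interface.
import torch

def pretrain_alignment(model, dataloader, epochs=1, lr=1e-4, device="cuda"):
    model.to(device)

    # Freeze everything, then unfreeze only the projection (mapping) layer.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.llama_proj.parameters():
        p.requires_grad = True

    optimizer = torch.optim.AdamW(model.llama_proj.parameters(), lr=lr)

    for _ in range(epochs):
        for batch in dataloader:          # e.g. {"image": ..., "text_input": ...}
            loss = model(batch)["loss"]   # loss on paired vision-text data
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

Doing this well usually requires large vision-text (and audio-text) corpora and significant GPU time, which is exactly why reusing minigptv2_checkpoint.pth is the practical choice.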
That said, you're absolutely free to swap out the base model. In fact, newer and stronger multimodal models such as LLaVA-1.5/1.6, Llama 3.2 Vision, Qwen2.5-VL, or Qwen2.5-Omni may offer better performance and more flexibility depending on your use case.
Let me know if you’d like help adapting Emotion-LLaMA to one of these newer backbones!