LLaMA-VID
LLaMA-VID copied to clipboard
About the json in stage2 and stage3
Why does the data in stage2 and 3 contains pure text Q&A without images or videos?
According to DeepSeek-VL,
Maintaining a significant proportion of language data—specifically, at least 70%—is essential to preserve the integrity of language knowledge within the model.