Open-Sora
Two questions about video caption generation
Hi, thank you for your great work! I have two questions about the video caption generation in the provided tools/caption folder.
- The authors state that they deploy LLaVA-1.6-Yi-34B. I would like to know where the Yi model comes into play, since what confuses me is that the Yi model is the work of 01-ai.
- Due to hardware limitations, if the 34B model is not used, will there be an obvious degradation in caption quality?
Looking forward to your reply. Best wishes,
- Yi-34B is 01-ai's work: https://huggingface.co/01-ai/Yi-34B. LLaVA finetuned their model on top of it.
- Recently we found that LLaVA 7B can achieve results comparable to Yi-34B; see the sketch below. The reason we initially thought it was worse is that the 7B model cannot follow complex instructions, and our prompt was too complex.
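For reference, here is a minimal sketch of captioning a single video frame with a 7B LLaVA checkpoint and a deliberately short prompt. It is not the repo's actual tools/caption pipeline: the checkpoint name (llava-hf/llava-v1.6-vicuna-7b-hf), the middle-frame sampling, and the prompt wording are all assumptions.

```python
# Minimal sketch (NOT the repo's tools/caption pipeline): caption the middle
# frame of a video with an assumed 7B LLaVA checkpoint and a short prompt.
import av  # pip install av; opencv or decord would work just as well
import torch
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

MODEL_ID = "llava-hf/llava-v1.6-vicuna-7b-hf"  # assumed checkpoint name

processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def middle_frame(path: str):
    """Decode the video and return its middle frame as a PIL image."""
    with av.open(path) as container:
        frames = [f.to_image() for f in container.decode(video=0)]
    return frames[len(frames) // 2]

image = middle_frame("sample.mp4")  # hypothetical input file
# Keep the instruction short: the 7B model tends to return empty strings
# when the prompt is long and complex.
prompt = "USER: <image>\nDescribe this video frame in detail. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
caption = processor.decode(output[0], skip_special_tokens=True)
print(caption.split("ASSISTANT:")[-1].strip())
```

If the 7B model still returns empty strings like video caption: [''], shortening or splitting the instruction is the first thing to try.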
Using llava-v1.6-7b, the output is mostly empty. Does this also happen when you use LLaVA-1.6-Yi-34B? Example: video caption: ['']
Thank you for your reply!