
Two questions about the generation of video captions

Open XuecWu opened this issue 10 months ago • 3 comments

Hi, thank you for your great work! I have two questions about video caption generation in the provided tools/caption folder.

  1. The authors state that they deploy LLaVA-1.6-Yi-34B. I would like to know where the Yi model comes into play; what confuses me is that the Yi model is the work of 01-ai.

  2. Due to hardware limitations, if the 34B model is not used, will there be an obvious degradation in caption quality?

Looking forward to your reply. Best wishes,

XuecWu avatar Apr 09 '24 14:04 XuecWu

  1. Yi-34B is 01-ai's work: https://huggingface.co/01-ai/Yi-34B. LLaVA fine-tuned their model based on it.
  2. Recently we found that LLaVA 7B can achieve results comparable to Yi-34B. The reason we thought it was worse is that the 7B model cannot follow complex instructions, and our prompt was too complex.
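One way to work around a small model failing on a detailed prompt is to fall back to a simpler prompt when the output comes back empty. This is a minimal sketch of that idea; `generate_caption` is a stand-in stub (not the repo's actual API), and both prompt strings are illustrative:

```python
# Fallback strategy: try the detailed prompt first, then a simpler one
# if the model returns an empty caption.
# `generate_caption` is a stub standing in for the real captioning call.

DETAILED_PROMPT = (
    "Describe the video in detail: subjects, actions, "
    "camera motion, lighting, and style."
)
SIMPLE_PROMPT = "Describe this video in one sentence."

def generate_caption(prompt):
    # Stub: mimics a small model that only handles the simple prompt.
    return "A dog runs across a grassy field." if prompt == SIMPLE_PROMPT else ""

def caption_with_fallback():
    caption = generate_caption(DETAILED_PROMPT)
    if not caption.strip():  # empty output, as can happen with a 7B model
        caption = generate_caption(SIMPLE_PROMPT)
    return caption

print(caption_with_fallback())  # → A dog runs across a grassy field.
```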

zhengzangw avatar Apr 14 '24 16:04 zhengzangw

Using llava-v1.6-7b, the output is mostly empty. Does this also happen with LLaVA-1.6-Yi-34B? Example: video caption: ['']
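Empty captions like `['']` can at least be detected and dropped before the data is used downstream. A minimal sketch (the helper name is illustrative, not from the repo):

```python
def filter_empty_captions(captions):
    """Drop captions that are empty or whitespace-only, e.g. ['']."""
    return [c for c in captions if c and c.strip()]

print(filter_empty_captions(["", "A dog runs across a grassy field.", "  "]))
# → ['A dog runs across a grassy field.']
```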

sunyclj avatar Apr 15 '24 08:04 sunyclj

Thank you for your reply!

XuecWu avatar Apr 16 '24 02:04 XuecWu