
Two questions about the generation of video captions

Open XuecWu opened this issue 10 months ago • 3 comments

Hi, thank you for your great work! I have two questions about video caption generation in the provided tools/caption folder.

  1. The authors state that they deploy LLaVA-1.6-Yi-34B. I would like to know where the Yi model comes into play; what confuses me is that the Yi model is the work of 01-ai.

  2. Due to hardware limitations, if the 34B model is not used, will there be an obvious degradation in caption quality?

Looking forward to your reply. Best wishes,

XuecWu avatar Apr 09 '24 14:04 XuecWu

  1. Yi-34B is 01-ai's work: https://huggingface.co/01-ai/Yi-34B. LLaVA fine-tuned their model based on it.
  2. Recently we found that LLaVA 7B can achieve results comparable to Yi-34B. The reason we thought it was worse is that the 7B model cannot follow complex instructions, and our prompt was too complex.
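One way to work around a small model failing on a detailed prompt is to fall back to a simpler prompt when the output comes back empty. This is a minimal sketch of that idea; `generate_caption` is a stand-in stub (not the repo's actual API), and both prompt strings are illustrative:

```python
# Fallback strategy: try the detailed prompt first, then a simpler one
# if the model returns an empty caption.
# `generate_caption` is a stub standing in for the real captioning call.

DETAILED_PROMPT = (
    "Describe the video in detail: subjects, actions, "
    "camera motion, lighting, and style."
)
SIMPLE_PROMPT = "Describe this video in one sentence."

def generate_caption(prompt):
    # Stub: mimics a small model that only handles the simple prompt.
    return "A dog runs across a grassy field." if prompt == SIMPLE_PROMPT else ""

def caption_with_fallback():
    caption = generate_caption(DETAILED_PROMPT)
    if not caption.strip():  # empty output, as can happen with a 7B model
        caption = generate_caption(SIMPLE_PROMPT)
    return caption

print(caption_with_fallback())  # → A dog runs across a grassy field.
```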

zhengzangw avatar Apr 14 '24 16:04 zhengzangw

Using llava-v1.6-7b, the output is mostly empty. Does this also happen with LLaVA-1.6-Yi-34B? Example: video caption: ['']
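Empty captions like `['']` can at least be detected and dropped before the data is used downstream. A minimal sketch (the helper name is illustrative, not from the repo):

```python
def filter_empty_captions(captions):
    """Drop captions that are empty or whitespace-only, e.g. ['']."""
    return [c for c in captions if c and c.strip()]

print(filter_empty_captions(["", "A dog runs across a grassy field.", "  "]))
# → ['A dog runs across a grassy field.']
```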

sunyclj avatar Apr 15 '24 08:04 sunyclj

Thank you for your reply!

XuecWu avatar Apr 16 '24 02:04 XuecWu