LLaVA-NeXT
                        Request for NExTQA Dataset Evaluation Prompt and More Results on Challenging Datasets for Fair Comparison
To my knowledge, the videos in the NExTQA dataset are relatively short, with an average length of 44 seconds, and a static bias has been noted [1] in the ActivityNet QA dataset. Could you present further results on more demanding datasets, such as EgoSchema [2], for a fairer comparison? Additionally, could you supply the evaluation prompt used for the NExTQA dataset?
[1] Lei, Jie et al. "Revealing Single Frame Bias for Video-and-Language Learning." arXiv abs/2206.03428 (2022).
[2] Mangalam, Karttikeya et al. "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding." arXiv abs/2308.09126 (2023).
Thanks for your advice. The evaluation on EgoSchema is ongoing.
The prompt for NExTQA is: "Answer the question using several words or phrase."
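For illustration, the prompt above is typically appended to each question before querying the model. The sketch below shows one minimal way this could be done; the function and variable names are assumptions for illustration, not the repository's actual API.

```python
# Hypothetical sketch of appending the NExTQA evaluation prompt to a question.
# The prompt string is the one quoted above; everything else is an assumption.

EVAL_PROMPT = "Answer the question using several words or phrase."

def build_nextqa_query(question: str) -> str:
    """Combine a NExTQA question with the evaluation prompt on a new line."""
    return f"{question.strip()}\n{EVAL_PROMPT}"

print(build_nextqa_query("What is the man doing?"))
# What is the man doing?
# Answer the question using several words or phrase.
```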