Question about M4-Instruct datasets
Thank you for the kind release!
But when I looked at the annotations of M4-Instruct, the FIRST sample quite confused me. Here is a snapshot:
The human and GPT values appear to be swapped. The "human" value should come first, giving an instruction with multiple images, but in this sample the instruction is given by GPT and the answer, together with the images, is given by the human.
Looking forward to your reply.
All the samples seem to have the same problem when the data source is "twitter_post".
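For reference, here is a small sketch that flags every affected record. It assumes the LLaVA-style annotation schema (each sample has a `conversations` list of `{"from": ..., "value": ...}` turns, with images marked by an `<image>` token); the field names and the inline records below are illustrative, not the real annotations:

```python
def flag_swapped_roles(samples):
    """Return ids of samples whose first conversation turn is not from 'human'.

    Assumes the LLaVA-style schema: each sample is a dict with an 'id' and a
    'conversations' list of {'from': 'human'|'gpt', 'value': str} turns.
    """
    bad = []
    for s in samples:
        conv = s.get("conversations", [])
        if conv and conv[0].get("from") != "human":
            bad.append(s.get("id"))
    return bad

# Tiny hypothetical records for illustration (not real M4-Instruct data):
samples = [
    {"id": "ok", "conversations": [
        {"from": "human", "value": "<image>\n<image>\nDescribe the images."},
        {"from": "gpt", "value": "Two photos of a cat."}]},
    {"id": "swapped", "conversations": [
        {"from": "gpt", "value": "Describe the images."},
        {"from": "human", "value": "<image>\n<image>\nTwo photos of a cat."}]},
]
print(flag_swapped_roles(samples))  # -> ['swapped']
```

Running this over the released json (after `json.load`) should list exactly the "twitter_post" samples if the role swap is the only problem.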
Thanks for pointing out this problem. The `<image>` prompt should be used in the value of gpt. We will fix it soon.
Hi, thank you for the release. I observed a few more discrepancies between the json and the data statistics in the blog:
- There are only 7k (HQ-Edit) and 6.7k (MagicBrush) samples in the json, compared to 50k and 14.2k in the blog.
- Multi-image caption data was not found in either the json or the zips.
- Apart from the 25k samples from ScanQA, there are 50k samples from ScanNet in the json, but nothing from 3D-LLM.
- There are only 5.7k samples from Twitter post in the json, and the other newly collected data for synthetic/real-world/COCO differences is missing.
I'm not sure if I missed something here. It would be great if you could kindly check.
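In case it helps reproduce the counts above, a minimal tallying sketch. The `source` field name is an assumption; substitute whatever field (or `id` prefix) the released json actually uses to mark the data source, and the inline records are hypothetical:

```python
from collections import Counter

def count_by_source(samples, key="source"):
    """Tally samples per data source.

    The field name 'source' is an assumption about the schema; adjust it to
    the actual field in the M4-Instruct json.
    """
    return Counter(s.get(key, "unknown") for s in samples)

# Hypothetical records for illustration:
samples = [
    {"id": "0", "source": "twitter_post"},
    {"id": "1", "source": "twitter_post"},
    {"id": "2", "source": "HQ-Edit"},
]
print(count_by_source(samples))  # -> Counter({'twitter_post': 2, 'HQ-Edit': 1})
```

Comparing the resulting tallies against the blog's table should make each discrepancy concrete.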