
about MMInstruct data

Open ChangGiMoon opened this issue 6 months ago • 5 comments

Thank you for sharing a useful dataset. Your paper reports 973K instruction data points, but the dataset published on Hugging Face (https://huggingface.co/datasets/yuecao0119/MMInstruct-GPT4V) seems to differ. I counted the `id` fields in all jsonl files in the json_all and jsons_per_domain folders and got 756372 — I wonder if I miscounted. Additionally, when using the MMInstruct dataset for instruction tuning, can I simply merge the information from all jsonl files in the json_all and jsons_per_domain folders into one file and use it? And can I use the corresponding image files by unzipping the images.zip file? Please reply.
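For reference, a minimal sketch of the count described above (the folder names come from this thread; the `id` key and the one-JSON-object-per-line layout are assumptions about the schema):

```python
import json
from pathlib import Path

def count_ids(root: str, folders=("json_all", "jsons_per_domain")) -> int:
    """Count records carrying an `id` field across all .jsonl files
    in the given folders. Adjust the key name if the schema differs."""
    total = 0
    for folder in folders:
        for path in Path(root, folder).glob("*.jsonl"):
            with path.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line and "id" in json.loads(line):
                        total += 1
    return total
```

Note this counts records, not unique ids; if the same id can appear in both folders, collecting the ids into a `set` first would answer the "did I miscount" question more precisely.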

ChangGiMoon avatar Jun 11 '25 05:06 ChangGiMoon

  1. Our instruction dataset also incorporates some other open-source datasets for expansion; please refer to:

     | Domain               | Dataset                                                           |
     | -------------------- | ----------------------------------------------------------------- |
     | mathematics datasets | GEOS; UniGeo; GeoQA+; Geometry3k; CLEVR-Math; Super-CLEVR; TabMWP |
     | charts and plots     | DVQA (100K); FigureQA                                             |
     | scientific figure    | TQA                                                               |
     | map chart            | MapQA                                                             |

  2. No need to merge; the data in the two folders are the same.

yuecao0119 avatar Jun 11 '25 06:06 yuecao0119

@yuecao0119 Thanks for your quick reply.

  1. Does the json_all folder also contain the expanded instruction dataset?
  2. And I wonder why the caption_en.jsonl file also contains Chinese.
  3. If I want to build an English-only model, should I translate all the Chinese in the json_all folder into English?

ChangGiMoon avatar Jun 13 '25 02:06 ChangGiMoon

  1. To avoid secondary distribution, json_all.jsonl does not include these datasets.
  2. To show that our data pipeline is compatible with multiple languages, we built two datasets, one in Chinese and one in English.
  3. The large model itself has language generalization capabilities, so in theory translation is not necessary. However, translating can also be done and may yield better results.

yuecao0119 avatar Jun 13 '25 08:06 yuecao0119

@yuecao0119 Thanks for your quick reply.

  1. If I use MMInstruct for instruction tuning of an MLLM, do I have to include the expanded instruction datasets alongside MMInstruct?
  2. Are there any duplicates between the caption_cn.jsonl and caption_en.jsonl files? If the same images appear in both, but with captions in different languages, then translating the Chinese to English could introduce duplicates.

ChangGiMoon avatar Jun 17 '25 02:06 ChangGiMoon

Sorry for the late response.

For Q1, we also tried training exclusively on our dataset, without the augmented datasets. The results showed a slight drop in performance on some benchmarks, such as medical images in MME, because our dataset doesn't cover those domains — the augmented datasets mainly serve to enrich domain coverage. For general benchmarks spanning all domains, however, the performance impact was minimal.

For Q2, you're correct; for that reason, our data does not generate repeated captions for the same images.

yuecao0119 avatar Aug 08 '25 11:08 yuecao0119