MMInstruct icon indicating copy to clipboard operation
MMInstruct copied to clipboard

How to use your data generation pipeline?

Open waltonfuture opened this issue 1 year ago • 1 comments

Thanks for your good work! Can you provide some guidance on how to use your data generation pipeline?

waltonfuture avatar Nov 07 '24 10:11 waltonfuture

Thank you for your attention to our work.

As described in our paper, we mainly proposed a semi-automatic and low-cost instruction generation data engine using GPT-4V, GPT-3.5 and manual correction.

Our data engine consists of six steps: (a) image collection, (b) image caption generation, (c) seed question collection, (d) automatic instruction generation, (e) dataset expansion and (f) manual correction.

(a) First, we collect a large number of different images from various sources, which are mainly obtained through some selected source images, and then retrieved by crawlers and clips, etc., as shown in image_retrieval_bing_spider.py and image_retrieval_clip.py.

(b) And use GPT-4V to generate detailed image captions, as shown in gpt4v_caption.py.

(c) Then experts designed corresponding seed questions for different fields.

(d) We use image captions and seed questions to automatically generate a rich and diverse set of instruction data through GPT-3.5, as shown in gpt35_qa.py.

(e), (f) In addition, we also use various methods to expand our dataset. Finally, manual correction is performed to ensure data quality and accuracy.

yuecao0119 avatar Nov 07 '24 11:11 yuecao0119