LLaVA

[Question] question about the pretrain data

Open guozhiyao opened this issue 2 years ago • 7 comments

Question

I saw that you provide two pre-training datasets, CC-3M Concept-balanced 595K and LAION/CC/SBU BLIP-Caption Concept-balanced 558K. What is the difference between them, and which one are you using? In addition, no metadata is provided for LAION/CC/SBU BLIP-Caption Concept-balanced 558K. Could you provide it, please?

guozhiyao avatar May 06 '23 07:05 guozhiyao

Hi @guozhiyao, thank you for the reminder. I have uploaded the metadata for LCS-558K here.

Regarding the difference, LAION/CC/SBU is a much larger dataset than CC-3M, and has a wider concept coverage. For example, CC-3M has intentionally filtered out the names of celebrities.

By performing concept-balanced filtering, we are able to keep the pretraining dataset at a similar size while allowing much wider concept coverage during the pretraining stage.
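The exact filtering procedure is not spelled out in this thread, so the following is only an illustrative sketch of the general idea of concept-balanced filtering, not the authors' confirmed method. It assumes a "concept" is any lowercase word of three or more letters in a caption (a real pipeline would more plausibly use noun phrases), drops concepts that are too rare, and caps how many captions a frequent concept may contribute. The function name `concept_balanced_filter` and the thresholds `min_freq` and `max_per_concept` are hypothetical.

```python
import random
import re
from collections import defaultdict


def concept_balanced_filter(captions, min_freq=2, max_per_concept=2, seed=0):
    """Return sorted indices of captions kept after balancing by concept.

    Hypothetical sketch: concepts are crude word tokens, not noun phrases.
    """
    rng = random.Random(seed)

    # Map each concept to the captions that mention it (each caption counted once).
    by_concept = defaultdict(list)
    for idx, caption in enumerate(captions):
        for token in set(re.findall(r"[a-z]{3,}", caption.lower())):
            by_concept[token].append(idx)

    kept = set()
    # Iterate in sorted order so the result is reproducible for a given seed.
    for concept in sorted(by_concept):
        indices = by_concept[concept]
        if len(indices) < min_freq:
            continue  # drop concepts that are too rare
        if len(indices) > max_per_concept:
            # cap frequent concepts with a random subsample
            indices = rng.sample(indices, max_per_concept)
        kept.update(indices)
    return sorted(kept)


captions = [
    "a dog on grass",
    "a dog in snow",
    "a dog at night",
    "a cat on sofa",
    "zebra",
]
kept_indices = concept_balanced_filter(captions)
print([captions[i] for i in kept_indices])
```

In this toy input only "dog" appears often enough to survive, and it is capped at two captions, which shows how the cap keeps a frequent concept from dominating while the overall dataset size stays bounded.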

Thanks.

haotian-liu avatar May 07 '23 02:05 haotian-liu

Hi @haotian-liu. Could you please provide the images.zip for LAION/CC/SBU BLIP-Caption Concept-balanced 558K, as you did for CC-3M Concept-balanced 595K?

guozhiyao avatar May 09 '23 02:05 guozhiyao

Same question: could you please provide the images.zip for LAION/CC/SBU BLIP-Caption Concept-balanced 558K, as was done for CC-3M Concept-balanced 595K?

cnxupupup avatar May 16 '23 12:05 cnxupupup

Hi @guozhiyao @cnxupupup, thank you for your interest in our work. We have uploaded the images here.

Important notice: At the request of the community, and because ~15% of the images in the original LAION/CC/SBU dataset are no longer accessible, we have uploaded images.zip to make it easier for the research community to reproduce our work. It should not be used for any other purpose. The use of these images must comply with the LAION/CC/SBU license. The archive may be taken down if requested by the original LAION/CC/SBU dataset owner or by the owners of the referenced images.

haotian-liu avatar May 25 '23 19:05 haotian-liu

Hi @haotian-liu. Could you disclose the specific filtering strategies used for the pretraining dataset?

TyRantLQlyf avatar Oct 17 '23 03:10 TyRantLQlyf

@haotian-liu Please confirm if these datasets are available for commercial use.

ChintanShahDS avatar Jan 10 '24 04:01 ChintanShahDS