LLaVA [Question] question about the pretrain data

Question

I saw that you provide two pre-training datasets, CC-3M Concept-balanced 595K and LAION/CC/SBU BLIP-Caption Concept-balanced 558K. What is the difference between them, and which one are you using? In addition, no metadata is provided for LAION/CC/SBU BLIP-Caption Concept-balanced 558K. Could you please provide it?

May 06 '23 07:05 guozhiyao

Hi @guozhiyao, thank you for the reminder. I have uploaded the metadata for LCS-558K here.

Regarding the difference, LAION/CC/SBU is a much larger dataset than CC-3M, and has a wider concept coverage. For example, CC-3M has intentionally filtered out the names of celebrities.

By performing concept-balanced filtering, we are able to keep the pretraining dataset at a similar size while allowing much wider concept coverage during the pretraining stage.

Thanks.

May 07 '23 02:05 haotian-liu

Hi @haotian-liu. Could you please provide the images.zip for LAION/CC/SBU BLIP-Caption Concept-balanced 558K, as you did for CC-3M Concept-balanced 595K?

May 09 '23 02:05 guozhiyao

Same question: could you please provide the images.zip for LAION/CC/SBU BLIP-Caption Concept-balanced 558K, as was done for CC-3M Concept-balanced 595K?

May 16 '23 12:05 cnxupupup

Hi @guozhiyao @cnxupupup, thank you for your interest in our work. We have uploaded the images here.

Important notice: At the community's request, and because ~15% of the images in the original LAION/CC/SBU dataset are no longer accessible, we have uploaded images.zip to help the research community reproduce our work. It should not be used for any other purpose. Use of these images must comply with the LAION/CC/SBU license. The archive may be taken down at the request of the original LAION/CC/SBU dataset owners or the owners of the referenced images.

May 25 '23 19:05 haotian-liu

Hi @haotian-liu. Could you disclose the specific filtering strategies used for the pretraining dataset?

Oct 17 '23 03:10 TyRantLQlyf

@haotian-liu Please confirm if these datasets are available for commercial use.

Jan 10 '24 04:01 ChintanShahDS
