LLaVA
[Question] Question about the pretraining data
Question
I saw that you provided two pre-training datasets, CC-3M Concept-balanced 595K and LAION/CC/SBU BLIP-Caption Concept-balanced 558K. What is the difference between them, and which one are you using? In addition, LAION/CC/SBU BLIP-Caption Concept-balanced 558K does not come with its corresponding metadata; could you provide it, please?
Hi @guozhiyao, thank you for the reminder. I have uploaded the metadata for LCS-558K here.
Regarding the difference, LAION/CC/SBU is a much larger dataset than CC-3M, and has a wider concept coverage. For example, CC-3M has intentionally filtered out the names of celebrities.
By performing concept-balanced filtering, we are able to keep the pretraining dataset at a similar size while covering a much wider range of concepts during the pretraining stage.
Thanks.
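The idea of concept-balanced filtering described above can be sketched as a greedy subsampling pass: keep a caption only if it still contributes an under-represented concept, so frequent concepts are capped while rare ones survive. This is an illustrative sketch under assumed simplifications (each lowercase word stands in for a "concept"; `max_per_concept` is a hypothetical parameter), not the authors' actual pipeline:

```python
import re
from collections import Counter

def concept_balanced_filter(captions, max_per_concept=2):
    """Greedily subsample captions so no 'concept' exceeds a frequency cap.

    Illustrative only: real concept extraction would use noun phrases or a
    concept vocabulary, not bare word tokens.
    """
    counts = Counter()
    kept = []
    for cap in captions:
        concepts = set(re.findall(r"[a-z]+", cap.lower()))
        # Keep the caption if any of its concepts is still under the cap.
        if any(counts[c] < max_per_concept for c in concepts):
            kept.append(cap)
            counts.update(concepts)
    return kept
```

Run on a toy list, repeated captions drop out once their concepts hit the cap, while captions carrying new concepts are retained.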
Hi @haotian-liu . Could you please provide the images.zip of LAION/CC/SBU BLIP-Caption Concept-balanced 558K like CC-3M Concept-balanced 595K?
Same question. Could you please provide the images.zip of LAION/CC/SBU BLIP-Caption Concept-balanced 558K, like CC-3M Concept-balanced 595K?
Hi @guozhiyao @cnxupupup
Hi, thank you for your interest in our work. We have uploaded the images here.
Important notice: Upon request from the community, and because ~15% of the images in the original LAION/CC/SBU dataset are no longer accessible, we have uploaded images.zip to make it easier to reproduce our work in the research community. It should not be used for any other purpose. The use of these images must comply with the LAION/CC/SBU license. This may be taken down when requested by the original LAION/CC/SBU dataset owner or the owners of the referenced images.
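Since some source images have gone missing, anyone cross-referencing the downloaded images.zip against the LCS-558K metadata may want a quick coverage check. A minimal sketch, assuming the metadata is a JSON list whose entries carry an `"image"` field with a relative file path (the field name is an assumption about the schema):

```python
import json
from pathlib import Path

def check_coverage(metadata_path, images_dir):
    """List metadata entries whose image file is missing from images_dir.

    Assumes metadata is a JSON array of objects with an "image" field
    holding a path relative to images_dir.
    """
    entries = json.loads(Path(metadata_path).read_text())
    images_dir = Path(images_dir)
    missing = [e["image"] for e in entries
               if not (images_dir / e["image"]).exists()]
    print(f"{len(entries) - len(missing)}/{len(entries)} images present")
    return missing
```

Returning the missing list (rather than just printing a count) makes it easy to re-download or exclude those entries before pretraining.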
Hi @haotian-liu. Can you disclose the specific filtering strategy used for the pretraining dataset?
@haotian-liu Please confirm whether these datasets are available for commercial use.