MiniGPT-4

How to calculate the number of samples in cc_sbu and laion respectively?

Open Richar-Du opened this issue 1 year ago • 2 comments

I downloaded the cc_sbu dataset and counted the samples. The total number is 12M, of which more than 6M downloaded successfully. That seems impossible, since cc_sbu + laion is only 5M according to your paper. Since WebDataset is an iterable-style dataset, len is not implemented. How can I calculate the number of samples in the downloaded cc_sbu and laion datasets?

Richar-Du, May 01 '23 02:05
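
One way to answer the counting question without relying on len is to read the downloaded tar shards directly. The following is a minimal sketch that uses only the Python standard library, not the MiniGPT-4 or webdataset code. It assumes the shards follow the usual WebDataset naming convention, in which all files sharing the same name up to the first dot (for example 00000.jpg, 00000.txt, 00000.json) belong to one sample. The function names count_samples_in_shard and count_samples are hypothetical.

```python
import os
import tarfile
from typing import Iterable


def count_samples_in_shard(shard_path: str) -> int:
    """Count the distinct sample keys in a single tar shard.

    Assumption: WebDataset-style naming, where the sample key is the
    member path up to the first dot of the file name.
    """
    keys = set()
    with tarfile.open(shard_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            directory, filename = os.path.split(member.name)
            stem = filename.split(".", 1)[0]
            keys.add(os.path.join(directory, stem))
    return len(keys)


def count_samples(shard_paths: Iterable[str]) -> int:
    """Sum the per-shard sample counts over all given shards."""
    return sum(count_samples_in_shard(path) for path in shard_paths)
```

Passing the list of downloaded shard paths for cc_sbu and for laion separately to count_samples would give one count per dataset. This reads every tar header once, so it can take a while on large datasets.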

Hello! The whole dataset is large, but we only use a small part of it. In our stage 1 training setup, we use 4 A100 80G GPUs, each with a batch size of 64, so the total batch size is 256. We train the model in the first stage for 20k steps, so the total data consumed in the first stage is 20k * 256 = 5.12M samples.

TsuTikgiau, May 02 '23 08:05
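
The arithmetic in the reply above can be checked with a few lines. This is only a sketch of the calculation; the numbers come from the reply, while the variable names are illustrative and not taken from the MiniGPT-4 configuration.

```python
# Values stated in the reply for stage 1 training.
num_gpus = 4              # 4 x A100 80G
per_gpu_batch_size = 64   # batch size on each GPU
train_steps = 20_000      # 20k steps

# Total batch size across all GPUs.
global_batch_size = num_gpus * per_gpu_batch_size

# Number of (image, caption) pairs drawn over the whole stage.
samples_consumed = train_steps * global_batch_size

print(global_batch_size)   # 256
print(samples_consumed)    # 5120000
```

Note that this is the number of samples drawn during training. Whether every drawn sample is distinct depends on the data loader and is not established in this thread.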

Thanks for your reply! So the first stage randomly samples (image, caption) pairs from the cc_sbu and laion datasets, and the total number is determined by the number of training steps and the batch size. Is it also true that WebDataset.Pipeline guarantees the sampled data are not repeated?

Richar-Du, May 06 '23 07:05