open_flamingo
About the random selection during pre-training
Hi, I have noticed that for each sample (i.e. document) used in pre-training, the code first reads all of the images within the sample and then caps the number of images at the configured maximum. As a result, only the first several images are ever used and the rest are never seen. Is this a desirable property?
No, you're right, this isn't a perfect way of doing it. Ideally you should create multiple samples from a document if it is too long. Do you plan on adding this in a PR? :)
@anas-awadalla I could help out here if it's still relevant, though I'm not sure I understand the issue properly.
Is the suggestion to only read images such that there are up to `max_num_images` `valid_images`? Or that we should create extra samples with the surplus images? If the latter, how should we deal with the accompanying text in the new samples?
Hello! So currently what is going on is that we are "read[ing] images such that there are up to `max_num_images` `valid_images`". I think a better way to go about this would be to accumulate images/text until you surpass the `max_num_images` limit and then create a separate sample with the remaining images/text.
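Roughly like this sketch, where the function name and the parallel `images`/`texts` layout are just illustrative assumptions rather than the actual preprocessing code:

```python
def chunk_interleaved_sample(images, texts, max_num_images):
    """Split one interleaved document into several training samples,
    each with at most `max_num_images` images, instead of silently
    dropping everything past the limit.

    Assumes `images` and `texts` are parallel lists where `texts[i]`
    accompanies `images[i]`; the real sample layout differs.
    """
    samples = []
    for start in range(0, len(images), max_num_images):
        end = start + max_num_images
        samples.append({"images": images[start:end], "texts": texts[start:end]})
    return samples
```

With this, a document with 12 images and `max_num_images=5` would become three samples (5 + 5 + 2) instead of one truncated sample.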
Got it, thanks. Not sure how many images the samples usually contain. This is effectively chunking/yielding a sample by `max_num_images`, and it looks like we can put a new method (similar to `get_patches` in the example link) directly into the pipeline to support expanding samples.
[Update]: Or maybe we can just make `preprocess_interleaved` yield samples.
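For example, as a rough sketch (the `(image, text)` pair layout for a document is an assumption here, not the actual sample format):

```python
from itertools import chain, islice

def preprocess_interleaved(doc, max_num_images):
    """Sketch of a generator variant: yield one sample per chunk of at
    most `max_num_images` images instead of truncating the document.
    `doc` is assumed to be an iterable of (image, text) pairs."""
    it = iter(doc)
    while True:
        chunk = list(islice(it, max_num_images))
        if not chunk:
            break
        images, texts = zip(*chunk)
        yield {"images": list(images), "texts": list(texts)}

# The pipeline stage that consumes this would then flatten the generators, e.g.:
# samples = chain.from_iterable(preprocess_interleaved(d, 5) for d in docs)
```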
Looking through the code, `preprocess_gpt_interleaved` and `preprocess_interleaved` share quite a bit of code as well, so I could clean that up too in the same PR.
Sweet! That would be awesome
Got a few questions in the PR before I make more changes 🙏