MMGL
Sample 600k pages?
While reading your code in preprocess_data.py, I was confused by the following line:
https://github.com/minjiyoon/MMGL/blob/21f97f713472c9e9e31c83f3627b15212f35fe48/wikiweb2m/preprocess_data.py#L191
I think you want to sample 600k pages, so shouldn't line 191 be `break` instead of `continue`?
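To illustrate the difference, here is a minimal sketch. It assumes the loop at that line compares a running count against a cap; the function and variable names are hypothetical and not taken from preprocess_data.py.

```python
def sample_pages(pages, limit):
    """Collect at most `limit` pages, stopping the scan once the cap is reached.

    `pages` and `limit` are hypothetical names used only for this sketch.
    """
    sampled = []
    for page in pages:
        if len(sampled) >= limit:
            # `break` ends the loop here. With `continue`, the loop would
            # still iterate over every remaining page without keeping any.
            break
        sampled.append(page)
    return sampled
```

Both versions return the same sample; `continue` just keeps reading the rest of the dataset after the cap is hit, which costs time on a large dump.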
Oh, you are right! I forgot to change it. I used it to explore the remaining part of the dataset.
I found that some of the image URLs are no longer valid. Could you please provide the complete image dataset?