Sample 600k pages?

Open smurf-1119 opened this issue 1 year ago • 2 comments

While reading preprocess_data.py, I was confused by the following line: https://github.com/minjiyoon/MMGL/blob/21f97f713472c9e9e31c83f3627b15212f35fe48/wikiweb2m/preprocess_data.py#L191

I think you want to sample 600k pages, so shouldn't line 191 use break instead of continue?

smurf-1119 • Feb 16 '24 15:02

Oh, you are right! I forgot to change it. I used it to explore the remaining part of the dataset.

minjiyoon • Feb 18 '24 18:02
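
For readers following the thread, the difference between the two statements can be sketched as below. This is only an illustration under assumptions: the function names, the `limit` parameter, and the loop shape are hypothetical and are not taken from preprocess_data.py, whose actual loop may look different.

```python
def sample_with_continue(pages, limit):
    """Hypothetical loop: keeps at most `limit` pages but still walks the whole input."""
    kept, visited = [], 0
    for page in pages:
        visited += 1
        if len(kept) >= limit:
            continue  # skip this page, but keep iterating over the rest
        kept.append(page)
    return kept, visited


def sample_with_break(pages, limit):
    """Hypothetical loop: keeps at most `limit` pages and stops as soon as the limit is hit."""
    kept, visited = [], 0
    for page in pages:
        visited += 1
        if len(kept) >= limit:
            break  # stop iterating entirely
        kept.append(page)
    return kept, visited


if __name__ == "__main__":
    # Toy input standing in for the dataset; 3 stands in for the 600k limit.
    print(sample_with_continue(range(10), 3))
    print(sample_with_break(range(10), 3))
```

In this sketch both variants keep the same pages. The continue variant still visits every remaining page, which fits the maintainer's note about exploring the rest of the dataset, while the break variant stops right after the limit is reached.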

I found that some of the image URLs have become invalid. Could you please provide the complete image dataset?

smurf-1119 • Mar 04 '24 07:03