
Doubts about the MARIO-LAION dataset

Open scutyuanzhi opened this issue 1 year ago • 3 comments

Model I am using: TextDiffuser. I found that some index numbers in the MARIO-LAION dataset start with "50001", but I could not find the corresponding subfolder in the meta information (40G) file.

scutyuanzhi, Jul 26 '23 06:07

Same question here. Also, the data preparation instructions say:

Please follow mario-laion-index-url.txt to move each image to the corresponding folders.

However, mario-laion-index-url.txt contains index-URL pairs, while the images downloaded with img2dataset only record the URL in their JSON files. Are we supposed to match images to indexes by URL alone?
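If matching by URL is indeed the intended approach, a minimal sketch might look like the following. It is not the authors' method. It assumes each line of mario-laion-index-url.txt has the form `<index> <url>` separated by whitespace, that img2dataset was run with the `files` output format so every image has a JSON sidecar containing a `url` field, and that images were saved as `.jpg`. The helper names `load_url_to_index` and `match_downloads` are hypothetical.

```python
import json
from pathlib import Path


def load_url_to_index(index_file):
    """Build a url -> index dict.

    Assumption: each line looks like '<index> <url>' (whitespace separated).
    """
    url_to_index = {}
    with open(index_file, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split(maxsplit=1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            index, url = parts
            url_to_index[url.strip()] = index
    return url_to_index


def match_downloads(download_dir, url_to_index):
    """Return an index -> image path dict using the per-image JSON sidecars.

    Assumption: every image has a sibling .json file with a 'url' field,
    and the image itself is stored with a .jpg extension.
    """
    index_to_image = {}
    for meta_path in Path(download_dir).rglob("*.json"):
        try:
            with open(meta_path, encoding="utf-8") as f:
                meta = json.load(f)
        except (json.JSONDecodeError, OSError):
            continue  # unreadable sidecar, skip it
        if not isinstance(meta, dict):
            continue
        # Files without a 'url' field (e.g. shard statistics) fall through here.
        index = url_to_index.get(meta.get("url"))
        if index is None:
            continue
        image_path = meta_path.with_suffix(".jpg")
        if image_path.exists():
            index_to_image[index] = image_path
    return index_to_image
```

One caveat with this approach: if the same URL appears under more than one index, a plain dict keeps only the last index seen, so duplicates would need separate handling.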

jwh97nn, Jul 26 '23 09:07

Hello, there is another problem. I downloaded laion-ocr.zip earlier, but it now seems to have been replaced by laion-ocr-new.zip, and the file size is slightly different. What exactly changed in the update, and does it have any significant impact?

scutyuanzhi, Jul 27 '23 03:07

I have the same question. What causes the inconsistent numbering for the indexes starting with 50001? Is there any way to fix it? I hope the author can provide a clear reply.

rardz, Feb 03 '24 16:02