unilm
Doubts about the MARIO-LAION dataset
Describe
Model I am using: TextDiffuser
I found that some index numbers in the MARIO-LAION dataset start with "50001", but I could not find the corresponding subfolders in the meta information file (40G).
Same doubt here. Meanwhile, the data preparation instructions say:
Please follow mario-laion-index-url.txt to move each image to the corresponding folders.
However, mario-laion-index-url.txt contains index-URL pairs, while the images downloaded with img2dataset only record their URLs in the accompanying JSON files.
Are we supposed to match images with indexes using URLs only?
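If URL matching is indeed the intended route, a minimal sketch could look like the following. It rests on several assumptions that are not confirmed in this thread: that each line of mario-laion-index-url.txt has the form `<index> <url>` separated by whitespace, that img2dataset was run with its `files` output format so every image `X.jpg` has a sidecar `X.json` containing a `url` field, and that the target layout is `<out_dir>/<index>/image.jpg`. The function names and the target layout are hypothetical and should be adjusted to whatever the dataset actually expects.

```python
import json
import shutil
from pathlib import Path


def load_index_by_url(index_file):
    """Parse lines assumed to look like '<index> <url>' into a url -> index dict."""
    index_by_url = {}
    with open(index_file, encoding="utf-8") as fh:
        for line in fh:
            parts = line.strip().split(maxsplit=1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            index, url = parts
            index_by_url[url] = index
    return index_by_url


def match_downloads(download_dir, index_by_url):
    """Yield (image_path, index) for each downloaded image whose URL is indexed."""
    for meta_path in sorted(Path(download_dir).rglob("*.json")):
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except ValueError:
            continue  # not valid JSON, ignore
        if not isinstance(meta, dict):
            continue
        url = meta.get("url")  # per-shard stats files have no "url" key
        image_path = meta_path.with_suffix(".jpg")
        if url in index_by_url and image_path.exists():
            yield image_path, index_by_url[url]


def place_images(download_dir, index_file, out_dir):
    """Copy matched images to <out_dir>/<index>/image.jpg (hypothetical layout)."""
    index_by_url = load_index_by_url(index_file)
    copied = 0
    for image_path, index in match_downloads(download_dir, index_by_url):
        target_dir = Path(out_dir) / index
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(image_path, target_dir / "image.jpg")
        copied += 1
    return copied
```

Note that this keeps only one index per URL, so if the same URL appears under several indexes in the index file, only the last one is used.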
Hello, there is another problem. I downloaded laion-ocr.zip before, but it now seems to have been replaced by laion-ocr-new.zip, and the size is slightly different. What exactly changed in the update, and does it have any noticeable impact?
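Until the authors clarify, anyone who still has both archives could compare them locally. The sketch below uses only Python's standard `zipfile` module and reports which member files were added, removed, or changed (by size or CRC) between two zip files. The helper names are hypothetical, and it says nothing about why the contents differ.

```python
import zipfile


def zip_manifest(path):
    """Map each member file name to its (uncompressed size, CRC-32)."""
    with zipfile.ZipFile(path) as zf:
        return {
            info.filename: (info.file_size, info.CRC)
            for info in zf.infolist()
            if not info.is_dir()
        }


def diff_zips(old_path, new_path):
    """Return the member names added, removed, or changed between two archives."""
    old = zip_manifest(old_path)
    new = zip_manifest(new_path)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }
```

For example, `diff_zips("laion-ocr.zip", "laion-ocr-new.zip")` would return a dictionary with the three lists, which can then be inspected to see how large the update is.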
I have the same question. Why is the numbering that starts with 50001 out of order? Is there any way to fix it? I hope the authors can provide a clear reply.