OBELICS
Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
Hi @HugoLaurencon! Thanks for providing the code, it's really helpful! I am trying to reproduce the pipeline, but I am having trouble finding the file referenced at this [line](https://github.com/huggingface/OBELICS/blob/main/build_obelics/08_01_prepare_urldedup.py#L18C1-L18C113): ``` PATH_WEB_DOCS_S3...
```python
tot_counter = Counter()
for counter in tqdm(all_counters):
    tot_counter.update(counter)

with open("/scratch/tot_image_urls_in_web_document_dataset_filtered.pickle", "wb") as f:
    pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)

command_sync_s3 = (
    "aws s3 cp /scratch/tot_image_urls_in_web_document_dataset_filtered.pickle"
    " s3://m4-datasets/webdocs/tot_image_urls_in_web_document_dataset_filtered.pickle"
)
os.system(command_sync_s3)
os.system(command_sync_s3)
os.system(command_sync_s3)

tot_image_urls_in_web_document_dataset_filtered_too_duplicated...
```
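For context, the snippet above merges per-shard URL counters into one global `Counter` and pickles the result. A minimal self-contained sketch of that pattern (the shard counters and file path here are illustrative stand-ins, not the repo's actual data):

```python
from collections import Counter
import os
import pickle
import tempfile

# Hypothetical per-shard image-URL counters (stand-ins for `all_counters`)
shard_counters = [
    Counter({"http://a/img1.jpg": 2, "http://b/img2.jpg": 1}),
    Counter({"http://a/img1.jpg": 3}),
]

# Merge shard counters into one global counter; Counter.update adds counts
tot_counter = Counter()
for counter in shard_counters:
    tot_counter.update(counter)

# Persist with the highest pickle protocol, then reload to verify the round-trip
path = os.path.join(tempfile.gettempdir(), "tot_image_urls_sketch.pickle")
with open(path, "wb") as f:
    pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)
with open(path, "rb") as f:
    reloaded = pickle.load(f)

print(reloaded["http://a/img1.jpg"])  # 5 (2 + 3 across shards)
```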
Hey, thanks for the great work -- do you plan to release your trained LDA model for the analysis in sec 4.2? Thanks!
Hi, can you share the `TextMediaPairsExtractor` that you are referring to in obelics/visualization/global_visualization.py?
Hello all, thanks a lot for releasing this dataset. I was wondering whether you are planning to release any form of "search engine" over your dataset, something similar...
Hi there, thank you very much for this awesome project! I wonder whether you are going to release the model that is trained on this dataset in the near future....
Thanks for your work again! In the paper the topic modeling of OBELICS is implemented using LDA, and I am wondering which specific LDA model was used, what...