OBELICS
Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
Hi @HugoLaurencon! Thanks for providing the code, it's really helpful! I am trying to reproduce the pipeline, but I am having trouble finding the file referenced at this [line](https://github.com/huggingface/OBELICS/blob/main/build_obelics/08_01_prepare_urldedup.py#L18C1-L18C113): ``` PATH_WEB_DOCS_S3...
```python
tot_counter = Counter()
for counter in tqdm(all_counters):
    tot_counter.update(counter)

with open("/scratch/tot_image_urls_in_web_document_dataset_filtered.pickle", "wb") as f:
    pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)

command_sync_s3 = (
    "aws s3 cp /scratch/tot_image_urls_in_web_document_dataset_filtered.pickle"
    " s3://m4-datasets/webdocs/tot_image_urls_in_web_document_dataset_filtered.pickle"
)
os.system(command_sync_s3)
os.system(command_sync_s3)
os.system(command_sync_s3)

tot_image_urls_in_web_document_dataset_filtered_too_duplicated...
```
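For context, the snippet above merges per-shard URL counters into one global `Counter` and pickles the result. A minimal self-contained sketch of that pattern (the shard counters and file path here are illustrative stand-ins, not the repo's actual data):

```python
from collections import Counter
import os
import pickle
import tempfile

# Hypothetical per-shard image-URL counters (stand-ins for `all_counters`)
shard_counters = [
    Counter({"http://a/img1.jpg": 2, "http://b/img2.jpg": 1}),
    Counter({"http://a/img1.jpg": 3}),
]

# Merge shard counters into one global counter; Counter.update adds counts
tot_counter = Counter()
for counter in shard_counters:
    tot_counter.update(counter)

# Persist with the highest pickle protocol, then reload to verify the round-trip
path = os.path.join(tempfile.gettempdir(), "tot_image_urls_sketch.pickle")
with open(path, "wb") as f:
    pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)
with open(path, "rb") as f:
    reloaded = pickle.load(f)

print(reloaded["http://a/img1.jpg"])  # 5 (2 + 3 across shards)
```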
Hey, thanks for the great work -- do you plan to release your trained LDA model for the analysis in sec 4.2? Thanks!
Hi, can you share the `TextMediaPairsExtractor` that you are referring to in obelics/visualization/global_visualization.py?
Hello all, thanks a lot for releasing this dataset. I was wondering whether you are planning to release any form of "search engine" over your dataset, something similar...
Hi there, thank you very much for this awesome project! I wonder whether you are going to release the model that is trained on this dataset in the near future....
Thanks for your work again! In the paper the topic modeling of OBELICS is implemented using LDA, and I am wondering which specific LDA model was used, what...