
Is the tot_counter saved twice in this code snippet?

Open haiqiang2017 opened this issue 9 months ago • 4 comments

tot_counter = Counter()
for counter in tqdm(all_counters):
    tot_counter.update(counter)

with open("/scratch/tot_image_urls_in_web_document_dataset_filtered.pickle", "wb") as f:
    pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)

command_sync_s3 = (
    "aws s3 cp /scratch/tot_image_urls_in_web_document_dataset_filtered.pickle"
    " s3://m4-datasets/webdocs/tot_image_urls_in_web_document_dataset_filtered.pickle"
)
os.system(command_sync_s3)
os.system(command_sync_s3)
os.system(command_sync_s3)

tot_image_urls_in_web_document_dataset_filtered_too_duplicated = [
    k for k, v in tot_counter.items() if v > THRESHOLD_TOO_DUPLICATED
]

with open("/scratch/tot_image_urls_in_web_document_dataset_filtered_too_duplicated.pickle", "wb") as f:
    pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)
   
   
Is the tot_counter saved twice in this code snippet? Also, tot_image_urls_in_web_document_dataset_filtered_too_duplicated is computed but never used.

haiqiang2017 avatar May 07 '24 11:05 haiqiang2017

From which file did you get this?

HugoLaurencon avatar May 10 '24 11:05 HugoLaurencon

@HugoLaurencon The code is from here: [OBELICS]main/build_obelics/06_02_merge_sets_image_urls_in_webdocs.py

haiqiang2017 avatar May 15 '24 03:05 haiqiang2017

Yes, you should probably replace tot_counter with tot_image_urls_in_web_document_dataset_filtered_too_duplicated in the second pickle.dump call.

HugoLaurencon avatar May 15 '24 09:05 HugoLaurencon

Thanks, that change solves the problem. @HugoLaurencon

haiqiang2017 avatar May 20 '24 07:05 haiqiang2017
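
For anyone landing on this thread later, the fix discussed above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the repository's actual script: the function name merge_and_save, the out_dir parameter, and the default threshold value are assumptions made for the example, and the tqdm progress bar and the S3 sync are left out so that it runs locally.

```python
import pickle
from collections import Counter
from pathlib import Path

# Hypothetical default; the real THRESHOLD_TOO_DUPLICATED is defined in the original script.
THRESHOLD_TOO_DUPLICATED = 10


def merge_and_save(all_counters, out_dir, threshold=THRESHOLD_TOO_DUPLICATED):
    """Merge per-shard URL counters and save both the totals and the over-threshold URLs."""
    # Sum the per-shard counts into one Counter.
    tot_counter = Counter()
    for counter in all_counters:
        tot_counter.update(counter)

    out_dir = Path(out_dir)

    # First file: the full merged counter.
    counter_path = out_dir / "tot_image_urls_in_web_document_dataset_filtered.pickle"
    with open(counter_path, "wb") as f:
        pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)

    # URLs that appear more often than the threshold.
    too_duplicated = [k for k, v in tot_counter.items() if v > threshold]

    # Second file: dump the filtered list here, not tot_counter again.
    too_duplicated_path = (
        out_dir / "tot_image_urls_in_web_document_dataset_filtered_too_duplicated.pickle"
    )
    with open(too_duplicated_path, "wb") as f:
        pickle.dump(too_duplicated, f, pickle.HIGHEST_PROTOCOL)

    return counter_path, too_duplicated_path
```

The only behavioral change relative to the snippet in the first post is the object passed to the second pickle.dump call, so the second file now holds the list of too-duplicated URLs instead of a second copy of the counter.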