bff

Is the deduplication scope separate or global when deduplicating multiple files?

Open • RulinShao opened this issue 7 months ago • 1 comment

Thanks for sharing this great code! It has been very useful for me!

I'm new to Rust and Bloom filters, and I have a question about the deduplication scope in your code. I saw that it runs `let bloom_filter = bloom_filter.clone();` for each input file. Does this mean each thread works on its own copy of the Bloom filter, so the filter isn't synced across threads and the deduplication scope is per file rather than global? Also, what is the best practice for multi-threaded processing when I have a very large pretraining corpus to deduplicate?
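To make the question concrete, here is a minimal sketch of the distinction I'm trying to understand. It is not the bff code: `ToyBloom` and `insert_on_worker` are hypothetical names, and the filter is a toy that uses only the standard library. As far as I understand Rust, if `bloom_filter` is an `Arc`, then `.clone()` copies only the pointer and every thread writes to the same bits, whereas cloning a plain struct would give each thread an independent copy.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Toy Bloom filter (hypothetical, not bff's type): a fixed array of atomic words.
struct ToyBloom {
    bits: Vec<AtomicU64>,
    num_hashes: u64,
}

impl ToyBloom {
    fn new(num_words: usize, num_hashes: u64) -> Self {
        ToyBloom {
            bits: (0..num_words).map(|_| AtomicU64::new(0)).collect(),
            num_hashes,
        }
    }

    /// Maps (item, seed) to a word index and a one-bit mask inside that word.
    fn bit_index(&self, item: &str, seed: u64) -> (usize, u64) {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        item.hash(&mut hasher);
        let bit = hasher.finish() % (self.bits.len() as u64 * 64);
        ((bit / 64) as usize, 1u64 << (bit % 64))
    }

    /// Sets the item's bits; returns true if all of them were already set,
    /// i.e. the item was probably seen before.
    fn insert(&self, item: &str) -> bool {
        let mut seen = true;
        for seed in 0..self.num_hashes {
            let (word, mask) = self.bit_index(item, seed);
            let prev = self.bits[word].fetch_or(mask, Ordering::Relaxed);
            seen &= (prev & mask) != 0;
        }
        seen
    }
}

/// Inserts `item` from a worker thread that holds its own Arc handle.
fn insert_on_worker(filter: &Arc<ToyBloom>, item: &'static str) -> bool {
    let filter = Arc::clone(filter); // copies the pointer, not the bit array
    thread::spawn(move || filter.insert(item)).join().unwrap()
}

fn main() {
    let shared = Arc::new(ToyBloom::new(1024, 3));
    let first = insert_on_worker(&shared, "some document");
    let second = insert_on_worker(&shared, "some document");
    println!("first worker saw it before: {first}");
    println!("second worker saw it before: {second}");

    // A separately constructed filter knows nothing about the shared one.
    let separate = ToyBloom::new(1024, 3);
    println!("separate filter saw it before: {}", separate.insert("some document"));
}
```

In this sketch the second worker sees what the first one inserted, so the scope is global across threads. One caveat I'm aware of: when two threads insert the same item at the same moment, each may find some bits unset and both may report it as new, so a filter shared this way is global but not exact under contention.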

Appreciate your reply. Thank you!!

RulinShao • Nov 26 '23 21:11