bff
Is the deduplication scope separate or global when deduplicating multiple files?
Thanks for sharing this great code! It has been very useful for me.
I'm new to Rust and Bloom filters, and I have a question about the deduplication scope in your code. I saw that it runs `let bloom_filter = bloom_filter.clone();` for each input file. Does this mean the Bloom filter is not shared across threads, i.e., the deduplication scope is per file rather than global? I would also like to know the best practice for multi-threaded processing when I have a very large pretraining corpus.
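To make the question concrete, here is a minimal sketch of the behavior I would expect if `bloom_filter` were wrapped in an `Arc`. I have not confirmed that this is how bff does it. In that case `.clone()` would copy only the pointer, and every thread would update the same underlying bits, so the scope would be global. `ToyFilter` and `count_kept_lines` are hypothetical names I made up for illustration; they are not part of bff, and the toy uses a single hash function rather than a real Bloom filter.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Toy stand-in for a Bloom filter (hypothetical, not bff's type):
/// a single hash function over a fixed array of atomic 64-bit words.
struct ToyFilter {
    words: Vec<AtomicU64>,
}

impl ToyFilter {
    fn new(num_bits: usize) -> Self {
        let num_words = (num_bits + 63) / 64;
        ToyFilter {
            words: (0..num_words).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Sets the bit for `item` and returns true if it was previously unset.
    /// `fetch_or` is atomic, so only one thread can observe the bit as unset.
    fn insert_if_new(&self, item: &str) -> bool {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let bit = (hasher.finish() as usize) % (self.words.len() * 64);
        let mask = 1u64 << (bit % 64);
        let prev = self.words[bit / 64].fetch_or(mask, Ordering::Relaxed);
        prev & mask == 0
    }
}

/// Processes each "file" (a list of lines) on its own thread and returns
/// how many lines were kept in total across all files.
fn count_kept_lines(files: Vec<Vec<String>>) -> usize {
    let filter = Arc::new(ToyFilter::new(1 << 20));
    let handles: Vec<_> = files
        .into_iter()
        .map(|lines| {
            // Clones the Arc (a pointer), not the filter itself,
            // so all threads write to the same bit array.
            let filter = Arc::clone(&filter);
            thread::spawn(move || {
                lines
                    .iter()
                    .filter(|line| filter.insert_if_new(line.as_str()))
                    .count()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

With this sketch, passing three files that each contain the same single line to `count_kept_lines` keeps that line only once. If the clone were a deep copy of the filter instead, each file would keep its own copy of the line.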
Appreciate your reply. Thank you!!