dolma
Deduplication / Decontamination
Hi,
dolma is a wonderful tool, and I'm successfully using it for many steps of my pipeline.
I can get it working for (paragraph-level) deduplication. However, when I apply it in a similar setting for decontamination, it never assigns any attributes.
What is the problem?
Compared to the "normal" paragraph deduplication, when trying to just apply an existing bloom filter, there are no dedupe attributes in the resulting attribute files. I have already experimented with the desired_false_positive_rate and overlap_threshold parameters, but without any success.
{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL1"}
{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL2"}
{"attributes":{"paragraphs_bff_duplicates":[]},"id":"URL3"}
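To rule out that only the first few records are empty, the attribute files can be scanned in bulk. This is a minimal sketch, not part of dolma: the helper names count_flagged and count_flagged_in_dir are hypothetical, and it assumes the attribute files are gzipped JSON Lines shaped like the records above.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Tuple

# attribute_name as set in the configs of this issue
ATTRIBUTE_NAME = "paragraphs_bff_duplicates"


def count_flagged(lines: Iterable[str], attribute: str = ATTRIBUTE_NAME) -> Tuple[int, int]:
    """Return (total records, records with a non-empty span list)."""
    total = 0
    flagged = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        total += 1
        # An empty list (or a missing key) means nothing was flagged.
        if record.get("attributes", {}).get(attribute):
            flagged += 1
    return total, flagged


def count_flagged_in_dir(attribute_dir: str) -> Tuple[int, int]:
    """Sum the counts over every *.gz file in a directory (hypothetical helper)."""
    total = 0
    flagged = 0
    for path in sorted(Path(attribute_dir).glob("*.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            file_total, file_flagged = count_flagged(handle)
        total += file_total
        flagged += file_flagged
    return total, flagged
```

Calling count_flagged_in_dir("tmp/v0/attributes/decontaminate") on the output directory from the log would show whether any record at all carries a span.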
Information about my setup:
I am using the latest dolma 1.0.3 release. My latest minimum working example is based on configs/dolma-v1_5/decontamination.
Here are my config files
create-bloomfilter.yaml:
documents:
  - benchmarks.jsonl.gz # these are the files I want to filter with the decontamination step
dedupe:
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
  skip_empty: true
bloom_filter:
  read_only: false
  estimated_doc_count: 73543
  #size_in_bytes: 104857 # 100 MB; smaller causes too many FPs
  desired_false_positive_rate: 1e-3 # TOD: 1e-15
  file: decontamination_bloom_filter.bin
processes: 4
decontaminate.yaml:
documents:
  - tmp/v0/documents/*.gz
work_dir:
  input: work/para/input
  output: work/para/output
dedupe:
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
  skip_empty: true
bloom_filter:
  read_only: true
  estimated_doc_count: 288347
  desired_false_positive_rate: 1e-3
  file: decontamination_bloom_filter.bin
processes: 3
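For context on how estimated_doc_count and desired_false_positive_rate interact, here is a rough sketch of the textbook bloom filter sizing formulas. Whether dolma computes its filter size in exactly this way is an assumption, and both function names are hypothetical.

```python
import math


def bloom_filter_size_bits(n_items: int, fp_rate: float) -> int:
    """Textbook optimal size in bits: m = -n * ln(p) / (ln 2)^2."""
    if n_items <= 0 or not 0.0 < fp_rate < 1.0:
        raise ValueError("n_items must be positive and fp_rate must be in (0, 1)")
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))


def optimal_num_hashes(n_items: int, size_bits: int) -> int:
    """Textbook optimal number of hash functions: k = (m / n) * ln 2."""
    return max(1, round(size_bits / n_items * math.log(2)))


# Values taken from create-bloomfilter.yaml in this issue.
bits = bloom_filter_size_bits(73543, 1e-3)
print(f"{bits} bits (about {bits / 8 / 1024:.0f} KiB), "
      f"{optimal_num_hashes(73543, bits)} hash functions")
```

Under these formulas, lowering the false positive rate or raising the expected item count only makes the filter larger; neither setting can make an empty filter report matches.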
Here is the output
dolma -c create-bloomfilter.yaml dedupe
bloom_filter:
  desired_false_positive_rate: 0.001
  estimated_doc_count: 73543
  file: decontamination_bloom_filter.bin
  read_only: false
  size_in_bytes: 0
dedupe:
  min_length: 0
  min_words: 0
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
    by_ngram:
      ngram_length: 0
      overlap_threshold: 1.0
      skip_short_paragraphs: false
      stride: 0
    paragraph_separator: '
      '
  skip_empty: true
documents:
- benchmarks.jsonl.gz
processes: 4
work_dir:
  input: /tmp/dolma-input-1rmq0gbx
  output: /tmp/dolma-output-ky8van2k
[2024-06-27T12:34:26Z INFO dolma::bloom_filter] Loading bloom filter from "decontamination_bloom_filter.bin"...
[2024-06-27T12:34:26Z INFO dolma::deduper] Skipping "/disk/cschroeder/workspaces/dolma/benchmarks.jsonl.gz" because it already exists
[2024-06-27T12:34:26Z INFO dolma::deduper] Writing bloom filter to "decontamination_bloom_filter.bin"...
[2024-06-27T12:34:26Z INFO dolma::deduper] Bloom filter written.
[2024-06-27T12:34:26Z INFO dolma::deduper] Done!
dolma -c decontaminate.yaml dedupe
bloom_filter:
  desired_false_positive_rate: 0.1
  estimated_doc_count: 288347
  file: decontamination_bloom_filter.bin
  read_only: true
  size_in_bytes: 0
dedupe:
  min_length: 0
  min_words: 0
  name: decontaminate
  paragraphs:
    attribute_name: paragraphs_bff_duplicates
    by_ngram:
      ngram_length: 0
      overlap_threshold: 1.0
      skip_short_paragraphs: false
      stride: 0
    paragraph_separator: '
      '
  skip_empty: true
documents:
- tmp/v0/documents/*.gz
processes: 3
work_dir:
  input: work/para/input
  output: work/para/output
[2024-06-27T12:38:17Z INFO dolma::bloom_filter] Loading bloom filter from "decontamination_bloom_filter.bin"...
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0000.json.gz to tmp/v0/attributes/decontaminate/part-0000.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0000.json.gz to tmp/v0/attributes/decontaminate/part-0000.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0002.json.gz to tmp/v0/attributes/decontaminate/part-0002.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0001.json.gz to tmp/v0/attributes/decontaminate/part-0001.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0002.json.gz to tmp/v0/attributes/decontaminate/part-0002.json.gz.tmp
[2024-06-27T12:38:17Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0001.json.gz to tmp/v0/attributes/decontaminate/part-0001.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0000.json.gz" after deduping...
[2024-06-27T12:38:19Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0001.json.gz" after deduping...
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0003.json.gz to tmp/v0/attributes/decontaminate/part-0003.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0003.json.gz to tmp/v0/attributes/decontaminate/part-0003.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0004.json.gz to tmp/v0/attributes/decontaminate/part-0004.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Writing attributes for tmp/v0/documents/part-0004.json.gz to tmp/v0/attributes/decontaminate/part-0004.json.gz.tmp
[2024-06-27T12:38:19Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0002.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0003.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO dolma::deduper] Keeping local file "tmp/v0/documents/part-0004.json.gz" after deduping...
[2024-06-27T12:38:22Z INFO dolma::deduper] Writing bloom filter to "decontamination_bloom_filter.bin"...
[2024-06-27T12:38:22Z INFO dolma::deduper] Bloom filter written.
[2024-06-27T12:38:22Z INFO dolma::deduper] Done!
Am I missing something?