Running paragraph level deduplication on c4
I am trying to run paragraph-level deduplication using the dolma library and wanted to test it on c4. I downloaded allenai/c4 from Hugging Face, updated the schema to be text (string, doc content), id (long, unique id), source ("c4"), and saved it as json.gz files that are ~250MB/file. Any time I run dolma -c c4-dedupe.yaml dedupe, the output attribute is always an empty list. Here is the yaml I am using (which is almost identical to the one provided at configs/dolma-v1_5/para_dedupe/c4.yaml):
documents:
  - /home/c4/v0/documents/*.gz
dedupe:
  name: dedupe_paragraphs
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans
  skip_empty: true
bloom_filter:
  file: /tmp/c4.bloom
  read_only: false
  estimated_doc_count: 30000000000
  desired_false_positive_rate: 1e-06
processes: 350
The machine I am using has 360 vCPUs and is running Debian 11, Python 3.10. I tried using pip install dolma and downloading the library directly from the repo (neither worked). I built a small example input as I saw in this discussion, which worked totally fine. Pretty confused about this result.
I would really appreciate help / any thoughts why this might be the case.
uh, that is pretty confusing! could you post a sample of the data in your yaml file?
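A sample like the one requested above could be pulled with a short script. This is only a sketch, and it assumes the json.gz files hold one JSON object per line (JSON Lines) and that paragraphs in text are separated by newlines; neither is confirmed in the thread. The helper names sample_records and summarize are made up for illustration and are not part of dolma.

```python
import gzip
import json
from itertools import islice


def sample_records(path, n=3):
    """Return up to n JSON records from the first n lines of a gzipped file.

    Assumes one JSON object per line (JSON Lines); blank lines are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n) if line.strip()]


def summarize(record):
    """Report each field's Python type and the number of non-empty,
    newline-delimited paragraphs in the record's text field."""
    text = record.get("text", "")
    return {
        "fields": {key: type(value).__name__ for key, value in record.items()},
        "paragraphs": len([p for p in text.split("\n") if p.strip()]),
    }
```

Calling summarize on each record returned by sample_records for one of the files matched by /home/c4/v0/documents/*.gz would show whether id, text, and source have the types described in the question, and whether text contains more than one paragraph per document.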
Were you able to resolve this? @andrewhojel