
Running paragraph level deduplication on c4

Open • andrewhojel opened this issue 2 months ago • 1 comment

I am trying to run paragraph-level deduplication using the dolma library and wanted to test it on C4. I downloaded allenai/c4 from Hugging Face, updated the schema to text (string, doc content), id (long, unique id), and source ("c4"), and saved the result as json.gz files of ~250 MB each. Whenever I run dolma -c c4-dedupe.yaml dedupe, the output attribute is always an empty list. Here is the YAML I am using, which is almost identical to the one provided at configs/dolma-v1_5/para_dedupe/c4.yaml:

documents:
  - /home/c4/v0/documents/*.gz

dedupe:
  name: dedupe_paragraphs
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans
  skip_empty: true

bloom_filter:
  file: /tmp/c4.bloom
  read_only: false
  estimated_doc_count: 30000000000
  desired_false_positive_rate: 1e-06

processes: 350
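
As a rough sanity check on the bloom_filter settings above, here is a small sketch of how much memory a filter of that size would need. It uses the textbook Bloom filter sizing formula and treats estimated_doc_count as the number of items inserted; both are assumptions, and dolma's own sizing logic may differ. The function names are made up for illustration.

```python
import math


def bloom_filter_bits(n_items: int, fp_rate: float) -> int:
    """Textbook optimal bit count: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))


def optimal_hash_count(n_items: int, bits: int) -> int:
    """Textbook optimal number of hash functions: k = (m / n) * ln 2."""
    return max(1, round(bits / n_items * math.log(2)))


# Values copied from the bloom_filter section of the config above.
n_items = 30_000_000_000
fp_rate = 1e-06

bits = bloom_filter_bits(n_items, fp_rate)
size_gib = bits / 8 / 2**30
hashes = optimal_hash_count(n_items, bits)

print(f"~{size_gib:.1f} GiB, {hashes} hash functions")
```

Under these assumptions the filter comes out at roughly 100 GiB, so it is worth checking that the machine has that much free RAM and that /tmp can hold a file of that size.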

The machine I am using has 360 vCPUs and runs Debian 11 with Python 3.10. I tried both pip install dolma and installing the library directly from the repo; neither worked. I also built a small example input, as I saw in this discussion, and that worked totally fine. I am pretty confused by this result.

I would really appreciate any help or thoughts on why this might be the case.
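
For reference, a small example input of the kind mentioned above can be sketched roughly like this. It mirrors the schema described earlier (text, id, source) and assumes one JSON object per line inside each .gz file and paragraphs separated by newlines; those two details, the helper names, and the file name are assumptions for illustration, not something confirmed by the dolma docs here.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_docs(path: Path, docs: list[dict]) -> None:
    """Write one JSON object per line into a gzip-compressed file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")


def read_docs(path: Path) -> list[dict]:
    """Read the file back to confirm it parses line by line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Two documents sharing one paragraph, so a paragraph-level
# deduper has something to flag in the second document.
shared = "This paragraph appears in both documents."
docs = [
    {"id": 0, "source": "c4", "text": f"Unique intro A.\n{shared}"},
    {"id": 1, "source": "c4", "text": f"Unique intro B.\n{shared}"},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "documents" / "sample.json.gz"
    write_docs(sample_path, docs)
    round_trip = read_docs(sample_path)

print(round_trip == docs)
```

Running the real C4 files through the same read_docs check would at least confirm that each file parses as one JSON object per line with the expected keys.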

andrewhojel • Apr 20 '24 00:04