Ian Magnusson
To get more information for debugging the decon issues, I tried something I just thought of: running the decon pipeline using a copy of the saved bloom filter...
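Roughly, the idea looks like this (a sketch, not the exact commands used; the local path and config name are hypothetical):
```
# Work against a copy so the saved filter on S3 stays untouched (sketch)
aws s3 cp s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin \
  /mnt/tank/tmp/perplexity-suite-v3_option2.copy.bin
# Then point the dedupe config's bloom filter file at the copy and rerun, e.g.:
dolma -c configs/baselines/decontamination/pile.yaml dedupe
```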
Starting over from the top now with the new Dolma version (commit 2ee1ae27f32c09531699301ef8271a6cb45da2da):
```
conda remove -n dolma-baselines --all
aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin
```

## Setup Environment

> Create a conda...
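Before rebuilding, it's worth confirming the teardown actually took effect; a small sketch:
```
# Confirm the env and the old bloom filter are really gone (sketch)
conda env list | grep dolma-baselines || echo "env removed"
aws s3 ls s3://ai2-llm/bloom-filters/ | grep perplexity-suite-v3_option2 || echo "filter removed"
```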
Okay, I think the issue is that the old setup instructions had me installing the wrong wheels, so here we go again, this time with the right wheels. Starting over...
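One way to sanity-check which build actually got installed (a sketch; nothing dolma-specific beyond the package name):
```
# Verify the installed wheel's version and location (sketch)
pip show dolma
python -c "import dolma; print(dolma.__file__)"
```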
Next we're trying to tokenize:
```
dolma tokens \
  --documents "/mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz" \
  --destination /mnt/tank/dolma_tmp/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special \
  --tokenizer_name_or_path allenai/eleuther-ai-gpt-neox-20b-pii-special \
  --processes 224 \
  --seed 3920
```
But this gets the following error:
```
Traceback (most recent call...
```
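Before digging into the traceback, a quick check worth running (a sketch): confirm the documents glob actually matches files and count them, since the mismatch between `--processes` and the number of input files turns out to matter below.
```
# How many files does the documents glob actually match? (sketch)
ls /mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz | wc -l
```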
Now applying all this to RedPajama we get:
```
parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | wc -l
parallel --eta --bar "zcat {}...
```
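For context, that check counts documents whose `bff_duplicate_paragraph_spans_decontamination` attribute is the empty list `[]`, i.e. documents with no contaminated spans. A hedged variant that counts the complement (documents that do have spans); `jq -c` keeps each array on one line so grep sees whole values:
```
# Count docs that DO have decon spans (sketch; paths as above)
parallel --eta --bar "zcat {} | jq -c .attributes.bff_duplicate_paragraph_spans_decontamination | grep -v '^\[\]$'" \
  ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | wc -l
```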
And now falcon:

decon
```
dolma -c configs/baselines/decontamination/falcon-refinedweb.yaml dedupe
```
mix
```
dolma -c configs/baselines/mixing/falcon-refinedweb.json mix --processes 224
```
check doc removal
```
aws s3 sync s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0-0.05-heldout-complement/ /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/
parallel --eta...
```
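To turn the removal check into a single number, a sketch that totals document counts across the synced shards (adjust the path to wherever the `.gz` shards land; `paste`/`bc` just sum the per-file counts):
```
# Total document count across output shards (sketch)
find /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/ -name '*.gz' \
  | parallel "zcat {} | wc -l" | paste -sd+ - | bc
```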
We're redoing Pile tokenization now because of a bug that appears when tokenizing with more parallel processes than there are files in the dataset. We push a new config and run:
```
dolma -c...
```
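A defensive sketch for avoiding that bug in the future: clamp the worker count to the number of input files (variable names here are illustrative):
```
# Never ask for more workers than there are input files (sketch)
NFILES=$(ls /mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz | wc -l)
PROCS=$(( NFILES < 224 ? NFILES : 224 ))
echo "tokenizing with $PROCS processes for $NFILES files"
```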
Now let's do c4:
```
conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/
```
This data is already deconned for Dolma, so we go right to checking removal:
```
aws s3 sync s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2/...
```
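A quick way to sanity-check that the sync is complete before counting (a sketch; the local path is an assumption following the layout of the other datasets): compare object counts on S3 against files on disk.
```
# Object count on S3 vs. files on disk should match (sketch)
aws s3 ls --recursive s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2/ | wc -l
find /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/ -type f | wc -l
```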
Now mc4:
```
conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/
```
dedup
```
dolma -c configs/baselines/decontamination/mc4.yaml dedupe
```
Check removal
```
parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'"...
```
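If the aggregate count looks off, a per-shard breakdown can localize the problem; a sketch, with the path assumed to follow the same layout as the other datasets:
```
# Per-shard count of docs with non-empty decon spans, largest first (sketch)
for f in /mnt/tank/dolma_tmp/results/mc4/v0/attributes/perplexity_suite_v3_option2/*.gz; do
  n=$(zcat "$f" | jq -c .attributes.bff_duplicate_paragraph_spans_decontamination | grep -vc '^\[\]$')
  echo "$n $f"
done | sort -rn | head
```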
Now we'll make a dolma-cc-only dataset. This just needs tokenization, but for some reason it requires the code at main as of commit afab18c9b4f48be9a4df27552afb79b6e2a2a745:
```
conda create -n dolma-main-latest python=3.10
conda activate...
```
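The exact install command is truncated above; as a sketch of one way to pin that commit (an assumption, not necessarily what was run; a from-source install of dolma needs a Rust toolchain, since the dedupe/tokenizer core is Rust):
```
# Assumption: pin the main commit via a from-source install (needs Rust/maturin)
pip install "git+https://github.com/allenai/dolma.git@afab18c9b4f48be9a4df27552afb79b6e2a2a745"
```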