Ian Magnusson
To get more information for debugging the decon issues, I tried something I just thought of: running the decon pipeline using a copy of the saved bloom filter...
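Roughly, the idea looks like this (a sketch, not the exact commands used; the local path and config name are hypothetical):
```
# Work against a copy so the saved filter on S3 stays untouched (sketch)
aws s3 cp s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin \
  /mnt/tank/tmp/perplexity-suite-v3_option2.copy.bin
# Then point the dedupe config's bloom filter file at the copy and rerun, e.g.:
dolma -c configs/baselines/decontamination/pile.yaml dedupe
```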
Starting over from the top now with the new Dolma version (commit 2ee1ae27f32c09531699301ef8271a6cb45da2da):
```
conda remove -n dolma-baselines --all
aws s3 rm s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin
```

## Setup Environment

> Create a conda...
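Before rebuilding, it's worth confirming the teardown actually took effect; a small sketch:
```
# Confirm the env and the old bloom filter are really gone (sketch)
conda env list | grep dolma-baselines || echo "env removed"
aws s3 ls s3://ai2-llm/bloom-filters/ | grep perplexity-suite-v3_option2 || echo "filter removed"
```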
Okay, I think the issue is that the old setup instructions had me installing the wrong wheels, so here we go again, this time with the right wheels. Starting over...
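One way to sanity-check which build actually got installed (a sketch; nothing dolma-specific beyond the package name):
```
# Verify the installed wheel's version and location (sketch)
pip show dolma
python -c "import dolma; print(dolma.__file__)"
```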
Next we're trying to tokenize:
```
dolma tokens \
  --documents "/mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz" \
  --destination /mnt/tank/dolma_tmp/preprocessed/pile/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special \
  --tokenizer_name_or_path allenai/eleuther-ai-gpt-neox-20b-pii-special \
  --processes 224 \
  --seed 3920
```
But this gets the following error:
```
Traceback (most recent call...
```
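Before digging into the traceback, a quick check worth running (a sketch): confirm the documents glob actually matches files and count them, since the mismatch between `--processes` and the number of input files turns out to matter below.
```
# How many files does the documents glob actually match? (sketch)
ls /mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz | wc -l
```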
Now applying all this to RedPajama we get:
```
parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'" ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | wc -l
parallel --eta --bar "zcat {}...
```
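For context, that check counts documents whose `bff_duplicate_paragraph_spans_decontamination` attribute is the empty list `[]`, i.e. documents with no contaminated spans. A hedged variant that counts the complement (documents that do have spans); `jq -c` keeps each array on one line so grep sees whole values:
```
# Count docs that DO have decon spans (sketch; paths as above)
parallel --eta --bar "zcat {} | jq -c .attributes.bff_duplicate_paragraph_spans_decontamination | grep -v '^\[\]$'" \
  ::: /mnt/tank/dolma_tmp/results/redpajama/v1/attributes/perplexity_suite_v3_option2/split=train/dataset=*/*.gz | wc -l
```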
And now falcon:

decon
```
dolma -c configs/baselines/decontamination/falcon-refinedweb.yaml dedupe
```
mix
```
dolma -c configs/baselines/mixing/falcon-refinedweb.json mix --processes 224
```
check doc removal
```
aws s3 sync s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0-0.05-heldout-complement/ /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/
parallel --eta...
```
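To turn the removal check into a single number, a sketch that totals document counts across the synced shards (adjust the path to wherever the `.gz` shards land; `paste`/`bc` just sum the per-file counts):
```
# Total document count across output shards (sketch)
find /mnt/tank/dolma_tmp/results/falcon-refinedweb/v0-0.05-heldout-complement/ -name '*.gz' \
  | parallel "zcat {} | wc -l" | paste -sd+ - | bc
```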
We're redoing Pile tokenization now because of a bug that appears when tokenizing with more parallel processes than there are files in the dataset. We push a new config and run:
```
dolma -c...
```
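A defensive sketch for avoiding that bug in the future: clamp the worker count to the number of input files (variable names here are illustrative):
```
# Never ask for more workers than there are input files (sketch)
NFILES=$(ls /mnt/tank/dolma_tmp/results/pile/v0_decon_ppl_suite_v3/*.json.gz | wc -l)
PROCS=$(( NFILES < 224 ? NFILES : 224 ))
echo "tokenizing with $PROCS processes for $NFILES files"
```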
Now let's do c4:
```
conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/
```
This data is already deconned for Dolma, so we go right to checking removal:
```
aws s3 sync s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2/...
```
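A quick way to sanity-check that the sync is complete before counting (a sketch; the local path is an assumption following the layout of the other datasets): compare object counts on S3 against files on disk.
```
# Object count on S3 vs. files on disk should match (sketch)
aws s3 ls --recursive s3://ai2-llm/pretraining-data/sources/c4/v0/attributes/perplexity_suite_v3_option2/ | wc -l
find /mnt/tank/dolma_tmp/results/c4/v0/attributes/perplexity_suite_v3_option2/ -type f | wc -l
```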
Now mc4:
```
conda activate dolma-baselines-fixed
export TMPDIR=/mnt/tank/tmp/
```
dedup
```
dolma -c configs/baselines/decontamination/mc4.yaml dedupe
```
Check removal
```
parallel --eta --bar "zcat {} | jq .attributes.bff_duplicate_paragraph_spans_decontamination | grep '\[\]'"...
```
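If the aggregate count looks off, a per-shard breakdown can localize the problem; a sketch, with the path assumed to follow the same layout as the other datasets:
```
# Per-shard count of docs with non-empty decon spans, largest first (sketch)
for f in /mnt/tank/dolma_tmp/results/mc4/v0/attributes/perplexity_suite_v3_option2/*.gz; do
  n=$(zcat "$f" | jq -c .attributes.bff_duplicate_paragraph_spans_decontamination | grep -vc '^\[\]$')
  echo "$n $f"
done | sort -rn | head
```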
Now we'll make a dolma-cc-only dataset. This just needs tokenization, but for some reason it requires the code at main as of commit afab18c9b4f48be9a4df27552afb79b6e2a2a745:
```
conda create -n dolma-main-latest python=3.10
conda activate...
```
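The exact install command is truncated above; as a sketch of one way to pin that commit (an assumption, not necessarily what was run; a from-source install of dolma needs a Rust toolchain, since the dedupe/tokenizer core is Rust):
```
# Assumption: pin the main commit via a from-source install (needs Rust/maturin)
pip install "git+https://github.com/allenai/dolma.git@afab18c9b4f48be9a4df27552afb79b6e2a2a745"
```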