Pete Walsh

311 comments by Pete Walsh

@ayushbits we'll probably write about this in our upcoming "training details" paper. In the meantime I can give you my opinion, but take this with a grain of salt since...

Marking as blocked again because it doesn't appear to work properly on AMD GPUs. See #260.

Differences between our most recent run and Mitchell's:
- They all-reduce gradients in fp32 (#291). Jury is still out on this one. MosaicML does the all-reduce in bf16 like us.
- ...
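
For reference, the fp32 vs. bf16 all-reduce choice is just the `reduce_dtype` in FSDP's mixed precision settings. A rough sketch of what that looks like with vanilla PyTorch FSDP (not our actual trainer code, just the idea):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Assumes torch.distributed has already been initialized (e.g. via torchrun).
model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))

# Params/activations stay in bf16, but gradients are all-reduced in fp32.
# Using reduce_dtype=torch.bfloat16 instead gives the cheaper bf16 all-reduce.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    buffer_dtype=torch.bfloat16,
)

fsdp_model = FSDP(model.cuda(), mixed_precision=mp_policy)
```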

I'm doing a run on LUMI (`mitch-ish`) that matches Mitchell's config as closely as possible within our own code using #302:

```bash
sbatch scripts/v1-mix-medium-on-lumi.sh \
  --model.init_fn=mitchell \
  --fsdp.precision=mixed \
  --scheduler.t_warmup=2000...
```

I believe our data loading already has a deterministic order (given a seed) that's independent of the number of workers/nodes. What's NOT deterministic at the moment is our preprocessing/tokenization script...
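
To illustrate the deterministic ordering part, here's a simplified sketch (not our actual dataset code, just the idea): the full epoch order is a pure function of the seed, and each rank/worker only takes a disjoint slice of that fixed order, so changing the number of nodes or workers changes who reads what, not the order itself.

```python
import numpy as np

def global_order(num_instances: int, seed: int) -> np.ndarray:
    # The epoch-level order depends only on the seed, so it is identical
    # no matter how many nodes or dataloader workers are running.
    rng = np.random.Generator(np.random.PCG64(seed))
    return rng.permutation(num_instances)

def local_indices(rank: int, world_size: int, worker_id: int, num_workers: int,
                  num_instances: int, seed: int) -> np.ndarray:
    # Each (rank, worker) pair reads a strided slice of the same global order.
    order = global_order(num_instances, seed)
    return order[rank::world_size][worker_id::num_workers]
```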

> Another approach we can take is have the order of the documents set while it's all still in the `.jsonl` format with full text strings, and then make `scripts/prepare_memmap_dataset.py`...

> When a document is being concatenated to one training instance but it's too large to fit in the context window, do we throw out the overflow or do we...
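
For anyone skimming, the two options look roughly like this (a toy sketch of sequence packing, not necessarily what `scripts/prepare_memmap_dataset.py` does): either the overflow is carried into the next training instance, or it's thrown away.

```python
from typing import Iterable, Iterator, List

def pack(docs: Iterable[List[int]], seq_len: int, keep_overflow: bool) -> Iterator[List[int]]:
    """Concatenate tokenized documents into fixed-length training instances."""
    buffer: List[int] = []
    for tokens in docs:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            # keep_overflow=True: whatever didn't fit starts the next instance.
            # keep_overflow=False: whatever didn't fit is thrown away.
            buffer = buffer[seq_len:] if keep_overflow else []
    # A trailing partial instance would need padding or dropping; omitted here.
```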

I'm guessing we could handle that through a callback. E.g. have a callback that creates a hardlink with a specific name for each checkpoint we want to keep, like `final_checkpoint_batch10000.torch`...
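
Something along these lines, sketched with hypothetical names (there's no such callback in the repo yet):

```python
import os

class KeepCheckpointCallback:
    """Hypothetical callback that hardlinks checkpoints we want to keep forever,
    so the usual rotation of recent checkpoints can't clean them up."""

    def __init__(self, save_dir: str, keep_every: int = 10000):
        self.save_dir = save_dir
        self.keep_every = keep_every

    def post_checkpoint(self, batch_idx: int, checkpoint_path: str) -> None:
        if batch_idx % self.keep_every != 0:
            return
        keep_path = os.path.join(self.save_dir, f"final_checkpoint_batch{batch_idx}.torch")
        if not os.path.exists(keep_path):
            # A hardlink costs no extra space and survives deletion of the original file.
            os.link(checkpoint_path, keep_path)
```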

So far so spikey

![Image](https://github.com/allenai/LLM/assets/8812459/d639fc57-3218-4fc2-a5ce-39e0241963e8)

I have another job queued that runs the same thing except with fp32 all-reduce. We'll see if that helps.

What were you trying to debug? Sure, we could make the training run on a single CPU (or GPU), but that adds complexity and new code paths. E.g. we can't...