Pete Walsh

311 comments by Pete Walsh

@ayushbits we'll probably write about this in our upcoming "training details" paper. In the meantime I can give you my opinion, but take this with a grain of salt since...

Marking as blocked again because it doesn't appear to work properly on AMD GPUs. See #260.

Differences between our most recent run and Mitchell's:
- They all-reduce gradients in fp32 (#291). Jury is still out on this one. MosaicML does the all-reduce in bf16 like us.
- ...
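
For reference, the fp32 vs. bf16 all-reduce choice is just the `reduce_dtype` in FSDP's mixed precision settings. A rough sketch of what that looks like with vanilla PyTorch FSDP (not our actual trainer code, just the idea):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Assumes torch.distributed has already been initialized (e.g. via torchrun).
model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))

# Params/activations stay in bf16, but gradients are all-reduced in fp32.
# Using reduce_dtype=torch.bfloat16 instead gives the cheaper bf16 all-reduce.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    buffer_dtype=torch.bfloat16,
)

fsdp_model = FSDP(model.cuda(), mixed_precision=mp_policy)
```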

I'm doing a run on LUMI (`mitch-ish`) that matches Mitchell's config as closely as possible within our own code using #302:

```bash
sbatch scripts/v1-mix-medium-on-lumi.sh \
  --model.init_fn=mitchell \
  --fsdp.precision=mixed \
  --scheduler.t_warmup=2000...
```

I believe our data loading already has a deterministic order (given a seed) that's independent of the number of workers/nodes. What's NOT deterministic at the moment is our preprocessing/tokenization script...
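
To illustrate the deterministic ordering part, here's a simplified sketch (not our actual dataset code, just the idea): the full epoch order is a pure function of the seed, and each rank/worker only takes a disjoint slice of that fixed order, so changing the number of nodes or workers changes who reads what, not the order itself.

```python
import numpy as np

def global_order(num_instances: int, seed: int) -> np.ndarray:
    # The epoch-level order depends only on the seed, so it is identical
    # no matter how many nodes or dataloader workers are running.
    rng = np.random.Generator(np.random.PCG64(seed))
    return rng.permutation(num_instances)

def local_indices(rank: int, world_size: int, worker_id: int, num_workers: int,
                  num_instances: int, seed: int) -> np.ndarray:
    # Each (rank, worker) pair reads a strided slice of the same global order.
    order = global_order(num_instances, seed)
    return order[rank::world_size][worker_id::num_workers]
```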

> Another approach we can take is have the order of the documents set while it's all still in the `.jsonl` format with full text strings, and then make `scripts/prepare_memmap_dataset.py`...

> When a document is being concatenated to one training instance but it's too large to fit in the context window, do we throw out the overflow or do we...
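
For anyone skimming, the two options look roughly like this (a toy sketch of sequence packing, not necessarily what `scripts/prepare_memmap_dataset.py` does): either the overflow is carried into the next training instance, or it's thrown away.

```python
from typing import Iterable, Iterator, List

def pack(docs: Iterable[List[int]], seq_len: int, keep_overflow: bool) -> Iterator[List[int]]:
    """Concatenate tokenized documents into fixed-length training instances."""
    buffer: List[int] = []
    for tokens in docs:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            # keep_overflow=True: whatever didn't fit starts the next instance.
            # keep_overflow=False: whatever didn't fit is thrown away.
            buffer = buffer[seq_len:] if keep_overflow else []
    # A trailing partial instance would need padding or dropping; omitted here.
```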

I'm guessing we could handle that through a callback. E.g. have a callback that creates a hardlink with a specific name for each checkpoint we want to keep, like `final_checkpoint_batch10000.torch`...
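
Something along these lines, sketched with hypothetical names (there's no such callback in the repo yet):

```python
import os

class KeepCheckpointCallback:
    """Hypothetical callback that hardlinks checkpoints we want to keep forever,
    so the usual rotation of recent checkpoints can't clean them up."""

    def __init__(self, save_dir: str, keep_every: int = 10000):
        self.save_dir = save_dir
        self.keep_every = keep_every

    def post_checkpoint(self, batch_idx: int, checkpoint_path: str) -> None:
        if batch_idx % self.keep_every != 0:
            return
        keep_path = os.path.join(self.save_dir, f"final_checkpoint_batch{batch_idx}.torch")
        if not os.path.exists(keep_path):
            # A hardlink costs no extra space and survives deletion of the original file.
            os.link(checkpoint_path, keep_path)
```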

So far so spikey

![Image](https://github.com/allenai/LLM/assets/8812459/d639fc57-3218-4fc2-a5ce-39e0241963e8)

I have another job queued that runs the same thing except with fp32 all-reduce. We'll see if that helps.

What were you trying to debug? Sure, we could make the training run on a single CPU (or GPU), but that adds complexity and new code paths. E.g. we can't...