OLMo
Modeling, training, eval, and inference code for OLMo
Trying TorchScript and applying the rotations in the complex plane instead of in R²
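The two formulations of a rotary embedding rotation are mathematically equivalent: rotating a pair (x₀, x₁) by an angle with a 2×2 rotation matrix in R² gives the same result as treating the pair as the complex number x₀ + i·x₁ and multiplying by e^{iθ}. A minimal sketch of the equivalence (hypothetical helper functions, not the OLMo implementation, shown for a single pair and angle):

```python
import math

def rotate_r2(x0, x1, angle):
    # Rotation in R^2 via an explicit 2x2 rotation matrix.
    c, s = math.cos(angle), math.sin(angle)
    return (x0 * c - x1 * s, x0 * s + x1 * c)

def rotate_complex(x0, x1, angle):
    # The same rotation as multiplication by e^{i*angle} in the complex plane.
    z = complex(x0, x1) * complex(math.cos(angle), math.sin(angle))
    return (z.real, z.imag)
```

In real RoPE each head dimension pair gets its own frequency and the angle is position × frequency; the complex-plane form just packs the matrix multiply into one complex multiplication per pair.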
It is suspicious that we had two slightly different models (one with biases, one without) that both spiked at exactly the same moment. This suggests there might be a data...
## What happens now

Our runs produce "checkpoint directories". You might have seen them. Checkpoint directories contain a bunch of debris from a run, including between 0 and n actual...
The problem is that on LUMI, FSDP doesn't overlap computation and communication like it should. Evidence comes from this profiler trace:

[profiler trace]

It may be noteworthy that the NCCL GPU...
- Does not yet support checkpointing
- `configs/olmo-small-ablation-lumi-deepspeed.yaml` is the same as `configs/olmo-small-ablation-lumi.yaml` except for `deepspeed: true` & `init_device: cpu`
- `scripts/lumi/olmo-small-ablation-on-lumi-test.sh` is the same as `scripts/lumi/olmo-small-ablation-on-lumi-test-deepspeed.sh` except for `export...`
This is the `kebab` config, a smaller version of the `dirk` config. Differences from `dirk`:

* untied weights
* weight decay on everything
* adjusted `mlp_hidden_size` so we come out...
Updating the Llama config to use Llama block and RoPE lower precision, to match the behavior of bf16-autocast Llama more closely.
## Update - 11/3/23

Mitch is a big fan of Z-loss. Currently they're running Z-loss, no weight tying, LR=1e-3, wd=0.1, QK norm. So with Z-loss (and potentially QK norm) it...
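For reference, Z-loss is an auxiliary penalty on the log of the softmax normalizer Z = Σ exp(logitᵢ), added to the cross-entropy loss to keep log Z near zero and stabilize training. A minimal sketch in plain Python (the coefficient and function name are illustrative, not the values or code used in these runs):

```python
import math

def z_loss(logits, coef=1e-4):
    """Auxiliary loss coef * (log Z)^2, where Z is the softmax normalizer.

    Illustrative sketch only; `coef` here is a placeholder, not the
    coefficient used in the runs described above.
    """
    # Numerically stable log-sum-exp: subtract the max before exponentiating.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return coef * log_z ** 2
```

Because the penalty is on (log Z)², it pushes the normalizer toward 1 without changing the softmax probabilities themselves.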