
Add t0 scripts

Open Muennighoff opened this issue 3 years ago • 3 comments

Notes:

RE: Learning rate. T0 & FLAN use Adafactor, which automatically adjusts the step size. Quoting the Adafactor paper:

> Finally, while the learning rate in Adam denotes a target absolute step size, we follow the intuition that relative change in the parameters is more relevant, so we propose scaling the size of the updates relative to the scale of the parameters themselves.

Due to this scaling, Adafactor may be more resistant to higher learning rates, and since the step size adjusts automatically, scheduling may be less needed (i.e. if you have weight decay with Adafactor, the step size will automatically decay because the parameters decay). For now I'm keeping a constant, conservative LR of 1e-5, but we may want to instead go higher and add warmup + scheduling. Thoughts?
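A minimal sketch of that constant-LR setup, using the Hugging Face `transformers` implementation of Adafactor (the model and all values are illustrative assumptions, not taken from this PR):

```python
from transformers import AutoModelForSeq2SeqLM
from transformers.optimization import Adafactor

# Placeholder model; the actual scripts target Megatron-DeepSpeed training.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# With relative_step/scale_parameter/warmup_init disabled, Adafactor uses
# the externally supplied LR as a fixed step size, i.e. the constant,
# conservative 1e-5 discussed above, with no warmup and no schedule.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-5,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
    weight_decay=0.0,  # no weight decay, matching the answer below
)
```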

Muennighoff avatar Jul 04 '22 09:07 Muennighoff

> T0 leaves some HPs unspecified like Warmup, Weight Decay; Let's discuss them here?

No warmup, as you have a constant learning rate. No weight decay (I'll double-check that one).

> Currently, it would use 5% of the training set for validation.

If you can actually use the validation data from T0 then I'd say this is better.
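For reference, the current 5% holdout corresponds to a train/valid/test split of 95/5/0 via Megatron's `--split` argument; a placeholder launcher fragment (script name and dataset path are hypothetical):

```python
# Hypothetical CLI fragment for the current behavior: hold out 5% of the
# training data for validation. Script name and dataset prefix are made up.
launch_args = [
    "finetune_t0.py",                    # hypothetical script name
    "--data-path", "/path/to/p3_train",  # hypothetical dataset prefix
    "--split", "95,5,0",                 # 95% train, 5% valid, 0% test
]
```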

thomasw21 avatar Jul 04 '22 10:07 thomasw21

> If you can actually use the validation data from T0 then I'd say this is better.

For that, either:

a) Add a new arg like `args.data_path` that calls `build_train_valid_test_datasets` again with a 100% valid split
b) Concat the train & valid sets, make one indexed dataset & set the ratio such that they are separated again
c) Use `args.valid_weighted_split_paths` & `build_dataset_group`, which doesn't work yet for MTF

I think a) or c) is best - Wdyt?
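For concreteness, option a) could look roughly like this; the `--valid-data-path` arg name is made up, and the call mirrors the standard Megatron `build_train_valid_test_datasets` signature, which may differ on this branch:

```python
# Import paths follow upstream Megatron and may differ in this fork.
from megatron import get_args
from megatron.data.gpt_dataset import build_train_valid_test_datasets

# Hypothetical sketch of option a): a new --valid-data-path arg whose
# dataset is built with a "0,100,0" split, so the entire file at that
# path becomes the validation set (e.g. T0's own validation data).
args = get_args()
if args.valid_data_path is not None:  # hypothetical new argument
    _, valid_ds, _ = build_train_valid_test_datasets(
        data_prefix=args.valid_data_path,
        data_impl=args.data_impl,
        splits_string="0,100,0",  # 0% train, 100% valid, 0% test
        train_valid_test_num_samples=train_val_test_num_samples,
        seq_length=args.seq_length,
        seed=args.seed,
        skip_warmup=(not args.mmap_warmup),
    )
```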

Muennighoff avatar Jul 04 '22 11:07 Muennighoff

We probably need to use this; it's already implemented as an API: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/c5b88fb92d4417f77d729c95ce95e3a740b47065/megatron/arguments.py#L822-L840. I'll update the T0 branch to have that feature.
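If I'm reading the argparse help in that snippet right, a dataset group is passed as `"NAME: WEIGHT START:END PATH"`, so pointing validation at T0's own validation data might look like this (name and path are hypothetical, and the exact syntax should be double-checked against the branch):

```python
# Sketch of option c): route the whole dataset at a given path (range 0:1,
# weight 1) to validation via the weighted-split-paths API linked above.
valid_split_args = [
    "--valid-weighted-split-paths",
    "T0_VALID: 1 0:1 /path/to/t0_valid",  # hypothetical name and path
]
```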

thomasw21 avatar Jul 04 '22 12:07 thomasw21

Already merged via another PR.

Muennighoff avatar Nov 03 '22 17:11 Muennighoff