Prefix LM Eval
This PR adapts evaluation to work with Prefix LMs, such as those used for the T0 finetuning experiments.
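For context on what the `--prefix` flag changes: when evaluating as a prefix LM, the input/context tokens attend to each other bidirectionally while the target tokens stay causal; without it, the whole sequence is scored under a standard causal mask. Below is a minimal illustrative sketch of the mask semantics only (function name and shapes are made up for illustration, this is not the actual Megatron-DeepSpeed mask construction):

```python
import torch

def eval_attention_mask(seq_len: int, prefix_len: int, prefix_lm: bool) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask; True = position i may attend to position j.

    Causal LM: lower-triangular (each token sees itself and the past).
    Prefix LM: same, except the first `prefix_len` tokens (the input/context)
    also attend to each other bidirectionally; target tokens remain causal.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if prefix_lm:
        mask[:prefix_len, :prefix_len] = True  # full attention inside the prefix
    return mask

# Toy example: 4 context tokens followed by 2 target tokens.
print(eval_attention_mask(6, 4, prefix_lm=False).int())
print(eval_attention_mask(6, 4, prefix_lm=True).int())
```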
Using the normal eval harness I get the following results:
- `CHECKPOINT_PATH=$six_ALL_CCFRSCRATCH/checkpoints/tr11f-6B3-ml/checkpoints/main/global_step163750` (CKPT prior to MTF): copa `"acc": 0.58`
- `CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step2000`: copa `"acc": 0.7`
- `CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step3100`: copa `"acc": 0.67`
- `CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step3100`, without `--prefix`: copa `"acc": 0.73`
cc @lintangsutawika @haileyschoelkopf - I somehow can't add you as reviewers, but it would be great if you could take a look. I'm not 100% sure about the results I got 🧐
Will take a closer look.
@Muennighoff so the intended result is supposed to be that with Prefix-LM the performance should be higher, right? However, based on the scores you shared, this does not seem to be the case.
Yeah, so according to the current results, evaluating the model as a causal LM is better than evaluating it as a prefix LM, even after it was fine-tuned as a prefix LM. Also note:
- In both cases it is better than prior to fine-tuning.
- There is also no strong performance difference between the CD + CD and CD + ND models in the *What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?* paper, i.e. between
  - CD:FLM (219B) + CD:MTF (13B)
  - CD:FLM (219B) + ND:MTF (13B)

  (CD = causal decoder, ND = non-causal decoder, FLM = full language modeling, MTF = multitask finetuning.)