Megatron-DeepSpeed

Prefix LM Eval

Open Muennighoff opened this issue 3 years ago • 4 comments

This PR adapts evaluation to work with Prefix LMs, such as those used for the T0 finetuning experiments.
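For reviewers unfamiliar with the setup, the change conceptually amounts to building a different attention mask at eval time: tokens in the context (the prefix) attend bidirectionally, while continuation tokens stay causal. Below is a minimal illustrative PyTorch sketch of that mask, not the code in this PR, and the function name is made up:

```python
import torch

def prefix_lm_attention_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a prefix LM.

    Tokens inside the prefix attend to each other bidirectionally; tokens
    after the prefix attend causally (to the whole prefix and to earlier
    continuation tokens). With prefix_len == 0 this reduces to the usual
    lower-triangular mask of a causal LM.
    """
    # Standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Open up full attention inside the prefix region.
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: 3 prefix tokens followed by 2 continuation tokens.
print(prefix_lm_attention_mask(seq_len=5, prefix_len=3).int())
```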

Using the normal eval harness I get the following results:

  • CHECKPOINT_PATH=$six_ALL_CCFRSCRATCH/checkpoints/tr11f-6B3-ml/checkpoints/main/global_step163750 (CKPT prior to MTF): copa "acc": 0.58
  • CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step2000: copa "acc": 0.7
  • CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step3100: copa "acc": 0.67
  • CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step3100 without --prefix: copa "acc": 0.73
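For context on what these copa numbers measure: the eval harness ranks each candidate continuation by its log-likelihood given the context, so the only thing --prefix changes is whether that context is scored with a bidirectional or a causal mask. A rough sketch of the scoring step, reusing the mask helper above (the model interface here is an assumption for illustration, not the harness's actual API):

```python
import torch
import torch.nn.functional as F

def continuation_logprob(model, context_ids, continuation_ids, use_prefix=True):
    """Sum of log p(continuation | context), ignoring the context tokens.

    With use_prefix=True the context span is treated as a bidirectional
    prefix; with use_prefix=False the whole sequence is scored causally.
    The model signature (input_ids, attention_mask) -> logits is assumed.
    """
    input_ids = torch.cat([context_ids, continuation_ids]).unsqueeze(0)
    seq_len = input_ids.size(1)
    prefix_len = len(context_ids) if use_prefix else 0
    mask = prefix_lm_attention_mask(seq_len, prefix_len)

    logits = model(input_ids, attention_mask=mask)   # (1, seq_len, vocab)
    logprobs = F.log_softmax(logits, dim=-1)

    # The token at position i is predicted from position i - 1, so score
    # positions len(context) .. end using the preceding positions' logits.
    start = len(context_ids)
    targets = input_ids[0, start:]
    preds = logprobs[0, start - 1:seq_len - 1]
    return preds.gather(-1, targets.unsqueeze(-1)).sum().item()

# For COPA, score each candidate continuation this way and pick the argmax.
```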

Muennighoff avatar Jul 16 '22 14:07 Muennighoff

cc @lintangsutawika @haileyschoelkopf - I can't add you as reviewers somehow, but it would be great if you could take a look. I'm not 100% sure about the results I got 🧐

Muennighoff avatar Jul 16 '22 14:07 Muennighoff

Will take a closer look.

lintangsutawika avatar Jul 16 '22 15:07 lintangsutawika

@Muennighoff so the intended result is supposed to be that with Prefix-LM the performance should be higher, right? However, based on the scores you shared, this does not seem to be the case.

lintangsutawika avatar Jul 22 '22 03:07 lintangsutawika

Yeah, so according to the current results, evaluating the model as a causal LM works better than evaluating it as a prefix LM, even after it was fine-tuned as a prefix LM. Also note:

  • In both cases the scores are better than prior to fine-tuning.
  • There is no strong performance difference for the CD + CD & CD + ND models in the "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?" paper, i.e. for
    CD:FLM (219B) + CD:MTF (13B)
    CD:FLM (219B) + ND:MTF (13B)

Muennighoff avatar Jul 22 '22 07:07 Muennighoff