How long should eval_model.py -t blended_skill_talk -m zoo/blender_90 take?
When I run:
python parlai/scripts/eval_model.py -t blended_skill_talk \
-mf zoo:blender/blender_90M/model --metrics ppl
I get output:
14:57:01 INFO | 0.5% complete (30 / 5,651), 0:00:10 elapsed, 0:32:10 eta
accuracy exs f1 gpu_mem loss ppl token_acc tpb
0 30 .1705 .01175 2.688 14.7 .4363 14.13
...
14:59:43 INFO | 8.6% complete (487 / 5,651), 0:02:52 elapsed, 0:30:29 eta
accuracy exs f1 gpu_mem loss ppl token_acc tpb
0 487 .1815 .01175 2.619 13.72 .4340 18.25
(so ~30 mins runtime).
When I run:
python parlai/scripts/eval_model.py -t blended_skill_talk -mf logs/bb12/v2 --metrics ppl
Evaluation finishes in 90 seconds, but has no accuracy column:
15:11:19 INFO | 8.9% complete (503 / 5,651), 0:00:10 elapsed, 0:01:42 eta
exs gpu_mem loss ppl token_acc tpb
503 .1766 2.482 11.97 .4422 19.15
15:12:58 INFO | Finished evaluating tasks ['blended_skill_talk'] using datatype valid
exs gpu_mem loss ppl token_acc tpb
5651 .1766 2.549 12.79 .4410 19.18
When I run:
python parlai/scripts/eval_model.py -t blended_skill_talk \
-mf zoo:blender/blender_3B/model --metrics ppl
The output suggests that eval will take 6.5 hrs. There is an accuracy column.
15:05:22 INFO | 0.4% complete (21 / 5,651), 0:01:29 elapsed, 6:40:14 eta
accuracy exs f1 gpu_mem loss ppl token_acc tpb
0 21 .1650 .3369 2.508 12.28 .4252 14
The model at logs/bb12/v2 is a version of blenderbot-3B (so only 128 tokens of history) with only 12 decoder layers. But I would still be surprised if evaluating it were 30 times faster than the 90M model and ~1000 times faster than an identical model with 2x as many decoder layers. So I suspect the discrepancy may be caused by the accuracy column issue.
I also cleaned up the model_file to contain only the 'model' key and its value, rather than the optimizer, scheduler, etc. But I suspect the difference lies in the presence of an accuracy column, and/or safety classifiers/models being run when the zoo: models are evaluated.
Other things I tried: adding --beam_delay 30 --skip_generation True --metrics default to all commands. No change in runtime.
Has anyone experienced this issue? Thanks!
Yeah a couple things are messing with you:
- eval_model "forgets" the batch size. So all your things are running with a batchsize of 1. Based on "gpu_mem", it looks like you could pump up the batchsize much higher.
- Computing PPL/token accuracy doesn't require you to do beam search, only a single forward pass. Accuracy/f1 DO require you to do complete generation, which makes it MUCH slower.
So if you ONLY want ppl, you're best off using something like --batchsize 32 --skip-generation true. This will disable beam search and you should be able to evaluate the 90M model in a couple of minutes. If you DO want to do beam search and get full generations, then you can expect it to be (beam-size * average generation length) times slower.
(Also --beam-delay only does anything if --inference delayedbeam is set)
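If you did want to try that decoding mode, pairing the two would look something like this (just a sketch reusing the flags above; not something you need for PPL):
python parlai/scripts/eval_model.py -t blended_skill_talk \
-mf zoo:blender/blender_90M/model --inference delayedbeam --beam-delay 30 --batchsize 32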
ALSO, I have this WIP PR that's very close but just needs some testing: https://github.com/facebookresearch/ParlAI/pull/2775. TLDR is that eval_model only uses one GPU, and the new PR fixes this.
Awesome. That solves it!
This command runs in 20 seconds:
python parlai/scripts/eval_model.py -t blended_skill_talk \
-mf zoo:blender/blender_90M/model --metrics ppl --batchsize 32 --skip-generation true
Excited for that PR.
Closing this, but lemme know if you have further questions Sam. Cheers!
Is --gpu -1 the correct option for multigpu?
When I run --gpu -1 -bs 24 on an 8-GPU setup, only 1 GPU gets used. When I run --gpu -1 -bs 128, it OOMs.
Also is there a way to run eval only on a subset of the validation data?
No, it will always use only 1 gpu for evaluation. If you use that PR, then you can use multiprocessing_eval with otherwise identical arguments and it will split the data across the 8 GPUs. (But batch size is still done per GPU).
As far as a subset of the data, you can use --num-examples to limit the number you evaluate on (as in, the first N). If you want to do something more complex, we would need to talk more about the details.
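For example, with that PR checked out, something along these lines should split eval across your 8 GPUs and stop after the first 500 examples (a sketch only: it reuses your paths and the flags from above, and assumes the PR exposes the script as parlai.scripts.multiprocessing_eval, mirroring multiprocessing_train):
python -m parlai.scripts.multiprocessing_eval -t blended_skill_talk \
-mf logs/bb12/v2 --batchsize 32 --skip-generation true --num-examples 500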
I meant during training, sorry for being horribly unclear.
To rephrase, I am trying to:
(a) train on multiple GPUs
(b) at every val step (I know about --validation_every_n_epochs), run validation on only the first 500 examples.
Ah, a few tricks to speed up training:
- There is parlai.scripts.multiprocessing_train. It behaves just like the multiprocessing_eval I described above: simply switch from calling python -m parlai.scripts.train_model to python -m parlai.scripts.multiprocessing_train. We also have distributed train if you happen to be on a SLURM cluster.
- There is --validation-max-exs for limiting the size of the dataset during validation. I don't usually use it (validation data will be split across the 8 GPUs, and I always train with --skip-generation true, so I find it's quite fast, even when validation is 100k examples).
- You can also use --dynamic-batching full to often get yourself a ~2.5x speedup; it groups batches into similarly sized sentences and grows the batch size to maximize memory use. This option doesn't play well with --validation-max-exs though.
- Of course, --fp16 true and having apex installed helps.
Oh, and --eval-batchsize is also an option, to pump up the batchsize during validation, since you don't need activations/gradients.
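Putting a few of those together, a rough sketch of what the multi-GPU training command might look like (the agent choice, model-file path, and batch sizes here are placeholders, so adjust to taste; I left out --dynamic-batching full since you want --validation-max-exs):
python -m parlai.scripts.multiprocessing_train -t blended_skill_talk \
-m transformer/generator -mf /tmp/bst_model \
--batchsize 16 --eval-batchsize 32 --skip-generation true --fp16 true \
--validation-every-n-epochs 0.25 --validation-max-exs 500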
This is good; all of this should be turned into a docs page. Thanks for asking these great questions.
Would be super useful.
The reason I couldn't figure this out was that
python examples/train_model.py --help | grep val
python examples/train_model.py --help | grep gpu
don't return the relevant options, because the stuff I was looking for isn't in the Training Arguments group, I guess.
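A workaround that surfaces more of them is to grep case-insensitively with some surrounding context (plain grep flags, nothing ParlAI-specific):
python examples/train_model.py --help | grep -iE -B2 -A2 'validation|gpu|batch'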
One of us could also just check in an opinionated single_gpu.opt and multigpu.opt with those defaults and then unit-test that they don't cause warnings or contain the word 'internal'. The blenderbot* defaults are OK, but they cause scary warnings, and 90M has internal:blended_skill_talk as the first task, which breaks. Also, both seem to have duplicated logic across model.opt, model.dict.opt, and the 'overrides' key, which makes it hard to know where to edit stuff.
Ah, the BlenderBot 3B and 9B models have --model-parallel true, which doesn't mix with multiprocessing_train, sorry. They're trained that way because they're just so big lol.
(Thinking about your proposal, it's interesting)