How long should eval_model.py -t blended_skill_talk -m zoo/blender_90 take?
When I run:
python parlai/scripts/eval_model.py -t blended_skill_talk \
-mf zoo:blender/blender_90M/model --metrics ppl
I get output:
14:57:01 INFO | 0.5% complete (30 / 5,651), 0:00:10 elapsed, 0:32:10 eta
accuracy exs f1 gpu_mem loss ppl token_acc tpb
0 30 .1705 .01175 2.688 14.7 .4363 14.13
...
14:59:43 INFO | 8.6% complete (487 / 5,651), 0:02:52 elapsed, 0:30:29 eta
accuracy exs f1 gpu_mem loss ppl token_acc tpb
0 487 .1815 .01175 2.619 13.72 .4340 18.25
(so ~30 mins runtime).
When I run:
python parlai/scripts/eval_model.py -t blended_skill_talk -mf logs/bb12/v2 --metrics ppl
Evaluation finishes in 90 seconds, but has no accuracy column:
15:11:19 INFO | 8.9% complete (503 / 5,651), 0:00:10 elapsed, 0:01:42 eta
exs gpu_mem loss ppl token_acc tpb
503 .1766 2.482 11.97 .4422 19.15
15:12:58 INFO | Finished evaluating tasks ['blended_skill_talk'] using datatype valid
exs gpu_mem loss ppl token_acc tpb
5651 .1766 2.549 12.79 .4410 19.18
When I run:
python parlai/scripts/eval_model.py -t blended_skill_talk \
-mf zoo:blender/blender_3B/model --metrics ppl
The output suggests that eval will take 6.5 hrs. There is an accuracy column.
15:05:22 INFO | 0.4% complete (21 / 5,651), 0:01:29 elapsed, 6:40:14 eta
accuracy exs f1 gpu_mem loss ppl token_acc tpb
0 21 .1650 .3369 2.508 12.28 .4252 14
The model at logs/bb12/v2 is a version of blenderbot-3B (so only 128 tokens of history) with only 12 decoder layers. But I would still be surprised if evaluating it were 30 times faster than the 90M model and ~1000 times faster than an identical model with 2x as many decoder layers. So I suspect the discrepancy may be caused by the accuracy column issue.
I also cleaned up the model_file to contain only the 'model' key and its value, rather than the optimizer, scheduler, etc. But I suspect the difference lies in the presence of an accuracy column, and/or safety classifiers/models being run when the zoo: models are evaluated.
Other things I tried: adding --beam_delay 30 --skip_generation True --metrics default to all commands. No change in runtime.
Has anyone experienced this issue? Thanks!
Yeah a couple things are messing with you:
- eval_model "forgets" the batch size. So all your things are running with a batchsize of 1. Based on "gpu_mem", it looks like you could pump up the batchsize much higher.
- Computing PPL/token accuracy doesn't require you to do beam search, only a single forward pass. Accuracy/f1 DO require you to do complete generation, which makes it MUCH slower.
So if you ONLY want ppl, you're best off using something like --batchsize 32 --skip-generation true. This will disable beam search and you should be able to evaluate the 90M model in a couple of minutes. If you DO want to do beam search and get full generations, then you can expect it to be (beam-size * average generation length) times slower.
(Also --beam-delay only does anything if --inference delayedbeam is set)
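If you did want to try that decoding mode, pairing the two would look something like this (just a sketch reusing the flags above; not something you need for PPL):
python parlai/scripts/eval_model.py -t blended_skill_talk \
-mf zoo:blender/blender_90M/model --inference delayedbeam --beam-delay 30 --batchsize 32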
ALSO, I have this WIP PR that's very close but just needs some testing: https://github.com/facebookresearch/ParlAI/pull/2775. TLDR is that eval_model only uses one GPU, and the new PR fixes this.
Awesome. That solves it!
This command runs in 20 seconds:
python parlai/scripts/eval_model.py -t blended_skill_talk \
-mf zoo:blender/blender_90M/model --metrics ppl --batchsize 32 --skip-generation true
Excited for that PR.
Closing this, but lemme know if you have further questions Sam. Cheers!
Is --gpu -1 the correct option for multigpu?
When I run --gpu -1 -bs 24 on an 8-GPU setup, only 1 GPU gets used. When I run --gpu -1 -bs 128, it OOMs.
Also is there a way to run eval only on a subset of the validation data?
No, it will always use only 1 gpu for evaluation. If you use that PR, then you can use multiprocessing_eval with otherwise identical arguments and it will split the data across the 8 GPUs. (But batch size is still done per GPU).
As far as a subset of the data, you can use --num-examples to limit the number you evaluate on (as in, the first N). If you want to do something more complex, we would need to talk more about the details.
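For example, with that PR checked out, something along these lines should split eval across your 8 GPUs and stop after the first 500 examples (a sketch only: it reuses your paths and the flags from above, and assumes the PR exposes the script as parlai.scripts.multiprocessing_eval, mirroring multiprocessing_train):
python -m parlai.scripts.multiprocessing_eval -t blended_skill_talk \
-mf logs/bb12/v2 --batchsize 32 --skip-generation true --num-examples 500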
I meant during training, sorry for being horribly unclear.
To rephrase, I am trying to:
(a) train on multiple GPUs
(b) at every val step (I know about --validation_every_n_epochs), run validation on only the first 500 examples.
Ah, a few tricks to speed up training:
- There is parlai.scripts.multiprocessing_train. It behaves just like the multiprocessing_eval I described above: simply switch from calling python -m parlai.scripts.train_model to python -m parlai.scripts.multiprocessing_train. We also have distributed train if you happen to be on a SLURM cluster.
- There is --validation-max-exs for limiting the size of the dataset during validation. I don't usually use it (validation data will be split across the 8 GPUs, and I always train with --skip-generation true, so I find it's quite fast, even when validation is 100k examples).
- You can also use --dynamic-batching full to often get yourself a ~2.5x speedup; it groups batches into similarly sized sentences and grows the batch size to maximize memory use. This option doesn't play well with --validation-max-exs though.
- Of course, --fp16 true and having apex installed helps.
Oh, and --eval-batchsize is also an option, to pump up the batchsize during validation, since you don't need activations/gradients.
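Putting a few of those together, a rough sketch of what the multi-GPU training command might look like (the agent choice, model-file path, and batch sizes here are placeholders, so adjust to taste; I left out --dynamic-batching full since you want --validation-max-exs):
python -m parlai.scripts.multiprocessing_train -t blended_skill_talk \
-m transformer/generator -mf /tmp/bst_model \
--batchsize 16 --eval-batchsize 32 --skip-generation true --fp16 true \
--validation-every-n-epochs 0.25 --validation-max-exs 500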
This is good; all of this should be turned into a docs page. Thanks for asking these great questions.
Would be super useful.
The reason I couldn't figure this out was that
python examples/train_model.py --help | grep val
python examples/train_model.py --help | grep gpu
don't return the relevant options, because the stuff I was looking for isn't in the Training Arguments group, I guess.
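A workaround that surfaces more of them is to grep case-insensitively with some surrounding context (plain grep flags, nothing ParlAI-specific):
python examples/train_model.py --help | grep -iE -B2 -A2 'validation|gpu|batch'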
One of us could also just check in an opinionated single_gpu.opt and multigpu.opt with those defaults and then unit-test that they don't cause warnings or contain the word 'internal'. The blenderbot* defaults are OK, but they cause scary warnings, and 90M has internal:blended_skill_talk as the first task, which breaks. Also, both seem to have duplicated logic across model.opt, model.dict.opt, and the 'overrides' key, which makes it hard to know where to edit stuff.
Ah, the BlenderBot 3B and 9B models have --model-parallel true, which doesn't mix with multiprocessing_train, sorry. They're trained that way because they're just so big lol.
(Thinking about your proposal, it's interesting)