
run_glue.py: inconsistent train/eval metrics with and without PiPPy

Open pbelevich opened this issue 2 years ago • 1 comments

The original run_glue.py:

***** train metrics *****
  epoch                    =        3.0
  train_loss               =     0.4244
  train_runtime            = 0:20:44.02
  train_samples            =       3668
  train_samples_per_second =      8.846
  train_steps_per_second   =      0.277
***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.8382
  eval_combined_score     =      0.862
  eval_f1                 =     0.8858
  eval_loss               =      0.412
  eval_runtime            = 0:00:15.68
  eval_samples            =        408
  eval_samples_per_second =     26.015
  eval_steps_per_second   =      3.252

run_glue.py, no splits, 1-stage pipe with backward:

***** train metrics *****
  epoch                    =        3.0
  train_loss               =     1.1115
  train_runtime            = 0:29:12.92
  train_samples            =       3668
  train_samples_per_second =      6.277
  train_steps_per_second   =      0.197
***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.3162
  eval_combined_score     =     0.1581
  eval_f1                 =        0.0
  eval_loss               =      1.115
  eval_runtime            = 0:00:14.51
  eval_samples            =        408
  eval_samples_per_second =     28.109
  eval_steps_per_second   =      3.514

no splits, 1-stage pipe without backward:

***** train metrics *****
  epoch                    =        3.0
  train_loss               =     1.1115
  train_runtime            = 0:29:58.23
  train_samples            =       3668
  train_samples_per_second =      6.119
  train_steps_per_second   =      0.192
***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.3162
  eval_combined_score     =     0.1581
  eval_f1                 =        0.0
  eval_loss               =      1.115
  eval_runtime            = 0:00:15.75
  eval_samples            =        408
  eval_samples_per_second =     25.889
  eval_steps_per_second   =      3.236

8-stage pipe without backward:

***** train metrics *****
  epoch                    =        3.0
  train_loss               =     8.8978
  train_runtime            = 0:20:36.90
  train_samples            =       3668
  train_samples_per_second =      8.896
  train_steps_per_second   =      0.279
***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.3162
  eval_combined_score     =     0.1581
  eval_f1                 =        0.0
  eval_loss               =     8.9199
  eval_runtime            = 0:00:16.97
  eval_samples            =        408
  eval_samples_per_second =     24.031
  eval_steps_per_second   =      3.004

pbelevich avatar Aug 16 '22 18:08 pbelevich

It seems that rpc_async in PiPPy is going wrong. I have filed an issue about this: https://github.com/pytorch/pytorch/issues/83243

In my development environment, rpc_async().wait() may return zero when the model is huge or the calls are too frequent.
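If awaited results can silently come back as zero, one defensive workaround is to validate (and optionally retry) every result instead of trusting the wait. The sketch below is not PiPPy or torch.distributed.rpc code; it models the idea with stdlib futures, and `remote_forward` is a hypothetical stand-in for the real remote stage call:

```python
# Hedged sketch: validate-and-retry around an awaited async result.
# `remote_forward` is a made-up stand-in for a remote pipeline-stage call;
# in the real setting the future would come from rpc_async(...).
from concurrent.futures import ThreadPoolExecutor


def remote_forward(x):
    # Stand-in for a remote forward pass on a pipeline stage.
    return x * 2


def wait_checked(future, retry_fn, retries=3):
    """Wait on an async result; if it is falsy (e.g. an unexpected zero),
    reissue the call up to `retries` times before failing loudly."""
    result = future.result()
    while not result and retries > 0:
        result = retry_fn()
        retries -= 1
    if not result:
        raise RuntimeError("async call kept returning zero")
    return result


with ThreadPoolExecutor(max_workers=4) as pool:
    fut = pool.submit(remote_forward, 21)
    out = wait_checked(fut, retry_fn=lambda: remote_forward(21))
```

Failing loudly rather than training on zeroed activations would at least turn the silent metric corruption above into a visible error.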

LSTM-Kirigaya avatar Aug 23 '22 08:08 LSTM-Kirigaya