run_glue.py: inconsistent train/eval metrics with and without PiPPy
The original run_glue.py (without PiPPy):
***** train metrics *****
epoch = 3.0
train_loss = 0.4244
train_runtime = 0:20:44.02
train_samples = 3668
train_samples_per_second = 8.846
train_steps_per_second = 0.277
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.8382
eval_combined_score = 0.862
eval_f1 = 0.8858
eval_loss = 0.412
eval_runtime = 0:00:15.68
eval_samples = 408
eval_samples_per_second = 26.015
eval_steps_per_second = 3.252
run_glue.py with PiPPy, no splits (1-stage pipe), with backward:
***** train metrics *****
epoch = 3.0
train_loss = 1.1115
train_runtime = 0:29:12.92
train_samples = 3668
train_samples_per_second = 6.277
train_steps_per_second = 0.197
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.3162
eval_combined_score = 0.1581
eval_f1 = 0.0
eval_loss = 1.115
eval_runtime = 0:00:14.51
eval_samples = 408
eval_samples_per_second = 28.109
eval_steps_per_second = 3.514
run_glue.py with PiPPy, no splits (1-stage pipe), without backward:
***** train metrics *****
epoch = 3.0
train_loss = 1.1115
train_runtime = 0:29:58.23
train_samples = 3668
train_samples_per_second = 6.119
train_steps_per_second = 0.192
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.3162
eval_combined_score = 0.1581
eval_f1 = 0.0
eval_loss = 1.115
eval_runtime = 0:00:15.75
eval_samples = 408
eval_samples_per_second = 25.889
eval_steps_per_second = 3.236
run_glue.py with PiPPy, 8-stage pipe, without backward:
***** train metrics *****
epoch = 3.0
train_loss = 8.8978
train_runtime = 0:20:36.90
train_samples = 3668
train_samples_per_second = 8.896
train_steps_per_second = 0.279
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.3162
eval_combined_score = 0.1581
eval_f1 = 0.0
eval_loss = 8.9199
eval_runtime = 0:00:16.97
eval_samples = 408
eval_samples_per_second = 24.031
eval_steps_per_second = 3.004
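For context, "no splits (1-stage pipe)" and "8-stage pipe" above refer to how many pipeline stages the model is traced into. Below is a rough sketch of the difference, assuming the pippy.IR tracing API (Pipe.from_tracing / annotate_split_points / PipeSplitWrapper); the BERT checkpoint, submodule names, and split positions are illustrative, not the exact ones from my runs:

```python
from pippy.IR import Pipe, PipeSplitWrapper, annotate_split_points
from transformers import AutoModelForSequenceClassification

# Illustrative model; run_glue.py builds its model from the script arguments.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# "No splits, 1-stage pipe": trace the whole model into a single pipeline stage.
single_stage_pipe = Pipe.from_tracing(model)

# "8-stage pipe": annotate 7 split points so tracing cuts the model into 8 stages.
# The layer indices below assume a bert-base style 12-layer encoder and are
# illustrative only.
split_layers = [2, 3, 5, 6, 8, 9, 11]  # 7 split points -> 8 stages
annotate_split_points(
    model,
    {f"bert.encoder.layer.{i}": PipeSplitWrapper.SplitPoint.BEGINNING
     for i in split_layers},
)
eight_stage_pipe = Pipe.from_tracing(model)
```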
It seems that the rpc_async calls PiPPy relies on are going wrong. I have filed an issue about this: https://github.com/pytorch/pytorch/issues/83243
In my development environment, rpc_async().wait() may return zeros when the model is huge or the calls are too frequent.
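For reference, here is a minimal sketch of how that symptom could be checked with plain torch.distributed.rpc, independent of PiPPy, assuming the problem is rpc_async().wait() handing back an all-zero tensor; the worker names, tensor size, and the echo helper are only illustrative and not my actual setup:

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp


def echo(t):
    # Remote side simply returns the tensor it received.
    return t


def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)

    if rank == 0:
        big = torch.randn(64, 1024, 1024)  # a "huge" payload (~256 MB)
        for i in range(100):               # frequent calls
            fut = rpc.rpc_async("worker1", echo, args=(big,))
            out = fut.wait()
            # Flag the case where the result comes back as all zeros.
            if torch.count_nonzero(out) == 0:
                print(f"call {i}: rpc_async().wait() returned an all-zero tensor")

    rpc.shutdown()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```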