PaddleFleetX
pipeline/train_fleet_pipeline.py hangs after one epoch finishes
https://github.com/PaddlePaddle/FleetX/blob/2817f4f641c5960f7a507a080ae48b642e0d8a45/examples/pipeline/train_fleet_pipeline.py#L99
I ran: python -m paddle.distributed.launch --gpus="0,1,2,3" train_fleet_pipeline.py
[Epoch 0, batch 5] loss: 48.88827, acc1: 0.06250, acc5: 0.09375
[Epoch 0, batch 10] loss: 8.77979, acc1: 0.00000, acc5: 0.06250
[Epoch 0, batch 15] loss: 19.01006, acc1: 0.00000, acc5: 0.12500
[Epoch 0, batch 20] loss: 7.24177, acc1: 0.00000, acc5: 0.03125
After one epoch finishes, the job hangs. The hang is probably at:
if fleet.worker_index() == 3:
    loss, acc1, acc5 = exe.run(paddle.static.default_main_program(), fetch_list=[avg_cost, acc_top1, acc_top5])
else:
    exe.run(paddle.static.default_main_program())
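
To confirm that guess rather than rely on it, one option is to have each worker dump its Python stack when a step takes too long. The sketch below uses only the standard-library faulthandler module. The helper names arm_hang_watchdog and disarm_hang_watchdog, the 120-second timeout, and the log file name are my own assumptions for illustration and are not part of FleetX or Paddle.

```python
import faulthandler


def arm_hang_watchdog(timeout_s, log_path):
    """Dump all Python thread stacks to log_path if timeout_s elapses before disarm."""
    log_file = open(log_path, "w")
    # repeat=True keeps dumping every timeout_s seconds while the process stays stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    # The caller must keep this file object alive until the watchdog is disarmed.
    return log_file


def disarm_hang_watchdog(log_file):
    """Cancel the pending dump and close the log once the step has returned."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()


# Hypothetical use around the step quoted above (timeout and file name are assumptions):
#   log = arm_hang_watchdog(120, "hang_rank%d.log" % fleet.worker_index())
#   exe.run(paddle.static.default_main_program())
#   disarm_hang_watchdog(log)
```

If the job hangs, each worker's log should show the Python line it is blocked on. The dump covers Python frames only, so it can show which exe.run call never returned but not what happens inside the executor.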
env:
paddle 2.3.0
cuda 10.2
Ubuntu x86_64 GNU/Linux