
pipeline/train_fleet_pipeline.py hangs after one epoch finishes

Open LukeLIN-web opened this issue 2 years ago • 0 comments

https://github.com/PaddlePaddle/FleetX/blob/2817f4f641c5960f7a507a080ae48b642e0d8a45/examples/pipeline/train_fleet_pipeline.py#L99

I ran: python -m paddle.distributed.launch --gpus="0,1,2,3" train_fleet_pipeline.py

[Epoch 0, batch 5] loss: 48.88827, acc1: 0.06250, acc5: 0.09375
[Epoch 0, batch 10] loss: 8.77979, acc1: 0.00000, acc5: 0.06250
[Epoch 0, batch 15] loss: 19.01006, acc1: 0.00000, acc5: 0.12500
[Epoch 0, batch 20] loss: 7.24177, acc1: 0.00000, acc5: 0.03125

After one epoch finishes, training hangs. It appears to be stuck at:

    if fleet.worker_index() == 3:
        loss, acc1, acc5 = exe.run(paddle.static.default_main_program(), fetch_list=[avg_cost, acc_top1, acc_top5])
    else:
        exe.run(paddle.static.default_main_program())

Environment: Paddle 2.3.0, CUDA 10.2, Ubuntu x86_64 GNU/Linux
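
To confirm which line each rank is actually stuck on, one option is Python's standard `faulthandler` module, which can dump every thread's stack if the process is still running after a timeout. The sketch below is not part of the FleetX example: `arm_hang_watchdog`, `disarm_hang_watchdog`, and the `hang_rank<N>.log` file name are hypothetical helpers chosen for illustration, and the 300-second default is an arbitrary assumption.

```python
import faulthandler
import os


def arm_hang_watchdog(rank, timeout_s=300.0, log_dir="."):
    """Schedule a dump of all Python thread stacks to a per-rank file.

    The dump fires only if the process is still alive `timeout_s` seconds
    from now and the watchdog has not been disarmed in the meantime.
    `rank` is only used to name the log file (hypothetical naming scheme).
    """
    path = os.path.join(log_dir, f"hang_rank{rank}.log")
    # The file must stay open until the dump fires or is cancelled.
    log_file = open(path, "w")
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=log_file, exit=False
    )
    return log_file, path


def disarm_hang_watchdog(log_file):
    """Cancel the pending dump and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

In the training script this could wrap each step, for example calling `arm_hang_watchdog(fleet.worker_index())` before `exe.run(...)` and `disarm_hang_watchdog(log_file)` after it returns. The dump lists Python frames only, so a hang inside native code would show up as the `exe.run` line itself; comparing the four per-rank files would show whether all ranks are waiting in the same call.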

LukeLIN-web, Jul 01 '22 14:07