
Job is killed when num_passes is larger than 2

Open christianahui opened this issue 7 years ago • 1 comments

When I train my model locally, everything seems fine. After I submit the job to PaddleCloud, it is killed if num_passes is larger than 2 (num_passes is the parameter of the trainer.train function).

num_passes is 2: seems OK
num_passes is 3: job is killed after pass 2
num_passes is 4: job is killed after pass 3

In each case the job is killed only after the last pass.

Besides, the log also shows: Failed trainer count beyond the threadhold: 0. What does "trainer count" mean? Do I need to specify this parameter in paddle.init(), and if so, how? Thank you so much!

christianahui, Feb 07 '18 03:02

num_passes is 4: job is killed after pass 3, job is only killed after the last pass

Usually, this happens because the job exceeds the memory threshold specified by the -memory argument at submission time. Please try increasing it.
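To check whether memory really grows from pass to pass before raising the limit, a small helper like the one below could be called at the end of each pass. This is only a sketch using the Python standard library: the names peak_rss_mb and log_memory are hypothetical, and wiring it into the training script (for example through an end-of-pass event handler) is an assumption about how that script is structured.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


def log_memory(pass_id):
    """Print and return a peak-memory line; intended to run after each pass."""
    message = "pass %d: peak RSS %.1f MiB" % (pass_id, peak_rss_mb())
    print(message)
    return message
```

If the reported peak keeps climbing toward the value given to -memory, that would be consistent with the job being killed for exceeding its memory limit.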

Failed trainer count beyond the threadhold: 0, what does "trainer count" mean?

The trainer count is the number of trainer nodes. This is a system log message: it means the training job fails when the number of failed trainer nodes exceeds the threshold (0 here).
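The rule described above can be sketched as a simple comparison. This is an illustration of the stated behavior only, not PaddleCloud's actual implementation, and the function name job_should_fail is hypothetical.

```python
def job_should_fail(failed_trainers, threshold=0):
    """Return True when the number of failed trainer nodes exceeds the threshold."""
    # With the threshold of 0 shown in the log, a single failed trainer fails the job.
    return failed_trainers > threshold
```

With the threshold of 0 from the log, job_should_fail(1) is True, which is why one trainer being killed is enough to fail the whole job.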

Also, you don't need to specify any parameters in paddle.init; just find out why the trainer node failed.

Yancey1989, Feb 07 '18 06:02