PaddleCloud
Job is killed when num_passes is larger than 2
When I train my model locally, everything seems fine. After I submit my job to PaddleCloud, it is killed if num_passes is larger than 2 (num_passes is the parameter of the trainer.train function).
num_passes is 2: seems ok
num_passes is 3: job is killed after pass 2
num_passes is 4: job is killed after pass 3; the job is only killed after the last pass
Besides, the log also shows: Failed trainer count beyond the threadhold: 0. What does "trainer count" mean? Do I need to specify this parameter in paddle.init(), and if so, how?
Thank you so much!
num_passes is 4: job is killed after pass 3, job is only killed after the last pass
Usually this happens because the job exceeds the memory threshold specified by the -memory submit argument; please try increasing it.
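To get a rough idea of how much memory to request, one option is to record the peak memory of the local run after each pass. The following is only a minimal sketch using the Python standard library on a Unix-like system; `peak_memory_mb`, `run_passes`, and `train_one_pass` are hypothetical names for illustration, not part of Paddle or PaddleCloud, and in a real job the measurement would have to be called from wherever your training code can run something at the end of a pass.

```python
import resource
import sys


def peak_memory_mb():
    """Return the peak resident set size of this process in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_passes(train_one_pass, num_passes):
    """Call train_one_pass(pass_id) once per pass and record peak memory after each.

    train_one_pass is a hypothetical stand-in for one pass of training.
    """
    peaks = []
    for pass_id in range(num_passes):
        train_one_pass(pass_id)
        peaks.append(peak_memory_mb())
    return peaks


if __name__ == "__main__":
    # Dummy pass that allocates a little memory, just to exercise the helper.
    for i, mb in enumerate(run_passes(lambda pass_id: [0] * 100000, 3)):
        print("pass %d: peak memory so far %.1f MB" % (i, mb))
```

If the peak keeps growing from pass to pass, the job is likely to cross the memory limit at some point, which would be consistent with it being killed late in training.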
Failed trainer count beyond the threadhold: 0, what does the "trainer count"
The trainer count means the number of trainer nodes. This is a system log message; it means that the training job fails when the number of failed trainer nodes goes beyond the threshold (here, 0).
Also, you don't need to specify any parameters in paddle.init; just check why the trainer node failed.
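The threshold rule described above can be sketched as follows. This is only an illustration of the logic as explained in this thread, not PaddleCloud's actual implementation; the function name `job_should_fail` is hypothetical.

```python
def job_should_fail(failed_trainers, threshold=0):
    """Return True when the number of failed trainer nodes goes beyond the threshold.

    Assumed rule, taken from the explanation in this thread: the job fails
    as soon as the failed trainer count is greater than the threshold.
    """
    return failed_trainers > threshold


if __name__ == "__main__":
    # With the threshold of 0 shown in the log, one failed trainer fails the job.
    print(job_should_fail(0))  # no trainer has failed
    print(job_should_fail(1))  # one trainer has failed
```

With a threshold of 0, a single trainer node being killed (for example, for exceeding its memory limit) is enough to fail the whole job.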