wait for master location timeout

Open · fanfanl opened this issue 3 years ago • 1 comment

21/11/02 15:31:29 INFO yarn.AngelYarnClient: ApplicationSubmissionContext Queuename : default
21/11/02 15:31:30 INFO impl.YarnClientImpl: Submitted application application_1634199305695_4691672
21/11/02 15:41:32 ERROR yarn.AngelYarnClient: submit application to yarn failed.
java.io.IOException: wait for master location timeout
    at com.tencent.angel.client.yarn.AngelYarnClient.updateMaster(AngelYarnClient.java:555)
    at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:168)
    at com.tencent.angel.ml.MLRunner$class.train(MLRunner.scala:47)
    at com.tencent.angel.ml.factorizationmachines.FMRunner.train(FMRunner.scala:26)
    at com.tencent.angel.ml.factorizationmachines.FMRunner.train(FMRunner.scala:32)
    at com.tencent.angel.ml.MLRunner$class.submit(MLRunner.scala:95)
    at com.tencent.angel.ml.factorizationmachines.FMRunner.submit(FMRunner.scala:26)
    at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:67)
    at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:54)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1726)
    at com.tencent.angel.utils.AngelRunJar.submit(AngelRunJar.java:54)
    at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:38)
21/11/02 15:41:32 INFO impl.YarnClientImpl: Killed application application_1634199305695_4691672
21/11/02 15:41:32 FATAL utils.AngelRunJar: submit job failed
com.tencent.angel.exception.AngelException: java.io.IOException: wait for master location timeout
    at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:174)
    at com.tencent.angel.ml.MLRunner$class.train(MLRunner.scala:47)
    at com.tencent.angel.ml.factorizationmachines.FMRunner.train(FMRunner.scala:26)
    at com.tencent.angel.ml.factorizationmachines.FMRunner.train(FMRunner.scala:32)
    at com.tencent.angel.ml.MLRunner$class.submit(MLRunner.scala:95)
    at com.tencent.angel.ml.factorizationmachines.FMRunner.submit(FMRunner.scala:26)
    at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:67)
    at com.tencent.angel.utils.AngelRunJar$1.run(AngelRunJar.java:54)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1726)
    at com.tencent.angel.utils.AngelRunJar.submit(AngelRunJar.java:54)
    at com.tencent.angel.utils.AngelRunJar.main(AngelRunJar.java:38)
Caused by: java.io.IOException: wait for master location timeout
    at com.tencent.angel.client.yarn.AngelYarnClient.updateMaster(AngelYarnClient.java:555)
    at com.tencent.angel.client.yarn.AngelYarnClient.startPSServer(AngelYarnClient.java:168)
    ... 12 more

Changing angel.am.appstate.timeout.ms makes no difference: the application is still killed automatically after waiting 10 minutes. The Angel version is 1.4.0.
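
For anyone reading along, the 10-minute figure can be checked directly from the log timestamps. The snippet below is only a small sketch: the two timestamps are copied from the log, while the constant and function names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the log above (format: yy/MM/dd HH:mm:ss).
SUBMITTED = "21/11/02 15:31:30"  # "Submitted application ..."
FAILED = "21/11/02 15:41:32"     # "submit application to yarn failed."


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    fmt = "%y/%m/%d %H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


waited = elapsed_seconds(SUBMITTED, FAILED)
print(f"client waited {waited:.0f} s ({waited / 60:.1f} min)")
# prints "client waited 602 s (10.0 min)"
```

So the client gave up roughly 10 minutes after submission, which matches the behaviour described in the report.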

fanfanl · Nov 02 '21 09:11

That means your queue itself does not have enough resources. Check whether your queue has any resources available.
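
One possible way to check the queue is the YARN ResourceManager REST endpoint `/ws/v1/cluster/scheduler`. The sketch below assumes the cluster uses the Capacity Scheduler and that the response carries the usual `queueName`, `usedCapacity`, `absoluteUsedCapacity` and `numPendingApplications` fields; the ResourceManager address, the helper names and the sample payload are hypothetical and not taken from this issue.

```python
import json
from urllib.request import urlopen


def fetch_scheduler(rm_url: str) -> dict:
    """Download scheduler info from a ResourceManager (rm_url is an assumption)."""
    with urlopen(f"{rm_url}/ws/v1/cluster/scheduler", timeout=10) as resp:
        return json.load(resp)


def iter_queues(node: dict):
    """Yield every queue nested under a scheduler-info or queue node."""
    for queue in (node.get("queues") or {}).get("queue", []):
        yield queue
        yield from iter_queues(queue)


def queue_usage(scheduler_json: dict, name: str):
    """Return a few usage fields for the named queue, or None if it is absent."""
    info = scheduler_json["scheduler"]["schedulerInfo"]
    for queue in iter_queues(info):
        if queue.get("queueName") == name:
            keys = ("usedCapacity", "absoluteUsedCapacity", "numPendingApplications")
            return {key: queue.get(key) for key in keys}
    return None


# Made-up payload for illustration only; these are not values from the reporter's cluster.
sample = {
    "scheduler": {
        "schedulerInfo": {
            "queueName": "root",
            "queues": {
                "queue": [
                    {
                        "queueName": "default",
                        "usedCapacity": 100.0,
                        "absoluteUsedCapacity": 100.0,
                        "numPendingApplications": 3,
                    }
                ]
            },
        }
    }
}

print(queue_usage(sample, "default"))
```

Against a real cluster you would pass the result of `fetch_scheduler("http://<resourcemanager-host>:<port>")` instead of `sample`. A queue that is at or near full usage with pending applications would be consistent with the ApplicationMaster never starting and the client timing out.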

ouyangwen-it · Nov 24 '21 10:11