luoyeguigen001

Results 9 comments of luoyeguigen001

Could reading and writing HDFS from multiple nodes cause problems?

21/11/24 17:14:47 INFO RunningContext: =====================Server running context start=======================
21/11/24 17:14:47 INFO RunningContext: state = IDLE
21/11/24 17:14:47 INFO RunningContext: totalRunningRPCCounter = 1
21/11/24 17:14:47 INFO RunningContext: infligtingRPCCounter = 0
21/11/24...

It did not succeed. I adjusted some of the parameters:
source ./bin/spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --conf spark.ps.instances=8 \
  --conf spark.ps.cores=4 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=10g \
  --jars $SONA_SPARK_JARS \
  --driver-memory 30g \
  --num-executors 8 \
  --verbose...

I just ran it and it failed again, but with a different error:
21/11/25 18:22:54 WARN KafkaClient: callback,kafka send metrics failed, org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for bethune_jobmetrics_test-68:280336 ms has passed since batch creation
21/11/25 18:22:54 ERROR MatrixTransportClient: Request Request{hashCode=7229, header=RequestHeader{clientId=0, token=0, userRequestId=91,...

The launch script that produced the error above is:
source ./bin/spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --conf spark.ps.instances=8 \
  --conf spark.ps.cores=4 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=10g \
  --jars $SONA_SPARK_JARS \
  --driver-memory 30g \
  --num-executors 8 \
  --verbose \...

The script for the scaled-down test run:
source ./bin/spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --conf spark.ps.instances=8 \
  --conf spark.ps.cores=4 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=12g \
  --jars $SONA_SPARK_JARS \
  --driver-memory 40g \
  --num-executors 8 \
  --verbose \...

Is it this one? I already posted it above:
ERROR MatrixTransportClient: Request Request{hashCode=8803, header=RequestHeader{clientId=0, token=0, userRequestId=111, seqId=8833, methodId=7, matrixId=1, partId=2, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@61e96c0b} to PS ParameterServer_2 not return result over 120000 milliseconds
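
To see whether these timeouts cluster on one parameter server or one matrix partition, the driver log can be tallied with a small script. This is only a minimal sketch: it assumes every timeout line has exactly the format of the MatrixTransportClient error quoted above, and the names `TIMEOUT_RE` and `count_timeouts` are made up for illustration and are not part of Angel.

```python
import re
from collections import Counter

# Pattern inferred from the single ERROR line quoted in this thread (an assumption);
# other Angel versions may format the message differently.
TIMEOUT_RE = re.compile(
    r"ERROR MatrixTransportClient: Request Request\{.*?"
    r"methodId=(?P<method>\d+), matrixId=(?P<matrix>\d+), partId=(?P<part>\d+)"
    r".*? to PS (?P<ps>\S+) not return result over (?P<ms>\d+) milliseconds"
)


def count_timeouts(lines):
    """Count timed-out requests per (PS name, matrixId, partId)."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            counts[(m["ps"], int(m["matrix"]), int(m["part"]))] += 1
    return counts


# The log line from the comment above, used as sample input.
SAMPLE = (
    "ERROR MatrixTransportClient: Request Request{hashCode=8803, "
    "header=RequestHeader{clientId=0, token=0, userRequestId=111, seqId=8833, "
    "methodId=7, matrixId=1, partId=2, handleElemNum=0}, "
    "context=com.tencent.angel.ps.server.data.request.RequestContext@61e96c0b} "
    "to PS ParameterServer_2 not return result over 120000 milliseconds"
)

if __name__ == "__main__":
    for key, n in count_timeouts([SAMPLE]).most_common():
        print(key, n)
```

If most entries point at the same PS, that server's own log is the next place to look; if they are spread evenly, the cause is more likely shared across servers, such as memory or network pressure.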

Is it the log below?
21/11/26 09:11:51 INFO TaskSetManager: Finished task 45.0 in stage 0.0 (TID 45) in 17699 ms on (executor 7) (42/56)
21/11/26 09:11:51 INFO TaskSetManager: Finished task 33.0 in stage 0.0...

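I added logging and found that the job never reaches the training step: the exception is thrown right at model.randomInitialize, while the embedding is being initialized.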