Ouyang Wen

26 comments of Ouyang Wen

Take a look at the PS GC log page: are there frequent full GC entries?
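One quick way to check for frequent full GCs is to count "Full GC" entries in the PS GC log. The log path and contents below are made-up samples for illustration, not output from this thread:

```shell
# Illustrative sample of a PS GC log; the path and contents are made up.
cat > /tmp/ps-gc.log <<'EOF'
[GC (Allocation Failure) 8000M->6000M(10240M), 0.12 secs]
[Full GC (Allocation Failure) 9800M->9700M(10240M), 4.21 secs]
[Full GC (Allocation Failure) 9900M->9800M(10240M), 5.03 secs]
EOF
# Many "Full GC" entries that reclaim little memory point to PS heap pressure.
grep -c "Full GC" /tmp/ps-gc.log   # → 2
```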

> 21/11/24 17:14:47 INFO RunningContext: =====================Server running context start=======================
> 21/11/24 17:14:47 INFO RunningContext: state = IDLE
> 21/11/24 17:14:47 INFO RunningContext: totalRunningRPCCounter = 1
> 21/11/24 17:14:47 INFO RunningContext: infligtingRPCCounter = 0...

> Use the `angel.netty.matrixtransfer.max.message.size` parameter like this: `--conf spark.hadoop.angel.netty.matrixtransfer.max.message.size=`
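The pattern here is to prefix the Angel-side property with `spark.hadoop.` so that spark-submit forwards it to the PS. A sketch only; the 1073741824 (1 GiB) value is taken from the submit script quoted later in this thread, and the trailing options are elided:

```shell
# Sketch: forward an Angel property to the PS via the spark.hadoop. prefix.
# 1073741824 bytes (1 GiB) mirrors the submit script quoted in this thread.
spark-submit \
  --master yarn-cluster \
  --conf spark.hadoop.angel.netty.matrixtransfer.max.message.size=1073741824 \
  "$@"   # remaining jars/resources/application options omitted
```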

![image](https://user-images.githubusercontent.com/13943417/143410841-4d8db633-9524-422b-ba8c-6f6516504ec0.png) Also, reduce the number of PS partitions to 80 and the batch size to 50, then resubmit.

> The startup script that produced the error above is:
> `source ./bin/spark-on-angel-env.sh`
> `$SPARK_HOME/bin/spark-submit --master yarn-cluster --conf spark.ps.instances=8 --conf spark.ps.cores=4 --conf spark.ps.jars=$SONA_ANGEL_JARS --conf spark.ps.memory=10g --jars $SONA_SPARK_JARS --driver-memory 30g --num-executors 8 --verbose --executor-cores 4 --executor-memory 15g --conf spark.default.parallelism=5000 --conf spark.hadoop.angel.netty.matrixtransfer.max.message.size=1073741824...`

![image](https://user-images.githubusercontent.com/13943417/143520011-31363afd-d809-4d59-856d-ff9af7d7d4bc.png) Check the Angel master log for error messages; it looks like it timed out and died.
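One way to scan a master log for such failures, sketched against a made-up sample (the log path, contents, and the heartbeat-timeout message format below are all assumptions, not actual Angel output):

```shell
# Made-up sample of an Angel master log; not actual output from this thread.
cat > /tmp/angel-master.log <<'EOF'
21/11/25 10:00:00 INFO MasterService: heartbeat from ps-0
21/11/25 10:05:00 ERROR MasterService: ps-1 heartbeat timeout, removing ps-1
EOF
# Surface error/timeout lines, which usually explain why a PS was killed.
grep -i -E "error|timeout" /tmp/angel-master.log
```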

Also check the Spark executor logs to see how long one batch pull takes.

> Is it the log below?
> 21/11/26 09:11:51 INFO TaskSetManager: Finished task 45.0 in stage 0.0 (TID 45) in 17699 ms on (executor 7) (42/56)
> 21/11/26 09:11:51 INFO TaskSetManager: Finished task 33.0 in stage...
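Per-task durations (and hence the per-batch pull time being asked about) can be extracted from those `TaskSetManager` lines. The sample below is reconstructed for illustration; the second line's details past the truncation point are made up:

```shell
# Sample log reconstructed for illustration; the second line's tail is made up.
cat > /tmp/executor.log <<'EOF'
21/11/26 09:11:51 INFO TaskSetManager: Finished task 45.0 in stage 0.0 (TID 45) in 17699 ms on (executor 7) (42/56)
21/11/26 09:11:51 INFO TaskSetManager: Finished task 33.0 in stage 0.0 (TID 33) in 18321 ms on (executor 3) (43/56)
EOF
# Extract the per-task duration in milliseconds from each "Finished task" line.
grep -o 'in [0-9]* ms' /tmp/executor.log | awk '{print $2}'
```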

You should use the spark-submit options `--principal` and `--keytab` when using SONA.
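A minimal sketch of a Kerberized SONA submission; the principal and keytab path below are placeholders, and the SONA jar variables follow the script quoted earlier in the thread:

```shell
# Placeholders: user@EXAMPLE.COM and /path/to/user.keytab are not from the thread.
spark-submit \
  --master yarn-cluster \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --jars $SONA_SPARK_JARS \
  "$@"   # application jar and remaining options go here
```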