Error: NoClassDefFoundError: com/tencent/angel/spark/ml/graph/params/HasBatchSize

Open fy88fy opened this issue 4 years ago • 17 comments

Hello, I have a question. I am testing PyTorch-On-Angel and the job fails on submission. The class I can find in the code is com.tencent.angel.graph.utils.params.HasBatchSize. What could be causing this?

param mode = yarn-client
Exception in thread "main" java.lang.NoClassDefFoundError: com/tencent/angel/spark/ml/graph/params/HasBatchSize
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at com.tencent.angel.pytorch.examples.supervised.RecommendationExample$.main(RecommendationExample.scala:53)
	at com.tencent.angel.pytorch.examples.supervised.RecommendationExample.main(RecommendationExample.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.tencent.angel.spark.ml.graph.params.HasBatchSize
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 25 more
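
A class that is missing at runtime under a different package path than the one in the source tree may point to a version mismatch between the PyTorch-On-Angel jar and the Angel jars on the classpath. One way to check is to list which jars actually contain the class. The sketch below is only an illustration: the function names are hypothetical, `unzip` must be installed, and the jar list is assumed to be comma-separated, as `--jars` expects.

```shell
# Convert a fully qualified class name to its entry path inside a jar.
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Succeed if the jar at $1 contains the class named $2 (requires unzip).
jar_has_class() {
  unzip -l "$1" 2>/dev/null | grep -qF -- "$(class_to_entry "$2")"
}

# Print every jar in the comma-separated list $1 that contains class $2.
jars_with_class() {
  old_ifs=$IFS
  IFS=,
  for jar in $1; do
    if jar_has_class "$jar" "$2"; then
      printf '%s\n' "$jar"
    fi
  done
  IFS=$old_ifs
}

# Example (run after sourcing spark-on-angel-env.sh):
# jars_with_class "$SONA_ANGEL_JARS" com.tencent.angel.spark.ml.graph.params.HasBatchSize
```

If nothing is printed for the class named in the exception, none of the jars on that list provides it.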

fy88fy avatar Dec 23 '20 10:12 fy88fy

Which branch are you using, and which Spark version? Try the latest 0.2.1 branch.

ouyangwen-it avatar Dec 30 '20 09:12 ouyangwen-it

PyTorch-On-Angel is built from master, and Spark is 2.4.5U2. We are running in a CPU-only environment.

dongxuej avatar Dec 30 '20 09:12 dongxuej

Use the 0.2.1 branch, with Angel 3.1.0 as the Angel environment.

ouyangwen-it avatar Dec 30 '20 09:12 ouyangwen-it

Angel is also on master right now. That should be 3.1.0, right? Or do I have to download branch3.1.0 specifically?

dongxuej avatar Dec 30 '20 09:12 dongxuej

master is fine.

ouyangwen-it avatar Dec 30 '20 10:12 ouyangwen-it

The ClassNotFound problem is solved. Now I have run into a new problem:

Exception in thread "main" java.lang.UnsatisfiedLinkError: no torch_angel in java.library.path

The parameters I passed are as follows:

input=hdfs://jinrong-hadoop3-1v/home/hdp-jinrong-stargraph/fanqizha/subgraph/input/20191231/
output=hdfs://jinrong-hadoop3-1v/home/hdp-jinrong-stargraph/jiadongxue/angel/model/20191231_deepfm/
source ./spark-on-angel-env.sh
echo "------------------"
#JAVA_LIBRARY_PATH=/home/work/software/java/lib
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib
echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
       --conf spark.ps.instances=5 \
       --conf spark.ps.cores=1 \
       --conf spark.ps.jars=$SONA_ANGEL_JARS \
       --conf spark.ps.memory=5g \
       --conf spark.ps.log.level=INFO \
       --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
       --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
       --conf spark.executor.extraLibraryPath=./torch/lib \
       --conf spark.driver.extraLibraryPath=./torch/lib \
       --conf spark.executorEnv.OMP_NUM_THREADS=2 \
       --conf spark.executorEnv.MKL_NUM_THREADS=2 \
       --conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
       --jars $SONA_SPARK_JARS \
       --name "deepfm for torch on angel" \
       --archives /home/work/software/angel/bin/torchlib.zip#torch \
       --files /home/work/software/angel/bin/deepfm.pt \
       --driver-memory 5g \
       --num-executors 5 \
       --executor-cores 1 \
       --executor-memory 5g \
       --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
       ./pytorch-on-angel-0.2.1.jar \
       trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
       stepSize:0.001 numEpoch:10 testRatio:0.1 \
       angelModelOutputPath:$output mode:yarn-client

The error is as follows: image

dongxuej avatar Jan 04 '21 08:01 dongxuej

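The directory produced by unpacking your dependency archive probably does not match the paths you configured. Spark's --archives option unpacks the archive from HDFS into the executor's working directory, under a directory named after the alias that follows the # sign. The resulting layout should be ./torch/(the directory structure inside your archive). Note that --archives takes an HDFS path.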
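
On Linux, `System.loadLibrary("torch_angel")` searches each directory in `java.library.path` for a shared library named `libtorch_angel.so`, so a quick check is to walk the same path by hand. The sketch below does that. `find_native_lib` is a hypothetical helper name, and the path passed to it should be whatever was set through `-Djava.library.path`. Relative entries such as `./torch/lib` are resolved against the current working directory, so run it from the directory where the archive was unpacked.

```shell
# Print the first directory in the colon-separated path $1 that contains
# lib$2.so; return non-zero if no directory has it.
find_native_lib() {
  old_ifs=$IFS
  IFS=:
  for dir in $1; do
    if [ -f "$dir/lib$2.so" ]; then
      IFS=$old_ifs
      printf '%s\n' "$dir"
      return 0
    fi
  done
  IFS=$old_ifs
  return 1
}

# Example:
# find_native_lib "$JAVA_LIBRARY_PATH:.:./torch/lib" torch_angel \
#   || echo "libtorch_angel.so is not on the library path"
```

A directory holding only `.a` static archives would not satisfy this check, because the JVM loads shared libraries.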

ouyangwen-it avatar Jan 04 '21 08:01 ouyangwen-it

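Hello, I now have a problem where the job cannot be submitted to YARN: it cannot find the deepfm.pt file on HDFS. Could you please take a look? The script is configured as follows: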

#!/bin/bash
input=hdfs://xxxx-1v/home/yarn/pytorch-on-angel/census_148d_train.libsvm.tmp
output=hdfs://xxxx-1v/home/hdp/jia/angel/model/20191231_louvain/
source ./spark-on-angel-env.sh
echo "------------------"
#JAVA_LIBRARY_PATH=/home/work/software/java/lib
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib
echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
       --conf spark.ps.instances=5 \
       --conf spark.ps.cores=1 \
       --conf spark.ps.jars=$SONA_ANGEL_JARS \
       --conf spark.ps.memory=5g \
       --conf spark.ps.log.level=INFO \
       --archives hdfs://XXXX-hadoop3-1v/home/yarn/pytorch-on-angel/torchlib.zip#torch \
       --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
       --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
       --conf spark.executor.extraLibraryPath=./torch/lib \
       --conf spark.driver.extraLibraryPath=./torch/lib \
       --conf spark.executorEnv.OMP_NUM_THREADS=2 \
       --conf spark.executorEnv.MKL_NUM_THREADS=2 \
       --conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.hadoop.fs.defaultFS=hdfs://xxxx-hadoop3-1v/ \
       --jars $SONA_SPARK_JARS  \
       --name "deepfm for torch on angel" \
       --files deepfm.pt \
       --driver-memory 5g \
       --num-executors 5 \
       --executor-cores 1 \
       --executor-memory 5g \
       --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
       ./pytorch-on-angel-0.2.1.jar\
       trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
       stepSize:0.001 numEpoch:10 testRatio:0.1 \
       angelModelOutputPath:$output mode:yarn-client \

image

fy88fy avatar Jan 04 '21 10:01 fy88fy

Could you try submitting the Spark job in yarn-cluster mode?

ouyangwen-it avatar Jan 04 '21 12:01 ouyangwen-it

After switching to yarn-cluster, I get the following error: image

image

fy88fy avatar Jan 05 '21 07:01 fy88fy

What does the directory structure of your torchlib.zip look like once it is unpacked?

ouyangwen-it avatar Jan 05 '21 09:01 ouyangwen-it

torchlib.zip unpacks to a lib directory, and lib contains many .a files. image

fy88fy avatar Jan 05 '21 09:01 fy88fy

Could you print the current working directory inside RecommendationExample and check whether torch/lib is there?

ouyangwen-it avatar Jan 06 '21 02:01 ouyangwen-it

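Hello, in the end I configured the environment on every node in the cluster, and yarn-client mode now works. Occasionally, though, I get the error below. What causes it?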

21/01/13 13:07:37 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:38 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:39 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:40 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:41 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:42 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:43 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:44 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:45 INFO Client: Application report for application_1609301285435_0588 (state: RUNNING)
21/01/13 13:07:45 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.x.121.219
	 ApplicationMaster RPC port: -1
	 queue: root.default
	 start time: 1610514454704
	 final status: UNDEFINED
	 tracking URL: http://xxx.net:8888/proxy/application_1609301285435_0588/
	 user: user-2001
numDataPartitions=7500
numDataPartitions=7500
type: BIAS_WEIGHT_EMBEDDING_MATS name:DeepFM mats_dims: 130,10,1,1,10,5,1,1,5,1,1,1
optimizer: AsyncAdam eta=0.001 decay=0.001
from driver start Angel PS! 
AppMaster capability = <memory:2048, vCores:1, gCores:0>
validate_auc=0.8820555586167144 time=12161ms                                     
train_auc=0.8922190567461854 validate_auc=0.8954355298794986 time=3696ms        
train_auc=0.9007353906278361 validate_auc=0.9038945708702969 time=2840ms        
train_auc=0.9039252388499588 validate_auc=0.9064111964938902 time=2716ms        
train_auc=0.9068299665301356 validate_auc=0.9090327256697933 time=2602ms        
train_auc=0.9075874962681636 validate_auc=0.909377106018285 time=2575ms         
train_auc=0.9093007058840281 validate_auc=0.9113024570743141 time=2535ms        
train_auc=0.9108590019247338 validate_auc=0.9131101889959353 time=2603ms        
train_auc=0.9103151686457647 validate_auc=0.9114020624674163 time=2616ms        
train_auc=0.9114118253731684 validate_auc=0.9126884555230131 time=2597ms        
Exception in thread "main" com.tencent.angel.exception.AngelException: Model save falied Detail failed log:
ParameterServer_0:save task com.tencent.angel.model.PSMatricesSaveContext@6fc796e8 failed:java.io.IOException: java.io.IOException: SAVE matrix failed:java.io.IOException: SAVE model partitions failed.null
	at com.tencent.angel.model.output.format.MatrixFormatImpl.save(MatrixFormatImpl.java:135)
	at com.tencent.angel.ps.io.PSModelIOExecutor.saveMatrix(PSModelIOExecutor.java:237)
	at com.tencent.angel.ps.io.PSModelIOExecutor.process(PSModelIOExecutor.java:216)
	at com.tencent.angel.ps.io.PSModelIOExecutor.access$000(PSModelIOExecutor.java:39)
	at com.tencent.angel.ps.io.PSModelIOExecutor$MatrixDiskIOOp.compute(PSModelIOExecutor.java:195)
	at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

	at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:154)
	at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:121)
	at com.tencent.angel.ps.io.save.PSModelSaver.lambda$save$9(PSModelSaver.java:184)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: SAVE matrix failed:java.io.IOException: SAVE model partitions failed.null
	at com.tencent.angel.model.output.format.MatrixFormatImpl.save(MatrixFormatImpl.java:135)
	at com.tencent.angel.ps.io.PSModelIOExecutor.saveMatrix(PSModelIOExecutor.java:237)
	at com.tencent.angel.ps.io.PSModelIOExecutor.process(PSModelIOExecutor.java:216)
	at com.tencent.angel.ps.io.PSModelIOExecutor.access$000(PSModelIOExecutor.java:39)
	at com.tencent.angel.ps.io.PSModelIOExecutor$MatrixDiskIOOp.compute(PSModelIOExecutor.java:195)
	at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

	at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:151)
	... 3 more


	at com.tencent.angel.client.AngelClient.isSaveCompleted(AngelClient.java:670)
	at com.tencent.angel.client.AngelClient.save(AngelClient.java:381)
	at com.tencent.angel.client.AngelPSClient.save(AngelPSClient.java:146)
	at com.tencent.angel.spark.context.AngelPSContext$.save(AngelPSContext.scala:300)
	at com.tencent.angel.spark.context.AngelPSContext.save(AngelPSContext.scala:258)
	at com.tencent.angel.pytorch.recommendation.RecommendPSModel.savePSModel(RecommendPSModel.scala:234)
	at com.tencent.angel.pytorch.examples.supervised.RecommendationExample$.main(RecommendationExample.scala:88)
	at com.tencent.angel.pytorch.examples.supervised.RecommendationExample.main(RecommendationExample.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

input=hdfs://XXX-1v/home/yarn/pytorch-on-angel/census_148d_train.libsvm.tmp
output=hdfs://XXX-1v/home/user/pytorch-on-angel/${DT}
source /home/work/software/angel/bin/spark-on-angel-env.sh
echo "------------------"
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib
echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
       --conf spark.ps.instances=15 \
       --conf spark.ps.cores=1 \
       --conf spark.ps.jars=$SONA_ANGEL_JARS \
       --conf spark.ps.memory=5g \
       --conf spark.ps.log.level=INFO \
       --archives hdfs://XXX-1v/home/yarn/pytorch-on-angel/torchlib.zip#torch \
       --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:/home/work/software/angel/bin/torch/lib \
       --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:/home/work/software/angel/bin/torch/lib \
       --conf spark.executor.extraLibraryPath=/home/work/software/angel/bin/torch/lib \
       --conf spark.driver.extraLibraryPath=/home/work/software/angel/bin/torch/lib \
       --conf spark.executorEnv.OMP_NUM_THREADS=2 \
       --conf spark.executorEnv.MKL_NUM_THREADS=2 \
       --conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.hadoop.fs.defaultFS=hdfs://XXX-1v/ \
       --jars $SONA_SPARK_JARS \
       --name "deepfm for torch" \
       --driver-memory 5g \
       --num-executors 15 \
       --executor-cores 5 \
       --executor-memory 8g \
       --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
       /home/work/software/angel/bin/pytorch-on-angel-0.2.1.jar.old.jar \
       trainInput:$input batchSize:128 torchModelPath:/home/work/software/angel/bin/deepfm.pt \
       stepSize:0.001 numEpoch:10 testRatio:0.1 \
       angelModelOutputPath:$output mode:yarn-client

fy88fy avatar Jan 13 '21 05:01 fy88fy

The log shows that the error occurred while saving the model. Take a look at the logs on the PS side.

ouyangwen-it avatar Jan 13 '21 15:01 ouyangwen-it

The PS log shows the following errors: image image image image

fy88fy avatar Jan 15 '21 05:01 fy88fy

Look at the log of the specific PS that failed, ParameterServer_0. For how to view it, see this document: https://github.com/Angel-ML/angel/wiki/%E5%B7%A5%E7%A8%8B%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98

ouyangwen-it avatar Jan 15 '21 11:01 ouyangwen-it
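
To reach the PS log from the command line, one option is to pull the application ID out of the spark-submit output and fetch the aggregated container logs with `yarn logs`. This is a minimal sketch, assuming YARN log aggregation is enabled on the cluster and the submit output was saved to a file. `extract_app_id` and `submit.log` are hypothetical names.

```shell
# Read spark-submit output on stdin and print the first YARN application ID.
extract_app_id() {
  grep -o 'application_[0-9][0-9]*_[0-9][0-9]*' | head -n 1
}

# Example (commented out because it needs a cluster):
# app_id=$(extract_app_id < submit.log)
# yarn logs -applicationId "$app_id" > app.log
# grep -n 'SAVE' app.log    # locate the failing save in the container logs
```

The `grep` on `SAVE` is only a starting point taken from the exception text above. The surrounding lines in the ParameterServer_0 container log should carry the underlying cause.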