PyTorch-On-Angel
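Error: NoClassDefFoundError: com/tencent/angel/spark/ml/graph/params/HasBatchSize
Hello, I have a question. I am testing PyTorch-On-Angel and the job fails on submission. The only matching class I can find in the source code is com.tencent.angel.graph.utils.params.HasBatchSize. What could be causing this?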
param mode = yarn-client
Exception in thread "main" java.lang.NoClassDefFoundError: com/tencent/angel/spark/ml/graph/params/HasBatchSize
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at com.tencent.angel.pytorch.examples.supervised.RecommendationExample$.main(RecommendationExample.scala:53)
at com.tencent.angel.pytorch.examples.supervised.RecommendationExample.main(RecommendationExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.tencent.angel.spark.ml.graph.params.HasBatchSize
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 25 more
Which branch are you using, and which Spark version? Try the latest 0.2.1 branch.
PyTorch-On-Angel is built from master, Spark is 2.4.5U2, and our environment is CPU-only.
Use the 0.2.1 branch, with Angel 3.1.0 as the Angel environment.
Angel is currently also built from master. Is that 3.1.0, or do I have to download branch 3.1.0?
master is fine.
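The stack trace asks for `com/tencent/angel/spark/ml/graph/params/HasBatchSize`, while the source tree contains `com.tencent.angel.graph.utils.params.HasBatchSize`. That pattern is typical of a jar compiled against one Angel version and run against another. A minimal sketch for checking which jars actually contain a class is shown below. The function names are hypothetical, and it assumes `unzip` is installed (jar files are ordinary zip archives).

```shell
#!/bin/sh
# Convert a fully qualified class name into its entry path inside a jar.
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print every jar (all arguments after the class name) that contains the class.
# Assumption: `unzip` is available on the submitting machine.
find_class_in_jars() {
  entry=$(class_to_entry "$1")
  shift
  for jar in "$@"; do
    # -F treats the entry path as a fixed string rather than a regex.
    if unzip -l "$jar" 2>/dev/null | grep -qF "$entry"; then
      printf '%s\n' "$jar"
    fi
  done
}
```

For example, `find_class_in_jars com.tencent.angel.spark.ml.graph.params.HasBatchSize $(echo "$SONA_SPARK_JARS" | tr ',' ' ')` would list the jars that ship the old package path, assuming `SONA_SPARK_JARS` is a comma-separated list as `--jars` expects. An empty result suggests the Angel jars and the PyTorch-On-Angel jar come from incompatible versions.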
The ClassNotFound problem is solved. Now I have run into a new one:
Exception in thread "main" java.lang.UnsatisfiedLinkError: no torch_angel in java.library.path
The parameters I passed are as follows:
input=hdfs://jinrong-hadoop3-1v/home/hdp-jinrong-stargraph/fanqizha/subgraph/input/20191231/
output=hdfs://jinrong-hadoop3-1v/home/hdp-jinrong-stargraph/jiadongxue/angel/model/20191231_deepfm/
source ./spark-on-angel-env.sh
echo "------------------"
#JAVA_LIBRARY_PATH=/home/work/software/java/lib
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib
echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
--conf spark.ps.instances=5 \
--conf spark.ps.cores=1 \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.memory=5g \
--conf spark.ps.log.level=INFO \
--conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
--conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
--conf spark.executor.extraLibraryPath=./torch/lib \
--conf spark.driver.extraLibraryPath=./torch/lib \
--conf spark.executorEnv.OMP_NUM_THREADS=2 \
--conf spark.executorEnv.MKL_NUM_THREADS=2 \
--conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
--conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
--jars $SONA_SPARK_JARS \
--name "deepfm for torch on angel" \
--archives /home/work/software/angel/bin/torchlib.zip#torch \
--files /home/work/software/angel/bin/deepfm.pt \
--driver-memory 5g \
--num-executors 5 \
--executor-cores 1 \
--executor-memory 5g \
--class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
./pytorch-on-angel-0.2.1.jar \
trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
stepSize:0.001 numEpoch:10 testRatio:0.1 \
angelModelOutputPath:$output mode:yarn-client
The error is as follows:
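The directory structure of your unpacked dependency archive probably does not match what you configured. Spark's --archives option takes the archive from HDFS and unpacks it into the executor's working directory, under a directory named after the alias that follows the # sign. The resulting path should be ./torch/(the directory structure inside your archive).
--archives takes an HDFS path.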
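The alias rule described above can be sketched as a pair of small helpers. The function names are hypothetical. The fallback to the archive's file name when no `#alias` is given reflects how Spark on YARN names the link, to the best of my knowledge.

```shell
#!/bin/sh
# Name of the directory an --archives entry is exposed as in the container's
# working directory: the part after '#', or the file name if there is no alias.
archive_link_name() {
  case $1 in
    *'#'*) printf '%s\n' "${1##*"#"}" ;;
    *)     printf '%s\n' "${1##*/}" ;;
  esac
}

# Relative path that java.library.path should point at, given the archive
# spec and the sub-directory that holds the native libraries inside the zip.
archive_lib_dir() {
  printf './%s/%s\n' "$(archive_link_name "$1")" "$2"
}
```

With `torchlib.zip#torch` and a zip whose top-level directory is `lib`, `archive_lib_dir` yields `./torch/lib`, which matches the `-Djava.library.path` value used in the script. If the zip instead unpacks to something like `torchlib/lib`, the configured path would need to become `./torch/torchlib/lib`.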
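Hello, I now have a different problem: the job cannot be submitted to YARN because the deepfm.pt file cannot be found on HDFS. Could you take a look? The script is configured as follows: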
#!/bin/bash
input=hdfs://xxxx-1v/home/yarn/pytorch-on-angel/census_148d_train.libsvm.tmp
output=hdfs://xxxx-1v/home/hdp/jia/angel/model/20191231_louvain/
source ./spark-on-angel-env.sh
echo "------------------"
#JAVA_LIBRARY_PATH=/home/work/software/java/lib
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib
echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
--conf spark.ps.instances=5 \
--conf spark.ps.cores=1 \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.memory=5g \
--conf spark.ps.log.level=INFO \
--archives hdfs://XXXX-hadoop3-1v/home/yarn/pytorch-on-angel/torchlib.zip#torch \
--conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
--conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
--conf spark.executor.extraLibraryPath=./torch/lib \
--conf spark.driver.extraLibraryPath=./torch/lib \
--conf spark.executorEnv.OMP_NUM_THREADS=2 \
--conf spark.executorEnv.MKL_NUM_THREADS=2 \
--conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
--conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
--conf spark.hadoop.fs.defaultFS=hdfs://xxxx-hadoop3-1v/ \
--jars $SONA_SPARK_JARS \
--name "deepfm for torch on angel" \
--files deepfm.pt \
--driver-memory 5g \
--num-executors 5 \
--executor-cores 1 \
--executor-memory 5g \
--class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
./pytorch-on-angel-0.2.1.jar \
trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
stepSize:0.001 numEpoch:10 testRatio:0.1 \
angelModelOutputPath:$output mode:yarn-client
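A backslash followed by a newline is removed entirely by the shell. A continuation with no space before the backslash therefore glues the jar path to the next argument. One way to sidestep continuation mistakes is to build the argument list one step at a time. The sketch below does this with POSIX `set --`. The function name is hypothetical and only a subset of the options from the script is shown.

```shell
#!/bin/sh
# Build the spark-submit arguments without line continuations.
# Prints one argument per line so the result can be inspected before use.
build_submit_args() {
  jar=$1
  input=$2
  output=$3
  set --
  set -- "$@" --name "deepfm for torch on angel"
  set -- "$@" --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample
  set -- "$@" "$jar"
  set -- "$@" "trainInput:$input" "batchSize:128" "torchModelPath:deepfm.pt"
  set -- "$@" "angelModelOutputPath:$output" "mode:yarn-client"
  printf '%s\n' "$@"
}
```

Printing the list first makes it easy to confirm that the jar and `trainInput:` are separate words. In a real script the final `printf` would be replaced by `"$SPARK_HOME/bin/spark-submit" "$@"`.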
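Try submitting with Spark in yarn-cluster mode.
After switching to yarn-cluster, I get the following error:
What is the directory structure of your torchlib.zip after unpacking?
torchlib.zip unpacks to a lib directory, which contains many .a files.
Could you print the current working directory in RecommendationExample and check whether torch/lib exists there?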
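On Linux, `System.loadLibrary("torch_angel")` looks for a shared library named `libtorch_angel.so` in each `java.library.path` entry. Static archives (`.a`) cannot be loaded this way. The sketch below checks a candidate directory. The helper names are hypothetical, and it would be run against the unpacked `torch/lib` directory on a node, or against a local unzip of torchlib.zip.

```shell
#!/bin/sh
# Succeeds if DIR contains the shared library that
# System.loadLibrary(NAME) resolves to on Linux (lib<NAME>.so).
has_native_lib() {
  [ -f "$1/lib$2.so" ]
}

# Count the files in DIR whose names end with SUFFIX (for example .so or .a).
count_by_suffix() {
  n=0
  for f in "$1"/*"$2"; do
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  printf '%s\n' "$n"
}
```

For instance, `has_native_lib ./torch/lib torch_angel || echo "libtorch_angel.so missing"` reports whether the library is where the script expects it. If `count_by_suffix ./torch/lib .so` prints 0 while the `.a` count is large, the archive likely holds only static libraries, which would be consistent with the `UnsatisfiedLinkError` above.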
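Hello, in the end I configured the environment on every node in the cluster, and yarn-client mode now works. However, the following error still shows up occasionally. What could be causing it?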
21/01/13 13:07:37 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:38 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:39 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:40 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:41 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:42 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:43 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:44 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:45 INFO Client: Application report for application_1609301285435_0588 (state: RUNNING)
21/01/13 13:07:45 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.x.121.219
ApplicationMaster RPC port: -1
queue: root.default
start time: 1610514454704
final status: UNDEFINED
tracking URL: http://xxx.net:8888/proxy/application_1609301285435_0588/
user: user-2001
numDataPartitions=7500
numDataPartitions=7500
type: BIAS_WEIGHT_EMBEDDING_MATS name:DeepFM mats_dims: 130,10,1,1,10,5,1,1,5,1,1,1
optimizer: AsyncAdam eta=0.001 decay=0.001
from driver start Angel PS!
AppMaster capability = <memory:2048, vCores:1, gCores:0>
validate_auc=0.8820555586167144 time=12161ms
train_auc=0.8922190567461854 validate_auc=0.8954355298794986 time=3696ms
train_auc=0.9007353906278361 validate_auc=0.9038945708702969 time=2840ms
train_auc=0.9039252388499588 validate_auc=0.9064111964938902 time=2716ms
train_auc=0.9068299665301356 validate_auc=0.9090327256697933 time=2602ms
train_auc=0.9075874962681636 validate_auc=0.909377106018285 time=2575ms
train_auc=0.9093007058840281 validate_auc=0.9113024570743141 time=2535ms
train_auc=0.9108590019247338 validate_auc=0.9131101889959353 time=2603ms
train_auc=0.9103151686457647 validate_auc=0.9114020624674163 time=2616ms
train_auc=0.9114118253731684 validate_auc=0.9126884555230131 time=2597ms
Exception in thread "main" com.tencent.angel.exception.AngelException: Model save falied Detail failed log:
ParameterServer_0:save task com.tencent.angel.model.PSMatricesSaveContext@6fc796e8 failed:java.io.IOException: java.io.IOException: SAVE matrix failed:java.io.IOException: SAVE model partitions failed.null
at com.tencent.angel.model.output.format.MatrixFormatImpl.save(MatrixFormatImpl.java:135)
at com.tencent.angel.ps.io.PSModelIOExecutor.saveMatrix(PSModelIOExecutor.java:237)
at com.tencent.angel.ps.io.PSModelIOExecutor.process(PSModelIOExecutor.java:216)
at com.tencent.angel.ps.io.PSModelIOExecutor.access$000(PSModelIOExecutor.java:39)
at com.tencent.angel.ps.io.PSModelIOExecutor$MatrixDiskIOOp.compute(PSModelIOExecutor.java:195)
at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:154)
at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:121)
at com.tencent.angel.ps.io.save.PSModelSaver.lambda$save$9(PSModelSaver.java:184)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: SAVE matrix failed:java.io.IOException: SAVE model partitions failed.null
at com.tencent.angel.model.output.format.MatrixFormatImpl.save(MatrixFormatImpl.java:135)
at com.tencent.angel.ps.io.PSModelIOExecutor.saveMatrix(PSModelIOExecutor.java:237)
at com.tencent.angel.ps.io.PSModelIOExecutor.process(PSModelIOExecutor.java:216)
at com.tencent.angel.ps.io.PSModelIOExecutor.access$000(PSModelIOExecutor.java:39)
at com.tencent.angel.ps.io.PSModelIOExecutor$MatrixDiskIOOp.compute(PSModelIOExecutor.java:195)
at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:151)
... 3 more
at com.tencent.angel.client.AngelClient.isSaveCompleted(AngelClient.java:670)
at com.tencent.angel.client.AngelClient.save(AngelClient.java:381)
at com.tencent.angel.client.AngelPSClient.save(AngelPSClient.java:146)
at com.tencent.angel.spark.context.AngelPSContext$.save(AngelPSContext.scala:300)
at com.tencent.angel.spark.context.AngelPSContext.save(AngelPSContext.scala:258)
at com.tencent.angel.pytorch.recommendation.RecommendPSModel.savePSModel(RecommendPSModel.scala:234)
at com.tencent.angel.pytorch.examples.supervised.RecommendationExample$.main(RecommendationExample.scala:88)
at com.tencent.angel.pytorch.examples.supervised.RecommendationExample.main(RecommendationExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
input=hdfs://XXX-1v/home/yarn/pytorch-on-angel/census_148d_train.libsvm.tmp
output=hdfs://XXX-1v/home/user/pytorch-on-angel/${DT}
source /home/work/software/angel/bin/spark-on-angel-env.sh
echo "------------------"
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib
echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
--conf spark.ps.instances=15 \
--conf spark.ps.cores=1 \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.memory=5g \
--conf spark.ps.log.level=INFO \
--archives hdfs://XXX-1v/home/yarn/pytorch-on-angel/torchlib.zip#torch \
--conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:/home/work/software/angel/bin/torch/lib \
--conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:/home/work/software/angel/bin/torch/lib \
--conf spark.executor.extraLibraryPath=/home/work/software/angel/bin/torch/lib \
--conf spark.driver.extraLibraryPath=/home/work/software/angel/bin/torch/lib \
--conf spark.executorEnv.OMP_NUM_THREADS=2 \
--conf spark.executorEnv.MKL_NUM_THREADS=2 \
--conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
--conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
--conf spark.hadoop.fs.defaultFS=hdfs://XXX-1v/ \
--jars $SONA_SPARK_JARS \
--name "deepfm for torch" \
--driver-memory 5g \
--num-executors 15 \
--executor-cores 5 \
--executor-memory 8g \
--class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
/home/work/software/angel/bin/pytorch-on-angel-0.2.1.jar.old.jar \
trainInput:$input batchSize:128 torchModelPath:/home/work/software/angel/bin/deepfm.pt \
stepSize:0.001 numEpoch:10 testRatio:0.1 \
angelModelOutputPath:$output mode:yarn-client
The log shows that the error occurred while saving the model. Please check the logs on the PS side.
The PS log shows the following error:
Check the log of the specific PS that failed, ParameterServer_0. For how to view it, see: https://github.com/Angel-ML/angel/wiki/%E5%B7%A5%E7%A8%8B%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98
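Apart from the procedure in the linked wiki page, one common way to get container logs on YARN is the `yarn logs` command. This assumes log aggregation is enabled on the cluster. The sketch below pulls the application ID out of the driver output so it can be passed to that command. The function name and the file names in the comments are hypothetical.

```shell
#!/bin/sh
# Read driver/client log text on stdin and print the first YARN
# application ID found (format: application_<clusterTimestamp>_<sequence>).
extract_app_id() {
  sed -n 's/.*\(application_[0-9][0-9]*_[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example usage (needs the YARN CLI; file names are placeholders):
#   app_id=$(extract_app_id < driver.log)
#   yarn logs -applicationId "$app_id" > app_logs.txt
#   grep -n 'SAVE' app_logs.txt
```

Searching the aggregated output for the `SAVE` messages should lead to the underlying exception on ParameterServer_0, which the driver-side trace only reports as `SAVE model partitions failed.null`.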