tony-core runtime error

Open tonywang-sh opened this issue 3 years ago • 20 comments

There are error messages about tony.TonyClient when running a TonY task on YARN with Hadoop 3.2.2. The error messages are shown below. How should I deal with these errors?

2022-07-26 06:35:41,245 WARN ipc.Client: Exception encountered while connecting to the server org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
    at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
    at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
    at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
    at org.apache.hadoop.ipc.Client.call(Client.java:1452)
    at org.apache.hadoop.ipc.Client.call(Client.java:1405)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy17.getTaskInfos(Unknown Source)
    at com.linkedin.tony.rpc.impl.pb.client.TensorFlowClusterPBClientImpl.getTaskInfos(TensorFlowClusterPBClientImpl.java:77)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy18.getTaskInfos(Unknown Source)
    at com.linkedin.tony.rpc.impl.ApplicationRpcClient.getTaskInfos(ApplicationRpcClient.java:82)
    at com.linkedin.tony.TonyClient.updateTaskInfoAndReturn(TonyClient.java:1192)
    at com.linkedin.tony.TonyClient.monitorApplication(TonyClient.java:1046)
    at com.linkedin.tony.TonyClient.run(TonyClient.java:225)
    at com.linkedin.tony.TonyClient.start(TonyClient.java:1293)
    at java.lang.Thread.run(Thread.java:748)

tonywang-sh avatar Jul 26 '22 07:07 tonywang-sh

If you submit a TonY app to a secured cluster, the submitting machine must be authenticated, which means a keytab or principal must be provided.

You could try submitting a Spark app from this machine as a test. If that works, the TonY app should also be submittable to the cluster.
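
A quick way to confirm that the submitting machine holds a usable Kerberos ticket is sketched below. It is only a sketch: it assumes MIT Kerberos, where `klist -s` prints nothing and exits with status 0 only if the credential cache is readable and unexpired. The function name `has_valid_ticket` is made up for illustration and is not part of TonY or Hadoop.

```python
import subprocess


def has_valid_ticket(klist_cmd=("klist", "-s")):
    """Return True if the given command exits with status 0.

    Assumption: with MIT Kerberos, `klist -s` exits 0 only when the
    credential cache can be read and the ticket has not expired.
    """
    try:
        result = subprocess.run(
            list(klist_cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # klist is not installed or not on PATH.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if has_valid_ticket():
        print("Valid Kerberos ticket found.")
    else:
        print("No valid Kerberos ticket; run kinit (for example with a keytab) first.")
```

If the check fails, obtaining a ticket with `kinit` before running the submit command is the usual next step.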

zuston avatar Jul 29 '22 02:07 zuston

Thanks for your reply. The cluster is Hadoop 3.2.2 with Kerberos, and I ran the Spark example successfully. I tried the mnist-tensorflow example according to the guide, https://github.com/tony-framework/TonY/tree/master/tony-examples/mnist-tensorflow, but it failed. Do I need any other settings or configuration for this task?

tonywang-sh avatar Jul 29 '22 09:07 tonywang-sh

Please attach the detailed error log, the CLI submit command with its arguments, your tony.xml, and so on.

zuston avatar Jul 29 '22 10:07 zuston

cli command:
#!/usr/bin/env bash
java -cp `hadoop classpath`:/data/tony-dist/tony-cli-0.5.3-uber.jar com.linkedin.tony.cli.ClusterSubmitter \
  --python_venv=/data/venv/myvenv.zip \
  --src_dir=/data/tony-dist/mnist-tensorflow \
  --executes=mnist_distributed.py \ # relative path inside src/
  --task_params="--steps 1000 --data_dir /user/test/tony/data --working_dir /user/test/tony/model" \ # You can use your HDFS path here.
  --conf_file=/data/tony-dist/tony.xml \
  --python_binary_path=venv/bin/python # relative path inside venv.zip

tony.xml, image

The error logs are as follows:
AM Container for appattempt_1657011602166_1367_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2022-08-03 13:41:09.319]Exception from container-launch.
Container id: container_e94_1657011602166_1367_02_000001
Exit code: 1
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is test
main : requested yarn user is test
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /data1/yarn/nm/nmPrivate/application_1657011602166_1367/container_e94_1657011602166_1367_02_000001/container_e94_1657011602166_1367_02_000001.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
[2022-08-03 13:41:09.321]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of amstderr.log :
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataOutputStream
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataOutputStream
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more

tonywang-sh avatar Aug 03 '22 06:08 tonywang-sh

Is this the same problem as https://github.com/tony-framework/TonY/issues/672?

It looks like the NodeManager machine doesn't have a complete Hadoop environment.
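
One way to test that guess on a NodeManager host is to search the Hadoop classpath for the class named in the log, org.apache.hadoop.fs.FSDataOutputStream. The sketch below assumes the classpath string comes from running `hadoop classpath` on that host; the helper name `find_class_in_classpath` is made up for illustration.

```python
import glob
import os
import sys
import zipfile


def find_class_in_classpath(classpath, class_name):
    """Return the classpath locations (jars or directories) that contain class_name."""
    entry_name = class_name.replace(".", "/") + ".class"
    hits = []
    for entry in classpath.split(os.pathsep):
        if not entry:
            continue
        if entry.endswith("*"):
            # A Java wildcard entry such as /opt/hadoop/lib/* means every jar in that directory.
            jars = sorted(glob.glob(os.path.join(os.path.dirname(entry), "*.jar")))
        elif os.path.isdir(entry):
            # A directory entry holds unpacked .class files.
            if os.path.isfile(os.path.join(entry, entry_name)):
                hits.append(entry)
            continue
        else:
            jars = [entry]
        for jar in jars:
            # is_zipfile returns False for missing or non-zip files.
            if zipfile.is_zipfile(jar):
                with zipfile.ZipFile(jar) as zf:
                    if entry_name in zf.namelist():
                        hits.append(jar)
    return hits


if __name__ == "__main__":
    # Pass the classpath as the first argument, e.g. "$(hadoop classpath)".
    classpath = sys.argv[1] if len(sys.argv) > 1 else os.environ.get("CLASSPATH", "")
    found = find_class_in_classpath(classpath, "org.apache.hadoop.fs.FSDataOutputStream")
    print(found if found else "Class not found on this classpath.")
```

An empty result on a NodeManager host would be consistent with the NoClassDefFoundError in the AM log, although it does not prove that the container uses the same classpath.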

zuston avatar Aug 08 '22 02:08 zuston

Got it. I have updated the Hadoop environment, and now it reports a Python error, shown below. [image]

The error: ModuleNotFoundError: No module named 'contextlib'

tonywang-sh avatar Aug 08 '22 09:08 tonywang-sh

You should package your pyenv zip on a Linux machine running the same system as the NM (NodeManager). @tonywang-sh

zuston avatar Aug 08 '22 10:08 zuston

My pyenv was packaged on an Ubuntu 18.04 system with Anaconda, following the guide https://github.com/tony-framework/TonY/tree/master/tony-examples/mnist-tensorflow. Is there another guide on setting up an environment that matches the NM machine's system in order to package this pyenv zip? Thanks.

tonywang-sh avatar Aug 09 '22 01:08 tonywang-sh

Conda is also OK. If you want to check whether the env is OK, you could launch it on your local machine.

zuston avatar Aug 09 '22 02:08 zuston

I used Anaconda to package a virtualenv Python environment and obtained the pyenv zip, but this zip does not work on the worker nodes. Is this the right method?

tonywang-sh avatar Aug 09 '22 02:08 tonywang-sh

Can this pyenv be used on your local machine? You'd better pre-check it.

zuston avatar Aug 09 '22 04:08 zuston

It worked on the local machine using the "venv/bin/python" command line, but failed on the remote worker node when the task was submitted with the TonY script.

tonywang-sh avatar Aug 09 '22 05:08 tonywang-sh

I guess this is caused by your local machine's environment not being consistent with the NodeManager's.
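
One possible form of such an inconsistency can be checked from the zip alone. Environments created with virtualenv or venv typically do not bundle the Python standard library and rely on a base interpreter installed on the machine, whereas a conda environment usually carries its own copy. The sketch below checks whether the archive contains the interpreter and a standard-library module such as contextlib. The default path venv/bin/python is taken from the submit command in this thread; the helper name `check_env_zip` is made up for illustration, and a missing module is only a hint, not a confirmed diagnosis.

```python
import fnmatch
import zipfile


def check_env_zip(zip_path, python_rel_path="venv/bin/python", module="contextlib"):
    """Return (interpreter_present, stdlib_paths) for a packaged environment zip.

    stdlib_paths lists archive entries that look like lib/pythonX.Y/<module>.py
    outside site-packages. An empty list suggests the archive depends on a
    Python installation already present on the worker node.
    """
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    has_interpreter = python_rel_path in names
    pattern = "*lib/python*/" + module + ".py"
    stdlib_paths = [
        name
        for name in names
        if fnmatch.fnmatchcase(name, pattern) and "site-packages" not in name
    ]
    return has_interpreter, stdlib_paths


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        present, paths = check_env_zip(sys.argv[1])
        print("interpreter in zip:", present)
        print("stdlib copies of contextlib:", paths)
```

Running it against the zip passed to --python_venv shows whether the archive is self-contained or depends on a Python installation being present at the same location on every worker.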

zuston avatar Aug 10 '22 07:08 zuston

If the pyenv is packaged by virtualenv or Anaconda, does the Python environment need to be activated on the worker node, for example with the command 'venv/bin/activate', before the task starts on the worker? I didn't find this "activate" operation in the TonY project.
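
Independently of what TonY does, calling an environment's interpreter by its path is generally enough for a venv-style environment, because the activate script mainly adjusts PATH and a few shell variables. The sketch below illustrates this with the standard-library venv module: it creates a throwaway environment and asks its interpreter for sys.prefix without activating anything. The function name `prefix_without_activation` is made up for illustration. This only holds where the base interpreter the environment points to also exists, and some conda packages may still expect activation.

```python
import os
import subprocess
import tempfile
import venv


def prefix_without_activation(env_dir):
    """Create a venv in env_dir and return sys.prefix as reported by its interpreter."""
    # Use symlinks on POSIX, which matches the default of `python -m venv` there.
    venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    python = os.path.join(env_dir, bin_dir, "python")
    # No activate step: the interpreter is invoked directly by path.
    result = subprocess.run(
        [python, "-c", "import sys; print(sys.prefix)"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = os.path.join(tmp, "venv")
        print("sys.prefix seen by the env's interpreter:", prefix_without_activation(env_dir))
```

The reported prefix is the environment directory itself, so packages installed there are importable even though no activate script was sourced.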

tonywang-sh avatar Aug 10 '22 07:08 tonywang-sh