tony-core runtime error
I get error messages about tony.TonyClient when running a TonY task on YARN with Hadoop 3.2.2. The error messages are shown below. How should I deal with these errors?
2022-07-26 06:35:41,245 WARN ipc.Client: Exception encountered while connecting to the server
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[KERBEROS]
    at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
    at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
    at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
    at org.apache.hadoop.ipc.Client.call(Client.java:1452)
    at org.apache.hadoop.ipc.Client.call(Client.java:1405)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy17.getTaskInfos(Unknown Source)
    at com.linkedin.tony.rpc.impl.pb.client.TensorFlowClusterPBClientImpl.getTaskInfos(TensorFlowClusterPBClientImpl.java:77)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy18.getTaskInfos(Unknown Source)
    at com.linkedin.tony.rpc.impl.ApplicationRpcClient.getTaskInfos(ApplicationRpcClient.java:82)
    at com.linkedin.tony.TonyClient.updateTaskInfoAndReturn(TonyClient.java:1192)
    at com.linkedin.tony.TonyClient.monitorApplication(TonyClient.java:1046)
    at com.linkedin.tony.TonyClient.run(TonyClient.java:225)
    at com.linkedin.tony.TonyClient.start(TonyClient.java:1293)
    at java.lang.Thread.run(Thread.java:748)
If you submit a TonY app to a secured cluster, the submitting machine must be authenticated, which means a keytab or principal must be provided.
I think you could use this machine to submit a Spark app as a test. If that works, the TonY app can also be submitted to the cluster.
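Alongside the Spark test suggested above, one can check directly whether the submitting user holds a valid Kerberos ticket. The sketch below assumes MIT Kerberos's `klist` is on the PATH (`klist -s` exits with status 0 only when the credentials cache is readable and unexpired). The function name `has_valid_ticket` is hypothetical and is not part of TonY.

```python
import subprocess


def has_valid_ticket(cmd=("klist", "-s")):
    """Return True if `cmd` exits with status 0.

    With the default command, status 0 means the credentials cache
    holds an unexpired Kerberos ticket. A missing binary is treated
    the same as "no ticket".
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False
    return result.returncode == 0
```

If `has_valid_ticket()` returns False, obtaining a ticket with `kinit -kt <keytab> <principal>` before submitting is the usual next step; the keytab path and principal are site-specific and are left as placeholders here.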
Thanks for your reply. The cluster runs Hadoop 3.2.2 with Kerberos, and I ran the Spark example successfully. I tried the mnist-tensorflow example according to the guide, https://github.com/tony-framework/TonY/tree/master/tony-examples/mnist-tensorflow, but it failed. Do I need any other setting or configuration for this task?
Please attach the detailed error log, the CLI command and arguments used to submit, tony.xml, and so on.
cli command:
#!/usr/bin/env bash
# --executes is a relative path inside src/; --python_binary_path is a relative path inside venv.zip.
# The HDFS paths in --task_params can be replaced with your own.
java -cp `hadoop classpath`:/data/tony-dist/tony-cli-0.5.3-uber.jar com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=/data/venv/myvenv.zip \
--src_dir=/data/tony-dist/mnist-tensorflow \
--executes=mnist_distributed.py \
--task_params="--steps 1000 --data_dir /user/test/tony/data --working_dir /user/test/tony/model" \
--conf_file=/data/tony-dist/tony.xml \
--python_binary_path=venv/bin/python
tony.xml,

The error logs are below:
AM Container for appattempt_1657011602166_1367_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2022-08-03 13:41:09.319]Exception from container-launch.
Container id: container_e94_1657011602166_1367_02_000001
Exit code: 1
Exception message: Launch container failed
Shell output: main : command provided 1
main : run as user is test
main : requested yarn user is test
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /data1/yarn/nm/nmPrivate/application_1657011602166_1367/container_e94_1657011602166_1367_02_000001/container_e94_1657011602166_1367_02_000001.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
[2022-08-03 13:41:09.321]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of amstderr.log :
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataOutputStream
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataOutputStream
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more
Is it the same problem as https://github.com/tony-framework/TonY/issues/672?
It looks like the NodeManager machine doesn't have the complete Hadoop environment.
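`org.apache.hadoop.fs.FSDataOutputStream` ships in the `hadoop-common` jar, so one quick check on a NodeManager host is whether that jar appears on the Hadoop classpath. The sketch below is a hypothetical helper (not part of TonY or Hadoop): it takes the text printed by `hadoop classpath --glob` and returns the entries that look like the `hadoop-common` jar. The file-name pattern is an assumption about a standard Apache Hadoop layout; vendor distributions may name the jar differently.

```python
import os


def find_hadoop_common(classpath):
    """Return classpath entries whose file name looks like the hadoop-common jar."""
    matches = []
    for entry in classpath.strip().split(os.pathsep):
        name = os.path.basename(entry)
        # Assumed naming: hadoop-common-<version>.jar; the -tests jar is skipped.
        if (
            name.startswith("hadoop-common-")
            and name.endswith(".jar")
            and not name.endswith("-tests.jar")
        ):
            matches.append(entry)
    return matches
```

An empty result for the classpath captured on a NodeManager host would be consistent with the `NoClassDefFoundError` above, though it does not rule out other causes.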
Got it. I have updated the Hadoop environment, and now it reports a Python error, shown below.

The error: ModuleNotFoundError: No module named 'contextlib'
You should package your pyenv zip on a Linux machine running the same system as the NM. @tonywang-sh
My pyenv package was built on an Ubuntu 18.04 system with Anaconda, according to the guide https://github.com/tony-framework/TonY/tree/master/tony-examples/mnist-tensorflow. Do you have another guide on setting up the environment on the NM machine's system to package this pyenv zip? Thanks.
Conda is also OK. If you want to check whether the env is OK, you could launch it on your local machine.
I used Anaconda to package a virtualenv Python environment and obtained the pyenv zip, but this zip does not work on the worker nodes. Is this the right method?
Can this pyenv be used on your local machine? You'd better pre-check it.
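A pre-check like this can be scripted: run the packaged interpreter and try importing a few modules, including standard-library ones such as `contextlib`. This is a minimal sketch; `missing_modules` is a hypothetical helper and the module list is only an example. A check on the build machine cannot prove that the environment works on the NodeManager, so ideally the same check is run on a worker node against the unpacked zip.

```python
import subprocess


def missing_modules(python_path, modules):
    """Return the names in `modules` that `python_path` fails to import."""
    failed = []
    for name in modules:
        # Each import runs in a fresh interpreter process, so one
        # failing module does not affect the others.
        result = subprocess.run(
            [python_path, "-c", f"import {name}"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failed.append(name)
    return failed
```

For example, `missing_modules("venv/bin/python", ["contextlib", "tensorflow"])` returns an empty list when both imports succeed.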
It worked on the local machine using the "venv/bin/python" command line, but failed on the remote worker node when submitting the task with the TonY script.
I guess this is caused by your local machine's env not being consistent with the NodeManager's.
If the pyenv is packaged with virtualenv or Anaconda, does this Python environment need to be activated on the worker node, for example with the command 'venv/bin/activate', before the task starts on the worker? I didn't find this "activate" operation in the TonY project.
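As background for this question: for environments created by `virtualenv` or `python -m venv`, the `activate` script only adjusts shell variables such as `PATH` and `VIRTUAL_ENV`, so invoking `venv/bin/python` directly normally selects the same environment without activation. What can differ between machines is the interpreter itself: those tools often make `bin/python` a symlink to a Python installed elsewhere on the build machine, and that target may be absent on a worker node. Whether that is the cause of the `contextlib` error in this thread is not confirmed. The sketch below (hypothetical helper name, not part of TonY) reports whether an unpacked environment's interpreter resolves to a path outside the environment directory.

```python
import os


def interpreter_outside_env(env_dir, rel_python="bin/python"):
    """Return the resolved interpreter path if it lies outside `env_dir`.

    Returns None when the interpreter resolves to a path inside the
    environment. This does not check that the resolved file exists.
    """
    env_root = os.path.realpath(env_dir)
    # realpath follows symlinks, so a link to /usr/bin/python3 is exposed here.
    target = os.path.realpath(os.path.join(env_root, rel_python))
    inside = os.path.commonpath([env_root, target]) == env_root
    return None if inside else target
```

Conda environments typically carry their own interpreter binary, so they would be expected to return None here, while a virtualenv built with symlinks would return the path of the system Python it points to.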