TEZ-4440. When tez app run in yarn fed cluster, may throw NPE

Open zhengchenyu opened this issue 3 years ago • 1 comments

https://issues.apache.org/jira/browse/TEZ-4440

zhengchenyu avatar Aug 03 '22 09:08 zhengchenyu

:broken_heart: -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|--------:|:--------|
| +0 :ok: | reexec | 32m 42s | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | No case conflicting files found. |
| +1 :green_heart: | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | _ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 15m 0s | master passed |
| +1 :green_heart: | compile | 0m 59s | master passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 0m 55s | master passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 32s | master passed |
| +1 :green_heart: | javadoc | 1m 3s | master passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 0m 53s | master passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +0 :ok: | spotbugs | 1m 53s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 :green_heart: | findbugs | 1m 51s | master passed |
| | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 27s | the patch passed |
| +1 :green_heart: | compile | 0m 30s | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 0m 30s | the patch passed |
| +1 :green_heart: | compile | 0m 27s | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 0m 27s | the patch passed |
| +1 :green_heart: | checkstyle | 0m 25s | the patch passed |
| +1 :green_heart: | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 :green_heart: | javadoc | 0m 24s | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 0m 23s | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | findbugs | 1m 10s | the patch passed |
| | | | _ Other Tests _ |
| +1 :green_heart: | unit | 5m 23s | tez-dag in the patch passed. |
| +1 :green_heart: | asflicense | 0m 16s | The patch does not generate ASF License warnings. |
| | | 65m 20s | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-235/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/tez/pull/235 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux b1c11f8cf5dd 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / 621a83152 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-235/1/testReport/ |
| Max. process+thread count | 228 (vs. ulimit of 5500) |
| modules | C: tez-dag U: tez-dag |
| Console output | https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-235/1/console |
| versions | git=2.25.1 maven=3.6.3 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

tez-yetus avatar Aug 03 '22 10:08 tez-yetus

thanks for this patch @zhengchenyu!

can you add a unit test to TestTaskScheduler which confirms that a TaskScheduler returns Resource(0,0) even if the RM client returned null?

I'm not familiar with yarn federation, but defaulting to Resource(0,0) makes sense in edge cases. Can you please clarify whether this is specific to yarn federation or can happen without it too? (It has never been reported before.) Why does it return null? Does it reflect the state of a specific RM or the whole cluster of RMs?

abstractdog avatar Aug 19 '22 09:08 abstractdog

It happens only in yarn federation and will never happen without it. In fact, YARN-8933 has already fixed it: after applying YARN-8933, it no longer happens in yarn federation either. I don't know whether it is necessary to continue with this patch, because it is not a problem for the latest hadoop version but is still a problem for some popular versions (for example, hadoop-3.2.1). If you think it is necessary, I will add some unit tests. If you think it is not necessary, I will close it.

As for why it returns null in yarn federation:

That is a separate yarn issue. The yarn router uses async threads to connect to the RMs. When all downstream resourcemanagers time out, the yarn router may return null. After YARN-8933, it returns Resource(0,0) instead.

zhengchenyu avatar Aug 19 '22 10:08 zhengchenyu

thanks @zhengchenyu, after reading YARN-8933 this definitely makes sense. I don't insist on adding a unit test, as we're "fixing" a yarn issue here that is no longer present after YARN-8933.

abstractdog avatar Aug 19 '22 11:08 abstractdog
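
For readers following the thread, the defensive default discussed above can be sketched roughly as follows. This is not the actual patch: `Res` and `RmClient` are hypothetical stand-ins for YARN's `Resource` record and the RM client, used only so the example runs without Hadoop on the classpath, and `availableOrZero` is an illustrative name.

```java
import java.util.Objects;

public class HeadroomDefaultSketch {

    /** Hypothetical stand-in for a YARN Resource (memory in MB, vcores). */
    public static final class Res {
        final long memoryMb;
        final int vcores;

        public Res(long memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Res)) {
                return false;
            }
            Res other = (Res) o;
            return memoryMb == other.memoryMb && vcores == other.vcores;
        }

        @Override
        public int hashCode() {
            return Objects.hash(memoryMb, vcores);
        }

        @Override
        public String toString() {
            return "Res(" + memoryMb + "," + vcores + ")";
        }
    }

    /** Hypothetical stand-in for the RM client; may return null, as described for a timed-out router. */
    public interface RmClient {
        Res getAvailableResources();
    }

    /** Returns the reported headroom, or Res(0,0) when the client reports nothing. */
    public static Res availableOrZero(RmClient client) {
        Res reported = client.getAvailableResources();
        return reported != null ? reported : new Res(0, 0);
    }

    public static void main(String[] args) {
        RmClient timedOutRouter = () -> null;           // simulates the null case from the thread
        RmClient healthyRm = () -> new Res(4096, 4);    // simulates a normal response

        System.out.println(availableOrZero(timedOutRouter));
        System.out.println(availableOrZero(healthyRm));
    }
}
```

A unit test of the kind requested earlier in the thread would assert the same two cases: a null response maps to a zero resource, and a non-null response is passed through unchanged.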