tez icon indicating copy to clipboard operation
tez copied to clipboard

TEZ-4285: Drop Commons Codec Dependency

Open belugabehr opened this issue 3 years ago • 3 comments

belugabehr avatar May 18 '21 16:05 belugabehr

:broken_heart: -1 overall

Vote Subsystem Runtime Comment
+0 :ok: reexec 16m 40s Docker mode activated.
_ Prechecks _
+1 :green_heart: dupname 0m 0s No case conflicting files found.
+1 :green_heart: @author 0m 0s The patch does not contain any @author tags.
+1 :green_heart: test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+0 :ok: mvndep 4m 59s Maven dependency ordering for branch
+1 :green_heart: mvninstall 8m 35s master passed
+1 :green_heart: compile 4m 22s master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 :green_heart: compile 3m 53s master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 :green_heart: checkstyle 2m 40s master passed
+1 :green_heart: javadoc 3m 57s master passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 :green_heart: javadoc 3m 21s master passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+0 :ok: spotbugs 5m 55s Used deprecated FindBugs config; considering switching to SpotBugs.
+0 :ok: findbugs 5m 51s root in master has 2 extant findbugs warnings.
_ Patch Compile Tests _
+0 :ok: mvndep 0m 13s Maven dependency ordering for patch
+1 :green_heart: mvninstall 5m 22s the patch passed
+1 :green_heart: compile 4m 21s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 :green_heart: javac 4m 21s the patch passed
+1 :green_heart: compile 3m 53s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 :green_heart: javac 3m 53s the patch passed
+1 :green_heart: checkstyle 2m 16s the patch passed
+1 :green_heart: whitespace 0m 0s The patch has no whitespace issues.
+1 :green_heart: xml 0m 4s The patch has no ill-formed XML file.
+1 :green_heart: javadoc 3m 51s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 :green_heart: javadoc 3m 22s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 :green_heart: findbugs 10m 23s the patch passed
_ Other Tests _
+1 :green_heart: unit 2m 2s tez-api in the patch passed.
+1 :green_heart: unit 5m 6s tez-runtime-library in the patch passed.
+1 :green_heart: unit 4m 15s tez-dag in the patch passed.
-1 :x: unit 57m 12s tez-tests in the patch failed.
-1 :x: unit 69m 25s root in the patch failed.
+1 :green_heart: asflicense 1m 41s The patch does not generate ASF License warnings.
234m 3s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-125/1/artifact/out/Dockerfile
GITHUB PR https://github.com/apache/tez/pull/125
JIRA Issue TEZ-4285
Optional Tests dupname asflicense javac javadoc unit xml compile spotbugs findbugs checkstyle
uname Linux 11f4c517bf83 4.15.0-128-generic #131-Ubuntu SMP Wed Dec 9 06:57:35 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / 01847814d
Default Java Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
unit https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-125/1/artifact/out/patch-unit-tez-tests.txt
unit https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-125/1/artifact/out/patch-unit-root.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-125/1/testReport/
Max. process+thread count 1210 (vs. ulimit of 5500)
modules C: tez-api tez-runtime-library tez-dag tez-tests . U: .
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-125/1/console
versions git=2.25.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

hadoop-yetus avatar May 18 '21 20:05 hadoop-yetus

@jteagles Just wanted to get your temperature on this change. Thanks!

belugabehr avatar Jul 27 '21 15:07 belugabehr

@belugabehr, In general, it could be a good change. Couple of thing on my mind about this.

I have been thinking about replacing apache base64 due to performance after reading http://java-performance.info/base64-encoding-and-decoding-performance/

and seeing native java8 performance.

However, as Hadoop depends directly on commons-codec, thing won't truly rid tez of the transitive dependency, only the direct dependency.

It's really important to switch from apache to java.util.Base64 in the backwards compatible way The org.apache.commons.codec.binary.Base64 encoder says it returns an RFC 2045 compliant encoder. https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html

java.util.Base64.getEncoder() returns an RFC 4648 compliant encoder, which uses the same alphabet an RFC 2045, but without line lengths and line endings. https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html

I haven't checked this with code, but they will need to be exactly the same as, the shuffle handler could be running a different version of tez than the job. So they have to be backwards compatible.

I thought Base64.getMimeEncoder() will give a RFC 2045 compliant encoder in java.util.Base64 (76 line lengths with CRLF endings), but they do not match

  @Test
  public void testCodecEquivalence() throws Exception {
    byte[] bytes = new byte[100];
    String enc1 = Base64.getEncoder().encodeToString(bytes);
    String enc2 = new String(org.apache.commons.codec.binary.Base64.encodeBase64(bytes));
    String enc3 = Base64.getMimeEncoder().encodeToString(bytes);
    System.out.println("enc1 = <" + enc1 + ">");
    System.out.println("enc2 = <" + enc2 + ">");
    System.out.println("enc3 = <" + enc3 + ">");
    System.out.println(enc1.equals(enc2));
    System.out.println(enc1.equals(enc3));
  }
true
false

So perhaps line lengths are skipped with apache and makes it RFC 4648 compatible.

As for the sha384 bit, that's more complicated. I would rather not add more dependency on guava until we are using a shaded version as that library has not shown very good API stability and causes uses issues. That being said, if you look at how the sha384 is being used, it is very much overkill. In DAPAppMaster.isSameFile, two files are being read fully to compute sha384 and then SHAs are compared to see if they are equal. Same for DAG.verifyLocalResources. Perhaps better would be just to walk both file streams and stop when they differ. It will give an earlier exit for failed case. Sha384 seems just an added complication. What do you think?

jteagles avatar Aug 10 '21 20:08 jteagles