
MAPREDUCE-7465. Add support for parallelism in FileOutputCommitter via 'mapreduce.fileoutputcommitter.parallel.threshold'

Open Arnaud-Nauwynck opened this issue 1 year ago • 2 comments

see https://issues.apache.org/jira/browse/MAPREDUCE-7465

When committing a large Hadoop job (for example via Spark) with many partitions, the class FileOutputCommitter processes thousands of directories and files to rename using a single thread. This is a performance issue, caused by long waits on FileSystem storage operations.

I propose that, above a configurable threshold (default=3, configurable via the property 'mapreduce.fileoutputcommitter.parallel.threshold'), the class FileOutputCommitter process the list of files to rename using parallel threads, using the default JVM ExecutorService (ForkJoinPool.commonPool()).
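The idea can be sketched roughly as follows. This is not the patch itself: it is a minimal, self-contained illustration that uses `java.nio.file` in place of the Hadoop `FileSystem` API, and the names `ParallelRenameSketch`, `renameAll`, `renameOne`, `demo`, and `DEFAULT_PARALLEL_THRESHOLD` are hypothetical. Treating "above the threshold" as a strict `>` comparison is also an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Stream;

final class ParallelRenameSketch {

    // Hypothetical constant mirroring the proposed default of 3.
    static final int DEFAULT_PARALLEL_THRESHOLD = 3;

    /** Moves every source into destDir; goes parallel only above the threshold. */
    static void renameAll(List<Path> sources, Path destDir, int threshold) {
        if (sources.size() <= threshold) {
            // Small lists: keep the single-threaded behaviour.
            for (Path src : sources) {
                renameOne(src, destDir);
            }
            return;
        }
        // Large lists: submit one rename task per entry to the common pool.
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (Path src : sources) {
            pending.add(CompletableFuture.runAsync(
                    () -> renameOne(src, destDir), ForkJoinPool.commonPool()));
        }
        try {
            // Wait for all renames; a failed task surfaces here.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture<?>[0])).join();
        } catch (CompletionException e) {
            if (e.getCause() instanceof UncheckedIOException) {
                throw (UncheckedIOException) e.getCause();
            }
            throw e;
        }
    }

    private static void renameOne(Path src, Path destDir) {
        try {
            Files.move(src, destDir.resolve(src.getFileName()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates fileCount temp files, renames them, and returns how many arrived. */
    static int demo(int fileCount) {
        try {
            Path srcDir = Files.createTempDirectory("pending");
            Path destDir = Files.createTempDirectory("committed");
            List<Path> sources = new ArrayList<>();
            for (int i = 0; i < fileCount; i++) {
                sources.add(Files.createFile(srcDir.resolve("part-" + i)));
            }
            renameAll(sources, destDir, DEFAULT_PARALLEL_THRESHOLD);
            try (Stream<Path> moved = Files.list(destDir)) {
                return (int) moved.count();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("moved " + demo(10) + " files");
    }
}
```

Waiting on `CompletableFuture.allOf(...)` before returning keeps the commit step synchronous from the caller's point of view, so only the renames themselves overlap.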

Arnaud-Nauwynck, Dec 23 '23 10:12

:broken_heart: -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:--------|
| +0 :ok: | reexec | 0m 20s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 31m 41s | | trunk passed |
| +1 :green_heart: | compile | 0m 23s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 0m 20s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 0m 21s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 28s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 21s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 18s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 53s | | trunk passed |
| +1 :green_heart: | shadedclient | 19m 33s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 17s | | the patch passed |
| +1 :green_heart: | compile | 0m 18s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 0m 18s | | the patch passed |
| +1 :green_heart: | compile | 0m 18s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 18s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 15s | /results-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt | hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core: The patch generated 18 new + 15 unchanged - 0 fixed = 33 total (was 15) |
| +1 :green_heart: | mvnsite | 0m 18s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 13s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 14s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 0m 52s | | the patch passed |
| +1 :green_heart: | shadedclient | 19m 30s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 :green_heart: | unit | 5m 22s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 :green_heart: | asflicense | 0m 23s | | The patch does not generate ASF License warnings. |
| | | 84m 20s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6378/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6378 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 54d489c0a80f 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 0b18b3bedb9269bc7299cff42499354b95d61314 |
| Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6378/1/testReport/ |
| Max. process+thread count | 1648 (vs. ulimit of 5500) |
| modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6378/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

hadoop-yetus, Dec 23 '23 12:12

@Arnaud-Nauwynck I just put up #6399, which is Rajesh's implementation with my reviews in too. I'm not going to merge that into Hadoop either, because it still has so many problems on cloud storage, especially ABFS throttling. Best to embrace the manifest committer and complain if you hit problems.

steveloughran, Jan 01 '24 20:01

We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open it and ask for a committer to remove the stale tag and review again. Thanks all for your contribution.

github-actions[bot], Oct 06 '25 00:10