spark
spark copied to clipboard
[SPARK-33573][SHUFFLE][YARN] Shuffle server side metrics for Push-based shuffle
What changes were proposed in this pull request?
This is one of the patches for SPARK-33235: Push-based Shuffle Improvement Tasks.
Added a class PushMergeMetrics, to collect below metrics from shuffle server side for Push-based shuffle:
- no opportunity responses
- too late responses
- pushed bytes written
- cached block bytes
Why are the changes needed?
This helps to understand the push based shuffle metrics from shuffle server side.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a method verifyMetrics to verify those metrics in existing unit tests.
cc @zhouyejoe @otterc @mridulm @xkrogen. Please take a look, thanks.
Can one of the admins verify this patch?
There were some test failures in YarnShuffleServiceSuite from some of the UTs adde by me in the NodeManager work preserving restart PR. Reran the PR test job to see whether it is consistent failing. @rmcyang Are you able to reproduce the UT failures locally?
@zhouyejoe Hm not really, running those UTs locally looks to me.
minyang@minyang-mn3 spark % build/mvn -Dtest=none -Pyarn -Phadoop-2.10 -Dscala-2.12 -Phive -Phive-2.3 -DwildcardSuites=org.apache.spark.network.yarn.YarnShuffleServiceWithLevelDBBackendSuite test -pl 'resource-managers/yarn'
Using `mvn` from path: /Users/minyang/project/OSS-linkedin/spark/build/apache-maven-3.8.6/bin/mvn
...
[INFO] Skipping execution of surefire because it has already been run for this configuration
[INFO]
[INFO] --- scalatest-maven-plugin:2.1.0:test (test) @ spark-yarn_2.12 ---
[INFO] ScalaTest report directory: /Users/minyang/project/OSS-linkedin/spark/resource-managers/yarn/target/surefire-reports
Discovery starting.
Discovery completed in 670 milliseconds.
Run starting. Expected test count is: 17
YarnShuffleServiceWithLevelDBBackendSuite:
- executor and merged shuffle state kept across NM restart
- removed applications should not be in registered executor file and merged shuffle file
- shuffle service should be robust to corrupt registered executor file
- get correct recovery path
- moving recovery file from NM local dir to recovery path
- service throws error if cannot start
- Consistency in AppPathInfo between in-memory hashmap and the DB
- Finalized merged shuffle are written into DB and cleaned up after application stopped
- Dangling finalized merged partition info in DB will be removed during restart
- Dangling application path or shuffle information in DB will be removed during restart
- Cleanup for former attempts local path info should be triggered in applicationRemoved
- recovery db should not be created if NM recovery is not enabled
- SPARK-31646: metrics should be registered into Node Manager's metrics system
- SPARK-34828: metrics should be registered with configured name
- create default merged shuffle file manager instance
- create remote block push resolver instance
- invalid class name of merge manager will use noop instance
Run completed in 2 seconds, 944 milliseconds.
Total number of tests run: 17
Suites: completed 2, aborted 0
Tests: succeeded 17, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 24.019 s
[INFO] Finished at: 2022-08-30T02:58:38-07:00
[INFO] ------------------------------------------------------------------------
[WARNING] The requested profile "hadoop-2.10" could not be activated because it does not exist.
[WARNING] The requested profile "hive" could not be activated because it does not exist.
[WARNING] The requested profile "hive-2.3" could not be activated because it does not exist.
+CC @otterc, @Ngone51
Can you fix the linter error @rmcyang ?
src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:[1281] (sizes) LineLength: Line is longer than 100 characters (found 104).
Merged to master. Thanks for working on this @rmcyang ! Thanks for the reviews @zhouyejoe, @otterc :-)