[SPARK-40364][CORE] Use the unified `DBProvider#initDB` method
What changes were proposed in this pull request?
There are two `initDB` methods in `DBProvider`:
- `DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, ObjectMapper mapper)`
- `DB initDB(DBBackend dbBackend, File file)`
The first method is used in production code, while the second is used only by `ShuffleTestAccessor` for `YarnShuffleIntegrationSuite`, as follows:
https://github.com/apache/spark/blob/0be27b6735de0edaccbc3c027b2ff9082a180c95/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnShuffleIntegrationSuite.scala#L151-L170
This PR changes the test code to use the first method instead of the second.
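A rough sketch of what the switch looks like in test code; the `StoreVersion` value and the mapper below are assumed stand-ins for whatever the shuffle service actually passes, not the real diff:

```scala
// Hypothetical sketch of the change in ShuffleTestAccessor: call the same
// initDB overload that production code uses. The version constant and the
// mapper are assumed stand-ins, not the actual values from the PR.
import com.fasterxml.jackson.databind.ObjectMapper
import java.io.File
import org.apache.spark.network.shuffledb.{DBBackend, StoreVersion}
import org.apache.spark.network.util.DBProvider

val mapper = new ObjectMapper()
val version = new StoreVersion(1, 0) // assumed version constant

// before: DBProvider.initDB(dbBackend, dbFile)
// after: the unified overload, which also checks the store version
//        and wires up JSON (de)serialization, matching production behavior
def openDb(dbBackend: DBBackend, dbFile: File) =
  DBProvider.initDB(dbBackend, dbFile, version, mapper)
```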
Why are the changes needed?
Code cleanup.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Pass GitHub Actions.
cc @mridulm @tgravescs and @squito for further discussion; the previous thread is here: https://github.com/apache/spark/pull/37648#discussion_r963271711
The GA failure is not related to the current PR; we need to wait for https://github.com/apache/spark/pull/37815
GHCR has recovered, so I just re-triggered the build.
Friendly ping @sunchao, do you know whether `MiniYARNCluster` can start with `YarnConfiguration.NM_RECOVERY_ENABLED = true`? I tried setting this in the YARN config, but `MiniYARNCluster` failed to start:
2022-09-10T11:44:42.1710230Z Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.iq80.leveldb.DBException
2022-09-10T11:44:42.1715234Z at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
2022-09-10T11:44:42.1719347Z at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
2022-09-10T11:44:42.1723090Z at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
2022-09-10T11:44:42.1726759Z at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
2022-09-10T11:44:42.1731028Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313)
2022-09-10T11:44:42.1735424Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370)
2022-09-10T11:44:42.1740303Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
2022-09-10T11:44:42.1745576Z at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597)
2022-09-10T11:44:42.1828858Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
2022-09-10T11:44:42.1829712Z at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
2022-09-10T11:44:42.1830633Z at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327)
2022-09-10T11:44:42.1831431Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
2022-09-10T11:44:42.1832279Z at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:112)
I found that `org.iq80.leveldb.*` is relocated to `org.apache.hadoop.shaded.org.iq80.leveldb` in `NMLeveldbStateStoreService`, but it is not shaded into `hadoop-client-minicluster-3.3.4.jar`.
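One way to check this locally is to list the jar's entries; the jar path below is an assumption, adjust it to your local Maven repository layout:

```shell
# Sketch: verify whether the relocated LevelDB classes made it into the
# minicluster jar. The path is an assumption about the local ~/.m2 layout.
JAR="$HOME/.m2/repository/org/apache/hadoop/hadoop-client-minicluster/3.3.4/hadoop-client-minicluster-3.3.4.jar"
# If nothing matches, the relocated classes are missing from the jar,
# which would explain the ClassNotFoundException above.
unzip -l "$JAR" | grep 'org/apache/hadoop/shaded/org/iq80/leveldb' \
  || echo "relocated leveldb classes not found"
```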
https://github.com/apache/spark/blob/bf5103ae5bb938102850160048e2a1656e648244/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnShuffleIntegrationSuite.scala#L46-L54
Adding `yarnConfig.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true)` to the `newYarnConfig` method above and testing with hadoop-2.7:
build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-2
it then failed as follows:
YarnShuffleIntegrationWithLevelDBBackendSuite:
org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite *** ABORTED ***
java.lang.IllegalArgumentException: Cannot support recovery with an ephemeral server port. Check the setting of yarn.nodemanager.address
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
...
Run completed in 3 seconds, 992 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 1
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
*** 1 SUITE ABORTED ***
It seems that we need to use a static port to support testing recovery.
So I think `assert(!ShuffleTestAccessor.reloadRegisteredExecutors(dbBackend, execStateCopy).isEmpty)` and the relevant code are currently unreachable... @tgravescs @mridulm
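One way to satisfy the "ephemeral server port" check would be to pin `yarn.nodemanager.address` to a fixed port in the test config. This is only a sketch: the port value is an assumption, and a fixed port can of course collide when suites run in parallel on busy CI hosts:

```scala
// Hypothetical sketch: pinning the NodeManager address so that NM recovery
// does not reject an ephemeral port.
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnConfig = new YarnConfiguration()
yarnConfig.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true)
// Port 0 means "ephemeral", which recovery refuses with
// IllegalArgumentException; a fixed port (assumed value here) avoids that,
// at the cost of possible port clashes between concurrent test runs.
yarnConfig.set(YarnConfiguration.NM_ADDRESS, "0.0.0.0:28080")
```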
Thanks for the ping @LuciferYang , I'll take a look at the shading issue in Hadoop 3.3.4.
thanks @sunchao
I am unsure of this test - will rely on @tgravescs to comment better.
@tgravescs I tried adding `yarnConfig.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true)` to make `registeredExecFile` non-null and re-enable the test that verifies `execStateCopy`'s reload, but it failed due to the limitation of `MiniYARNCluster` itself. Do we have any other way to make the `execStateCopy` reload check work again?
The only other way I can think of is to have recovery enabled when a test config is set (spark.testing.XXXX). If that looks like a hassle then I would say we just ignore the test as one of those things that we can't unit test easily
> The only other way I can think of is to have recovery enabled when a test config is set (spark.testing.XXXX). If that looks like a hassle then I would say we just ignore the test as one of those things that we can't unit test easily

Let me try spark.testing.XXXX
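A sketch of what such a test-config gate might look like; `spark.testing.XXXX` is kept as the placeholder name from the comment above rather than guessing a real config key:

```scala
// Hypothetical sketch: enable NM recovery only when a testing flag is set.
// "spark.testing.XXXX" is the placeholder from the discussion, not a real
// Spark config key.
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnConfig = new YarnConfiguration()
if (sys.props.get("spark.testing.XXXX").contains("true")) {
  // Only test runs that opt in pay the recovery-setup cost (and need a
  // non-ephemeral NM port); regular runs are unaffected.
  yarnConfig.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true)
}
```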
> The only other way I can think of is to have recovery enabled when a test config is set (spark.testing.XXXX). If that looks like a hassle then I would say we just ignore the test as one of those things that we can't unit test easily

Sorry for missing this comment