
[SPARK-40364][CORE] Use the unified `DBProvider#initDB` method

Open LuciferYang opened this issue 2 years ago • 13 comments

What changes were proposed in this pull request?

There are two `initDB` methods in `DBProvider`:

  1. DB initDB(DBBackend dbBackend, File dbFile, StoreVersion version, ObjectMapper mapper)
  2. DB initDB(DBBackend dbBackend, File file)

The first method is used in production code, while the second is only used by ShuffleTestAccessor for YarnShuffleIntegrationSuite, as follows:

https://github.com/apache/spark/blob/0be27b6735de0edaccbc3c027b2ff9082a180c95/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnShuffleIntegrationSuite.scala#L151-L170

This PR changes the test code to use the first method instead of the second.
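The consolidation can be illustrated with a simplified sketch. All types below are stand-in stubs, not Spark's real classes; only the shape of the unified `initDB` signature mirrors the PR.

```java
import java.io.File;

// Simplified sketch of the consolidation; every type here is a stub.
class UnifiedInitDbDemo {
    enum DBBackend { LEVELDB }
    record StoreVersion(int major, int minor) {}
    static class ObjectMapper {} // stand-in for Jackson's ObjectMapper
    record DB(DBBackend backend, File file, StoreVersion version) {}

    static class DBProvider {
        // The single, unified entry point (mirrors overload 1); the reduced
        // two-argument initDB(dbBackend, file) overload goes away.
        static DB initDB(DBBackend dbBackend, File dbFile,
                         StoreVersion version, ObjectMapper mapper) {
            return new DB(dbBackend, dbFile, version);
        }
    }

    public static void main(String[] args) {
        // Callers such as the test accessor now pass the version and mapper
        // explicitly instead of using the reduced overload.
        DB db = DBProvider.initDB(DBBackend.LEVELDB,
                new File("registeredExecutors.ldb"),
                new StoreVersion(1, 0), new ObjectMapper());
        System.out.println(db.version()); // prints StoreVersion[major=1, minor=0]
    }
}
```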

Why are the changes needed?

Code clean up.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass GitHub Actions.

LuciferYang avatar Sep 08 '22 04:09 LuciferYang

cc @mridulm @tgravescs and @squito for further discussion, the previous thread is here https://github.com/apache/spark/pull/37648#discussion_r963271711

LuciferYang avatar Sep 08 '22 04:09 LuciferYang

The GA failure is not related to the current PR; we need to wait for https://github.com/apache/spark/pull/37815

LuciferYang avatar Sep 08 '22 05:09 LuciferYang

GHCR has recovered, so I just re-triggered the build.

Yikun avatar Sep 08 '22 07:09 Yikun

friendly ping @sunchao, do you know whether MiniYARNCluster can start with YarnConfiguration.NM_RECOVERY_ENABLED = true? I tried setting this in the Yarn config, but MiniYARNCluster failed to start:

2022-09-10T11:44:42.1710230Z Cause: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.org.iq80.leveldb.DBException
2022-09-10T11:44:42.1715234Z at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
2022-09-10T11:44:42.1719347Z at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
2022-09-10T11:44:42.1723090Z at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
2022-09-10T11:44:42.1726759Z at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
2022-09-10T11:44:42.1731028Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313)
2022-09-10T11:44:42.1735424Z at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370)
2022-09-10T11:44:42.1740303Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
2022-09-10T11:44:42.1745576Z at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597)
2022-09-10T11:44:42.1828858Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
2022-09-10T11:44:42.1829712Z at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
2022-09-10T11:44:42.1830633Z at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327)
2022-09-10T11:44:42.1831431Z at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
2022-09-10T11:44:42.1832279Z at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:112)

I found that org.iq80.leveldb.* is relocated to org.apache.hadoop.shaded.org.iq80.leveldb in NMLeveldbStateStoreService, but it is not shaded into hadoop-client-minicluster-3.3.4.jar.

LuciferYang avatar Sep 10 '22 13:09 LuciferYang

https://github.com/apache/spark/blob/bf5103ae5bb938102850160048e2a1656e648244/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnShuffleIntegrationSuite.scala#L46-L54

Adding yarnConfig.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true) to the newYarnConfig method above, and testing with hadoop-2.7:
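In Java terms, the change I tried amounts to the following sketch (the real suite method is Scala; YarnConfiguration comes from hadoop-yarn-api, so this fragment is not runnable standalone):

```java
// Sketch only: enable NM recovery in the suite's Yarn config so that
// registeredExecFile becomes non-null.
YarnConfiguration yarnConfig = new YarnConfiguration();
// ... existing shuffle-service configuration elided ...
yarnConfig.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true); // the added line
```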

build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite -Phadoop-2

then the suite fails as follows:

YarnShuffleIntegrationWithLevelDBBackendSuite:
org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite *** ABORTED ***
  java.lang.IllegalArgumentException: Cannot support recovery with an ephemeral server port. Check the setting of yarn.nodemanager.address
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
  at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
  ...
Run completed in 3 seconds, 992 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 1
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
*** 1 SUITE ABORTED ***

It seems that we need to use a static port for yarn.nodemanager.address to test recovery.
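A sketch of what pinning the port might look like (Java syntax; YarnConfiguration.NM_ADDRESS is Hadoop's yarn.nodemanager.address key, port 8041 is an arbitrary choice here, and fixed ports risk conflicts on busy CI hosts; not runnable standalone):

```java
// Sketch only: give the NodeManager a fixed (non-ephemeral) port so that
// ContainerManagerImpl accepts recovery being enabled.
yarnConfig.set(YarnConfiguration.NM_ADDRESS, "127.0.0.1:8041");
yarnConfig.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true);
```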

LuciferYang avatar Sep 10 '22 16:09 LuciferYang

So I think the assert(!ShuffleTestAccessor.reloadRegisteredExecutors(dbBackend, execStateCopy).isEmpty) check and the related code are currently unreachable... @tgravescs @mridulm

LuciferYang avatar Sep 10 '22 16:09 LuciferYang

Thanks for the ping @LuciferYang , I'll take a look at the shading issue in Hadoop 3.3.4.

sunchao avatar Sep 10 '22 16:09 sunchao

thanks @sunchao

LuciferYang avatar Sep 10 '22 16:09 LuciferYang

I am unsure about this test; I will rely on @tgravescs to comment.

mridulm avatar Sep 12 '22 07:09 mridulm

@tgravescs I tried adding yarnConfig.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true) to make registeredExecFile non-null and re-enable the test that verifies the reload of execStateCopy, but it failed due to a limitation of MiniYARNCluster itself. Do we have any other way to make the execStateCopy reload check work again?

LuciferYang avatar Sep 12 '22 08:09 LuciferYang

The only other way I can think of is to have recovery enabled when a test config is set (spark.testing.XXXX). If that looks like a hassle, then I would say we just ignore the test as one of those things that we can't easily unit test.
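A minimal sketch of this gating idea (illustrative only; the property name spark.testing.nm.recovery is invented here and is not a real Spark config):

```java
import java.util.Map;

class NmRecoveryGate {
    // Invented test-only property name, not a real Spark config.
    static final String TEST_FLAG = "spark.testing.nm.recovery";

    // Decide whether the suite should enable NM recovery.
    static boolean nmRecoveryEnabled(Map<String, String> props) {
        return Boolean.parseBoolean(props.getOrDefault(TEST_FLAG, "false"));
    }

    public static void main(String[] args) {
        // Inside newYarnConfig(), the suite would then do roughly:
        //   if (nmRecoveryEnabled(...))
        //       yarnConfig.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true);
        System.out.println(nmRecoveryEnabled(Map.of(TEST_FLAG, "true"))); // prints true
    }
}
```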

tgravescs avatar Sep 12 '22 13:09 tgravescs

The only other way I can think of is to have recovery enabled when a test config is set (spark.testing.XXXX). If that looks like a hassle then I would say we just ignore the test as one of those things that we can't unit test easily

Let me try spark.testing.XXXX

LuciferYang avatar Sep 17 '22 12:09 LuciferYang

The only other way I can think of is to have recovery enabled when a test config is set (spark.testing.XXXX). If that looks like a hassle then I would say we just ignore the test as one of those things that we can't unit test easily

Sorry for missing this comment.

LuciferYang avatar Sep 17 '22 12:09 LuciferYang